Control Reasoning Effort (Chat Completions)

Interface Description

Given a prompt, the model returns one or more predicted completions. This endpoint additionally supports controlling the model's reasoning depth/effort via the reasoning_effort parameter.

Request Specifications

  • Method: POST
  • Endpoint: https://api.codingplanx.ai/v1/chat/completions

Request Headers

| Parameter | Required | Type | Example | Description |
| --- | --- | --- | --- | --- |
| Content-Type | Yes | string | application/json | Data format |
| Accept | Yes | string | application/json | Response format |
| Authorization | No | string | Bearer {{YOUR_API_KEY}} | Authentication credentials |

Request Body

Data Format: application/json

| Parameter | Required | Type | Description |
| --- | --- | --- | --- |
| model | Yes | string | ID of the model to use (e.g., o4-mini). |
| messages | Yes | array | A list of messages comprising the conversation so far. Each message includes a role (e.g., user/assistant) and content. |
| tools | No | array | A list of tools (functions) the model may call. Used to provide functions for which the model can generate JSON inputs. |
| tool_choice | No | object | Controls which function is called by the model. none means no call; auto means automatic selection. |
| reasoning_effort | No | string | Core parameter: controls the level of effort (depth of thought) the reasoning model spends on the reply. Common values: low, medium, high. |
| temperature | No | number | Sampling temperature between 0 and 2. Higher values (e.g., 0.8) make the output more random; lower values (e.g., 0.2) make it more focused and deterministic. Modify either this or top_p, not both. |
| top_p | No | number | Nucleus sampling. 0.1 means only tokens comprising the top 10% probability mass are considered. Modify either this or temperature, not both. |
| n | No | integer | Defaults to 1. How many chat completion choices to generate for each input message. |
| stream | No | boolean | Defaults to false. If true, partial message deltas are sent as Server-Sent Events (SSE), terminated by data: [DONE]. |
| stop | No | string | Up to 4 sequences where the API will stop generating further tokens. |
| max_tokens | No | integer | The maximum number of tokens to generate in the completion. Total length is limited by the model's context length. |
| presence_penalty | No | number | Number between -2.0 and 2.0. Positive values penalize tokens that have already appeared, increasing the likelihood of talking about new topics. |
| frequency_penalty | No | number | Number between -2.0 and 2.0. Positive values penalize tokens based on their existing frequency, decreasing the likelihood of repeating the same lines. |
| logit_bias | No | object | A JSON object mapping token IDs to bias values (-100 to 100) that modify the likelihood of the specified tokens appearing. |
| user | No | string | A unique identifier representing your end-user, helping to monitor and detect abuse. |
| response_format | No | object | Specifies the format the model must output. For example, {"type": "json_object"} enables JSON mode. |
| seed | No | integer | Beta feature. If specified, the system will make a best effort to sample deterministically, so that repeated requests with the same parameters return the same result. |

Request Example

{
  "model": "o4-mini",
  "max_tokens": 500,
  "messages": [
    {
      "role": "user",
      "content": "Hello, please explain the principles of quantum computing in detail."
    }
  ],
  "temperature": 1.0,
  "stream": false,
  "reasoning_effort": "medium"
}
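A minimal Python sketch of sending the request above. The endpoint URL, headers, and payload come from this page; the use of the standard-library urllib client is an assumption (any HTTP client works), and YOUR_API_KEY is a placeholder:

```python
import json
import urllib.request

API_KEY = "YOUR_API_KEY"  # placeholder; substitute your real key

# Request body exactly as in the example above.
payload = {
    "model": "o4-mini",
    "max_tokens": 500,
    "messages": [
        {"role": "user",
         "content": "Hello, please explain the principles of quantum computing in detail."}
    ],
    "temperature": 1.0,
    "stream": False,
    "reasoning_effort": "medium",
}

req = urllib.request.Request(
    "https://api.codingplanx.ai/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Accept": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    },
    method="POST",
)

# Uncomment to actually send the request:
# with urllib.request.urlopen(req) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
```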

Response Specifications

Response Body Parameters

Data Format: application/json

| Parameter | Type | Description |
| --- | --- | --- |
| id | string | A unique identifier for the request. |
| object | string | The object type, usually chat.completion. |
| created | integer | The Unix timestamp of when the completion was created. |
| choices | array | A list of completion choices. |
| choices[].index | integer | The index of the choice in the list. |
| choices[].message | object | The message generated by the model. Contains role and content. |
| choices[].finish_reason | string | The reason the model stopped generating (e.g., stop, length). |
| usage | object | Token usage statistics. |
| usage.prompt_tokens | integer | Number of tokens in the prompt. |
| usage.completion_tokens | integer | Number of tokens in the generated completion. |
| usage.total_tokens | integer | Total number of tokens used in the request. |

Response Example (HTTP 200 OK)

{
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "created": 1677652288,
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "\r
\r
Hello there, how may I assist you today?"
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 9,
        "completion_tokens": 12,
        "total_tokens": 21
    }
}
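The fields above can be extracted from a decoded response as follows (a sketch using the example payload; field names follow the response table):

```python
import json

# The example response body from above, as returned by the API.
raw = '''
{
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "created": 1677652288,
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant",
                        "content": "\\n\\nHello there, how may I assist you today?"},
            "finish_reason": "stop"
        }
    ],
    "usage": {"prompt_tokens": 9, "completion_tokens": 12, "total_tokens": 21}
}
'''

resp = json.loads(raw)

# Pull out the assistant reply and the token accounting.
reply = resp["choices"][0]["message"]["content"].strip()
finish = resp["choices"][0]["finish_reason"]
total = resp["usage"]["total_tokens"]
```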

FAQs

1. What is the purpose of the reasoning_effort parameter?

reasoning_effort is used to control how much computational resource and time a model with advanced reasoning capabilities (such as OpenAI's o1/o4 series) spends on "internal thinking" before providing a final answer. Typically, it supports three levels: low, medium, and high. Setting it to high allows the model to perform deeper logical derivation, making it suitable for complex math or coding problems, though response latency will increase.
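A quick way to feel the latency/depth trade-off is to issue the same prompt once per effort level. Only the payload construction is sketched here; model name and prompt are illustrative:

```python
PROMPT = "Prove that the sum of two even integers is even."

# One payload per supported effort level; higher effort trades
# response latency for deeper internal reasoning.
payloads = [
    {
        "model": "o4-mini",
        "messages": [{"role": "user", "content": PROMPT}],
        "reasoning_effort": effort,
    }
    for effort in ("low", "medium", "high")
]
```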

2. Why don't I receive a standard JSON response when stream: true is enabled?

When streaming is enabled (stream: true), the API uses the Server-Sent Events (SSE) protocol to return data chunks continuously. Each chunk is sent as data: {...}, and a data: [DONE] message is sent when the transmission is complete. Developers must read the stream line-by-line and parse the JSON strings following the data: prefix to reconstruct the full response.
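A minimal sketch of that line-by-line parsing. The data: prefix and [DONE] sentinel come from the description above; the chunk layout (choices[].delta.content) assumes the common chat.completion.chunk shape:

```python
import json

def parse_sse_stream(lines):
    """Yield content deltas from an SSE stream of chat-completion chunks.

    `lines` is any iterable of decoded text lines, e.g. a streaming
    response body read line by line.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue                      # skip blank keep-alive lines
        data = line[len("data: "):]
        if data == "[DONE]":              # end-of-stream sentinel
            return
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            yield delta["content"]

# Example: reassemble a reply from two content chunks.
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo!"}}]}',
    'data: [DONE]',
]
reply = "".join(parse_sse_stream(sample))
```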

3. The documentation suggests modifying either temperature or top_p, but not both. Why?

Both parameters control the randomness and diversity of the model's output. temperature reshapes the probability distribution by scaling logits, while top_p (nucleus sampling) restricts the candidate pool to the smallest set of tokens whose cumulative probability reaches the threshold. Adjusting both simultaneously makes the model's behavior difficult to predict and control. Best practice is to keep one at its default value and adjust the other.
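To make the top_p side concrete, here is a toy sketch of nucleus truncation over a hand-written token distribution (illustrative only, not the API's sampler):

```python
def nucleus_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p; all remaining tokens are excluded from sampling."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append(token)
        cumulative += p
        if cumulative >= top_p:
            break
    return kept

# With top_p = 0.9, the low-probability tail ("d") is dropped:
# a (0.5) + b (0.3) + c (0.15) = 0.95 >= 0.9.
candidates = nucleus_filter({"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}, 0.9)
```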

4. What does it mean if finish_reason returns "length"?

This indicates that the model's response was forcibly truncated. This usually happens for two reasons: either the number of generated tokens reached the max_tokens limit set in the request, or the total tokens (prompt + completion) exceeded the model's maximum context window limit. In such cases, consider shortening the conversation history or increasing the max_tokens parameter.
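A small sketch of guarding against truncation in client code (field names follow the response table above; the helper name is illustrative):

```python
def is_truncated(response):
    """Return True if any choice stopped because of the token limit."""
    return any(c["finish_reason"] == "length" for c in response["choices"])

# A response cut off at max_tokens reports finish_reason "length";
# the usual remedies are raising max_tokens or trimming the history.
resp = {"choices": [{"finish_reason": "length",
                     "message": {"role": "assistant", "content": "..."}}]}
truncated = is_truncated(resp)
```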

5. How can I force the model to return data in JSON format?

You can enable JSON mode by passing "response_format": {"type": "json_object"} in the request body. Extremely Important: When enabling this mode, you must also explicitly instruct the model to output JSON via natural language in the messages (usually in the system or user prompt). Otherwise, the model might generate an infinite sequence of whitespace until it hits the token limit.
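A sketch of a JSON-mode request body that pairs response_format with the required natural-language instruction (the prompt wording is illustrative):

```python
import json

payload = {
    "model": "o4-mini",
    "response_format": {"type": "json_object"},
    "messages": [
        # The explicit "respond with JSON" instruction must accompany
        # response_format; otherwise generation may degenerate into
        # whitespace until the token limit is reached.
        {"role": "system",
         "content": "You are a helpful assistant. Always respond with a valid JSON object."},
        {"role": "user",
         "content": 'List three primary colors as a JSON array under the key "colors".'}
    ],
}

body = json.dumps(payload)
```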