Create Chat Completions - DeepSeek v3.1 Thinking Control (Streaming)

This endpoint creates chat completions with the DeepSeek v3.1 model. It supports streaming output via Server-Sent Events (SSE) and fine-grained control over the model's depth of reasoning (thinking).

Endpoint Details

  • Status: Released
  • Method: POST
  • URL: https://api.codingplanx.ai/v1/chat/completions

Request Headers

  • Content-Type (String, required): application/json. Data format.
  • Accept (String, required): application/json. Accepted response format.
  • Authorization (String, optional): Bearer {{YOUR_API_KEY}}. Authentication credential (usually required for actual API calls).

(Note: The X-Forwarded-Host header is currently disabled)


Request Body

Content-Type: application/json

  • model (String, required): ID of the model to use. Example: deepseek-v3-1-250821
  • messages (Array of Objects, required): A list of messages comprising the conversation. Each message contains:
      • role (String, required): The role of the message's author (e.g., system, user, assistant).
      • content (String, required): The contents of the message.
  • max_tokens (Integer, optional): The maximum number of tokens to generate in the completion. The total length of input and output tokens is limited by the model's context length.
  • temperature (Number, optional): Sampling temperature, between 0 and 2. Higher values (e.g., 0.8) make the output more random, while lower values (e.g., 0.2) make it more focused and deterministic.
  • stream (Boolean, optional): If set to true, partial message deltas are sent as a stream using Server-Sent Events (SSE). The stream is terminated by a data: [DONE] message.
  • stream_options (Object, optional): Options for the streaming response. Only applicable when stream is set to true.
      • include_usage (Boolean, optional): If set to true, an additional chunk is streamed before the data: [DONE] message. The usage field on this chunk shows the token usage statistics for the entire request.
  • thinking (Object, optional): For models supporting depth of reasoning, controls whether the thinking capability is enabled.
      • type (String, optional): enabled forcefully enables thinking; disabled forcefully disables it; auto lets the model decide.

Request Example

{
  "model": "deepseek-v3-1-250821",
  "max_tokens": 1000,
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Hello"
    }
  ],
  "temperature": 1.0,
  "stream": true,
  "stream_options": {
    "include_usage": true
  },
  "thinking": {
    "type": "enabled"
  }
}
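For reference, the request body above can be assembled programmatically. The sketch below is a minimal Python helper; the build_chat_request function and its defaults are illustrative, not part of the API:

```python
import json

# Illustrative helper, not part of the API: assembles the request body
# shown above. Defaults mirror the example values.
def build_chat_request(user_message: str, *, thinking: str = "enabled") -> dict:
    return {
        "model": "deepseek-v3-1-250821",
        "max_tokens": 1000,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
        "temperature": 1.0,
        "stream": True,
        "stream_options": {"include_usage": True},
        "thinking": {"type": thinking},  # "enabled" | "disabled" | "auto"
    }

# Serialized body, ready to POST.
body = json.dumps(build_chat_request("Hello"))
```

The resulting JSON string can then be POSTed to https://api.codingplanx.ai/v1/chat/completions with the headers listed above.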

Response

HTTP Status Code: 200 OK
Content-Type: application/json

(Note: When stream is true, the response is delivered as an SSE stream in which each data chunk is a JSON string with the structure below; in chunks, the object type is chat.completion.chunk and the message appears as delta. The fields below describe the standard JSON structure of a non-streaming response, or of a fully assembled streamed response.)

  • id (String): A unique identifier for the chat completion.
  • object (String): The object type, which is always chat.completion (or chat.completion.chunk for streamed chunks).
  • created (Integer): The Unix timestamp (in seconds) of when the chat completion was created.
  • choices (Array of Objects): A list of chat completion choices. Each choice contains:
      • index (Integer): The index of the choice in the list of choices.
      • message (Object): A chat completion message generated by the model (named delta in streamed chunks). It contains:
          • role (String): The role of the author (usually assistant).
          • content (String): The contents of the generated message.
      • finish_reason (String): The reason the model stopped generating tokens (e.g., stop, length).
  • usage (Object): Usage statistics for the completion request.
      • prompt_tokens (Integer): Number of tokens in the prompt.
      • completion_tokens (Integer): Number of tokens in the generated completion.
      • total_tokens (Integer): Total number of tokens used in the request (prompt + completion).

Response Example (JSON)

{
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "created": 1677652288,
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "\r
\r
Hello there, how may I assist you today?"
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 9,
        "completion_tokens": 12,
        "total_tokens": 21
    }
}
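In streaming mode, each SSE event arrives as a text line of the form data: {json}. A minimal parsing sketch (the parse_sse_line helper is illustrative, not part of any SDK):

```python
import json
from typing import Optional

def parse_sse_line(line: str) -> Optional[dict]:
    """Decode one SSE line into a chunk dict.

    Returns None for blank keep-alive lines and for the final
    'data: [DONE]' terminator described above.
    """
    line = line.strip()
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None
    return json.loads(payload)

# Example chunk in the chat.completion.chunk shape described above:
chunk = parse_sse_line(
    'data: {"id":"chatcmpl-123","object":"chat.completion.chunk",'
    '"choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}'
)
```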

Frequently Asked Questions (FAQs)

Q1: How can I get token usage statistics in a streaming (SSE) request?
A1: Set stream to true in the request body and configure stream_options: {"include_usage": true}. The server will then push one additional data chunk right before the data: [DONE] termination marker. The choices array in this chunk is empty, but its usage field contains the complete token statistics for the entire request.
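A sketch of how that final usage chunk might be consumed when assembling a stream; chunk shapes follow the fields described above, and the collect_stream helper name is illustrative:

```python
def collect_stream(chunks):
    """Concatenate delta content and capture the usage chunk, which
    arrives with an empty choices array just before [DONE]."""
    text_parts, usage = [], None
    for chunk in chunks:
        if chunk.get("usage") is not None and not chunk.get("choices"):
            usage = chunk["usage"]  # the include_usage statistics chunk
            continue
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                text_parts.append(delta["content"])
    return "".join(text_parts), usage

# Simulated stream matching the shapes documented above:
chunks = [
    {"choices": [{"index": 0, "delta": {"content": "Hel"}}]},
    {"choices": [{"index": 0, "delta": {"content": "lo"}}]},
    {"choices": [], "usage": {"prompt_tokens": 9,
                              "completion_tokens": 12,
                              "total_tokens": 21}},
]
text, usage = collect_stream(chunks)
```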

Q2: What is the difference between enabled and auto in the thinking parameter?
A2:

  • enabled: Forces the model to perform "Chain of Thought" reasoning before returning the final answer. This generally yields higher-quality reasoning results but may increase response time (Time to First Token) and output token consumption.
  • auto: Delegates the decision to the model. The model will automatically determine whether deep thinking is necessary based on the complexity of the questions in your messages.

Q3: Why did the request return a 401 Unauthorized error?
A3: This usually indicates an authentication failure. Check that the Authorization field in your Request Headers is formatted correctly as Bearer {{YOUR_API_KEY}}, and ensure that your API Key is valid and has not expired.

Q4: Will the thought process content be returned if thinking is enabled?
A4: In standard OpenAI-compatible implementations, when streaming is used the thought process may be returned inside specific markers (such as <think>...</think> tags) or via a dedicated chunk field. When parsing the streamed data on the frontend, handle the reasoning portion of the content separately so you can accurately display a "Thinking..." UI state to the user.
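If the thought process does arrive inline as <think>...</think> tags, it can be separated from the visible answer as sketched below. The tag convention is an assumption, as Q4 notes; verify it against actual responses before relying on it:

```python
import re

def split_thinking(content: str):
    """Separate <think>...</think> reasoning from the visible answer.

    The <think> tag format is an assumption based on common
    OpenAI-compatible conventions, not a documented guarantee.
    """
    thoughts = re.findall(r"<think>(.*?)</think>", content, flags=re.DOTALL)
    answer = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL).strip()
    return "".join(thoughts).strip(), answer

thoughts, answer = split_thinking(
    "<think>The user greeted me; respond politely.</think>Hello! How can I help?"
)
```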

Q5: What does it mean if I encounter finish_reason: "length"?
A5: The model's output reached the max_tokens limit you set, or hit the model's own maximum context window, so the response was truncated. If the answer is incomplete, try increasing the max_tokens value.
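One way a client might act on finish_reason, sketched in Python; the retry policy and the 8192 cap are illustrative assumptions, since the real context limit is model-specific:

```python
def was_truncated(choice: dict) -> bool:
    """True when generation stopped at the max_tokens limit, per Q5."""
    return choice.get("finish_reason") == "length"

def next_max_tokens(current: int, cap: int = 8192) -> int:
    """Suggest a larger max_tokens for a retry after truncation,
    doubling up to an assumed cap (verify against the model's
    actual context window)."""
    return min(current * 2, cap)
```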