Create Chat Completions - DeepSeek v3.1 Thinking Control (Streaming)

This endpoint creates chat completions with the DeepSeek v3.1 model. It supports streaming output via Server-Sent Events (SSE) and fine-grained control over the model's depth of reasoning (thinking).

Endpoint Details

  • Status: Released
  • Method: POST
  • URL: https://api.codingplanx.ai/v1/chat/completions

Request Headers

  • Content-Type (String, required): application/json. Data format.
  • Accept (String, required): application/json. Accepted response format.
  • Authorization (String, optional): Bearer {{YOUR_API_KEY}}. Authentication credential (usually required for actual API calls).

(Note: The X-Forwarded-Host header is currently disabled)


Request Body

Content-Type: application/json

  • model (String, required): ID of the model to use. Example: deepseek-v3-1-250821
  • messages (Array of Objects, required): A list of messages comprising the conversation. Each message contains:
      • role (String, required): The role of the message's author (e.g., system, user, assistant).
      • content (String, required): The contents of the message.
  • max_tokens (Integer, optional): The maximum number of tokens to generate in the completion. The total length of input and output tokens is limited by the model's context length.
  • temperature (Number, optional): Sampling temperature, between 0 and 2. Higher values (e.g., 0.8) make the output more random, while lower values (e.g., 0.2) make it more focused and deterministic.
  • stream (Boolean, optional): If set to true, partial message deltas are sent as a stream using Server-Sent Events (SSE). The stream is terminated by a data: [DONE] message.
  • stream_options (Object, optional): Options for the streaming response. Only applicable when stream is set to true.
      • include_usage (Boolean, optional): If set to true, an additional chunk is streamed before the data: [DONE] message. The usage field on this chunk shows the token usage statistics for the entire request.
  • thinking (Object, optional): For models supporting depth of reasoning, controls whether the thinking capability is enabled.
      • type (String, optional): enabled forcefully enables thinking; disabled forcefully disables it; auto lets the model decide.

Request Example

{
  "model": "deepseek-v3-1-250821",
  "max_tokens": 1000,
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Hello"
    }
  ],
  "temperature": 1.0,
  "stream": true,
  "stream_options": {
    "include_usage": true
  },
  "thinking": {
    "type": "enabled"
  }
}
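For reference, the request body above can be assembled programmatically. The sketch below is a minimal Python helper; the build_chat_request function and its defaults are illustrative, not part of the API:

```python
import json

# Illustrative helper, not part of the API: assembles the request body
# shown above. Defaults mirror the example values.
def build_chat_request(user_message: str, *, thinking: str = "enabled") -> dict:
    return {
        "model": "deepseek-v3-1-250821",
        "max_tokens": 1000,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
        "temperature": 1.0,
        "stream": True,
        "stream_options": {"include_usage": True},
        "thinking": {"type": thinking},  # "enabled" | "disabled" | "auto"
    }

# Serialized body, ready to POST.
body = json.dumps(build_chat_request("Hello"))
```

The resulting JSON string can then be POSTed to https://api.codingplanx.ai/v1/chat/completions with the headers listed above.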

Response

HTTP Status Code: 200 OK
Content-Type: application/json

(Note: When stream is true, the response is delivered as an SSE stream in which each data chunk is a JSON string with the structure below; in chunks, the object type is chat.completion.chunk and the message appears as delta. The fields below describe the standard JSON structure of a non-streaming response, or of a fully assembled streamed response.)

  • id (String): A unique identifier for the chat completion.
  • object (String): The object type, which is always chat.completion (or chat.completion.chunk for streamed chunks).
  • created (Integer): The Unix timestamp (in seconds) of when the chat completion was created.
  • choices (Array of Objects): A list of chat completion choices. Each choice contains:
      • index (Integer): The index of the choice in the list of choices.
      • message (Object): A chat completion message generated by the model (named delta in streamed chunks). It contains:
          • role (String): The role of the author (usually assistant).
          • content (String): The contents of the generated message.
      • finish_reason (String): The reason the model stopped generating tokens (e.g., stop, length).
  • usage (Object): Usage statistics for the completion request.
      • prompt_tokens (Integer): Number of tokens in the prompt.
      • completion_tokens (Integer): Number of tokens in the generated completion.
      • total_tokens (Integer): Total number of tokens used in the request (prompt + completion).

Response Example (JSON)

{
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "created": 1677652288,
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "\r
\r
Hello there, how may I assist you today?"
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 9,
        "completion_tokens": 12,
        "total_tokens": 21
    }
}
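In streaming mode, each SSE event arrives as a text line of the form data: {json}. A minimal parsing sketch (the parse_sse_line helper is illustrative, not part of any SDK):

```python
import json
from typing import Optional

def parse_sse_line(line: str) -> Optional[dict]:
    """Decode one SSE line into a chunk dict.

    Returns None for blank keep-alive lines and for the final
    'data: [DONE]' terminator described above.
    """
    line = line.strip()
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None
    return json.loads(payload)

# Example chunk in the chat.completion.chunk shape described above:
chunk = parse_sse_line(
    'data: {"id":"chatcmpl-123","object":"chat.completion.chunk",'
    '"choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}'
)
```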

Frequently Asked Questions (FAQs)

Q1: How can I get token usage statistics in a streaming (SSE) request?
A1: Set stream to true in the request body and configure stream_options: {"include_usage": true}. The server will then push one additional data chunk right before the data: [DONE] termination marker. The choices array in this chunk is empty, but its usage field contains the complete token statistics for the entire request.
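A sketch of how that final usage chunk might be consumed when assembling a stream; chunk shapes follow the fields described above, and the collect_stream helper name is illustrative:

```python
def collect_stream(chunks):
    """Concatenate delta content and capture the usage chunk, which
    arrives with an empty choices array just before [DONE]."""
    text_parts, usage = [], None
    for chunk in chunks:
        if chunk.get("usage") is not None and not chunk.get("choices"):
            usage = chunk["usage"]  # the include_usage statistics chunk
            continue
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                text_parts.append(delta["content"])
    return "".join(text_parts), usage

# Simulated stream matching the shapes documented above:
chunks = [
    {"choices": [{"index": 0, "delta": {"content": "Hel"}}]},
    {"choices": [{"index": 0, "delta": {"content": "lo"}}]},
    {"choices": [], "usage": {"prompt_tokens": 9,
                              "completion_tokens": 12,
                              "total_tokens": 21}},
]
text, usage = collect_stream(chunks)
```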

Q2: What is the difference between enabled and auto in the thinking parameter?
A2:

  • enabled: Forces the model to perform "Chain of Thought" reasoning before returning the final answer. This generally yields higher-quality reasoning results but may increase response time (Time to First Token) and output token consumption.
  • auto: Delegates the decision to the model. The model will automatically determine whether deep thinking is necessary based on the complexity of the questions in your messages.

Q3: Why did the request return a 401 Unauthorized error?
A3: This usually indicates an authentication failure. Check that the Authorization field in your Request Headers is formatted correctly as Bearer {{YOUR_API_KEY}}, and ensure that your API Key is valid and has not expired.

Q4: Will the thought process content be returned if thinking is enabled?
A4: In standard OpenAI-compatible implementations, when streaming is used the thought process may be returned inside specific markers (such as <think>...</think> tags) or via a dedicated chunk field. When parsing the streamed data on the frontend, handle the reasoning portion of the content separately so you can accurately display a "Thinking..." UI state to the user.
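If the thought process does arrive inline as <think>...</think> tags, it can be separated from the visible answer as sketched below. The tag convention is an assumption, as Q4 notes; verify it against actual responses before relying on it:

```python
import re

def split_thinking(content: str):
    """Separate <think>...</think> reasoning from the visible answer.

    The <think> tag format is an assumption based on common
    OpenAI-compatible conventions, not a documented guarantee.
    """
    thoughts = re.findall(r"<think>(.*?)</think>", content, flags=re.DOTALL)
    answer = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL).strip()
    return "".join(thoughts).strip(), answer

thoughts, answer = split_thinking(
    "<think>The user greeted me; respond politely.</think>Hello! How can I help?"
)
```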

Q5: What does it mean if I encounter finish_reason: "length"?
A5: The model's output reached the max_tokens limit you set, or hit the model's own maximum context window, so the response was truncated. If the answer is incomplete, try increasing the max_tokens value.
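One way a client might act on finish_reason, sketched in Python; the retry policy and the 8192 cap are illustrative assumptions, since the real context limit is model-specific:

```python
def was_truncated(choice: dict) -> bool:
    """True when generation stopped at the max_tokens limit, per Q5."""
    return choice.get("finish_reason") == "length"

def next_max_tokens(current: int, cap: int = 8192) -> int:
    """Suggest a larger max_tokens for a retry after truncation,
    doubling up to an assumed cap (verify against the model's
    actual context window)."""
    return min(current * 2, cap)
```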