Create Chat Completions - DeepSeek v3.1 Thinking Control (Streaming)
This endpoint is used to create chat conversations with the DeepSeek v3.1 model. It supports streaming output via Server-Sent Events (SSE) and fine-grained control over the model's depth of reasoning (Thinking).
Endpoint Details
- Status: Released
- Method: POST
- URL: https://api.codingplanx.ai/v1/chat/completions
Request Headers
| Parameter | Required | Type | Example | Description |
|---|---|---|---|---|
| Content-Type | Yes | String | application/json | Data format |
| Accept | Yes | String | application/json | Accepted response format |
| Authorization | No | String | Bearer {{YOUR_API_KEY}} | Authentication credential (usually required for actual API calls) |
(Note: The X-Forwarded-Host header is currently disabled)
Request Body
Content-Type: application/json
| Parameter | Required | Type | Description |
|---|---|---|---|
| model | Yes | String | ID of the model to use. Example: deepseek-v3-1-250821 |
| messages | Yes | Array of Objects | A list of messages comprising the conversation. |
| └ role | Yes | String | The role of the message's author (e.g., system, user, assistant). |
| └ content | Yes | String | The contents of the message. |
| max_tokens | No | Integer | The maximum number of tokens to generate in the completion. The total length of input and output tokens is limited by the model's context length. |
| temperature | No | Number | Sampling temperature, between 0 and 2. Higher values (e.g., 0.8) make the output more random, while lower values (e.g., 0.2) make it more focused and deterministic. |
| stream | No | Boolean | If set to true, partial message deltas are sent as a stream using Server-Sent Events (SSE). The stream is terminated by a data: [DONE] message. |
| stream_options | No | Object | Options for the streaming response. Only applicable when stream is set to true. |
| └ include_usage | No | Boolean | If set to true, an additional chunk is streamed before the data: [DONE] message. The usage field on this chunk shows the token usage statistics for the entire request. |
| thinking | No | Object | For models supporting depth of reasoning, this field controls whether the thinking capability is enabled. |
| └ type | No | String | enabled: Forcefully enable thinking.<br>disabled: Forcefully disable thinking.<br>auto: Allow the model to decide. |
Request Example
```json
{
  "model": "deepseek-v3-1-250821",
  "max_tokens": 1000,
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Hello"
    }
  ],
  "temperature": 1.0,
  "stream": true,
  "stream_options": {
    "include_usage": true
  },
  "thinking": {
    "type": "enabled"
  }
}
```
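As an illustrative sketch using only the Python standard library, the streaming request above could be issued and its SSE lines printed as follows. The endpoint URL matches the one documented here; the API key is a placeholder you must replace:

```python
import json
import urllib.request

API_URL = "https://api.codingplanx.ai/v1/chat/completions"
API_KEY = "YOUR_API_KEY"  # placeholder: substitute your real key

def build_request(model: str, user_message: str) -> urllib.request.Request:
    """Assemble a streaming chat-completion request with thinking enabled."""
    body = {
        "model": model,
        "max_tokens": 1000,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
        "temperature": 1.0,
        "stream": True,
        "stream_options": {"include_usage": True},
        "thinking": {"type": "enabled"},
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )

if __name__ == "__main__":
    # Each SSE line looks like "data: {...}"; the stream ends with "data: [DONE]".
    with urllib.request.urlopen(build_request("deepseek-v3-1-250821", "Hello")) as resp:
        for raw in resp:
            line = raw.decode("utf-8").strip()
            if not line.startswith("data: "):
                continue
            payload = line[len("data: "):]
            if payload == "[DONE]":
                break
            print(json.loads(payload))
```

A third-party HTTP client such as requests would simplify the loop, but the stdlib version keeps the sketch dependency-free.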
Response
HTTP Status Code: 200 OK
Content-Type: application/json
(Note: When stream is true, the response is an SSE stream in which each data chunk is a JSON string with the structure below. The following describes the standard JSON structure for non-streaming requests, or for a streaming response after it has been fully assembled.)
| Field | Type | Description |
|---|---|---|
| id | String | A unique identifier for the chat completion. |
| object | String | The object type, which is always chat.completion or chat.completion.chunk. |
| created | Integer | The Unix timestamp (in seconds) of when the chat completion was created. |
| choices | Array of Objects | A list of chat completion choices. |
| └ index | Integer | The index of the choice in the list of choices. |
| └ message | Object | A chat completion message generated by the model (appears as delta in streaming responses). |
| └ role | String | The role of the author (usually assistant). |
| └ content | String | The contents of the generated message. |
| └ finish_reason | String | The reason the model stopped generating tokens (e.g., stop, length). |
| usage | Object | Usage statistics for the completion request. |
| └ prompt_tokens | Integer | Number of tokens in the prompt. |
| └ completion_tokens | Integer | Number of tokens in the generated completion. |
| └ total_tokens | Integer | Total number of tokens used in the request (prompt + completion). |
Response Example (JSON)
```json
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1677652288,
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "\n\nHello there, how may I assist you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 12,
    "total_tokens": 21
  }
}
```
Frequently Asked Questions (FAQs)
Q1: How can I get token usage statistics in a streaming (SSE) request?
A1: You need to set stream to true in the request body and configure stream_options: {"include_usage": true}. By doing this, the server will push an additional data chunk right before sending the data: [DONE] termination marker. The choices array in this chunk will be empty, but the usage field will contain the complete token statistics for the entire request.
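As a minimal sketch of the client-side handling described above (the function name and the in-memory list of SSE lines are hypothetical), assembling the streamed content deltas and picking up the trailing usage chunk could look like this:

```python
import json

def assemble_stream(sse_lines):
    """Collect content deltas from SSE 'data:' lines; return (text, usage).

    Per the answer above, the usage statistics arrive in one extra chunk,
    whose choices array is empty, pushed just before 'data: [DONE]'.
    """
    parts = []
    usage = None
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # stream terminator
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                parts.append(delta["content"])
    return "".join(parts), usage
```

In practice you would feed this the decoded lines of the HTTP response body rather than a pre-built list.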
Q2: What is the difference between enabled and auto in the thinking parameter?
A2:
- enabled: Forces the model to perform "Chain of Thought" reasoning before returning the final answer. This generally yields higher-quality reasoning but may increase response time (Time to First Token) and output token consumption.
- auto: Delegates the decision to the model, which automatically determines whether deep thinking is necessary based on the complexity of the questions in your messages.
Q3: Why did the request return a 401 Unauthorized error?
A3: This usually indicates an authentication failure. Please check if the Authorization field in your Request Headers is formatted correctly as Bearer {{YOUR_API_KEY}}, and ensure that your API Key is valid and has not expired.
Q4: Will the thought process content be returned if thinking is enabled?
A4: According to standard OpenAI-compatible format implementations, if streaming output is used, the thought process may be returned within specific identifiers (such as <think>...</think> tags) or via a dedicated chunk field. When parsing the streaming data on the frontend, make sure to handle the reasoning content separately so you can accurately display a "Thinking..." UI state to the user.
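As a sketch of that separation step (the <think>...</think> delimiter is an assumption taken from the answer above, not a guarantee of this API; verify it against real responses), splitting reasoning from the final answer in an assembled message could look like:

```python
import re

# Assumption: reasoning is wrapped in <think>...</think> tags inside content.
# Confirm the actual delimiter format from real responses before relying on it.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(content: str):
    """Separate reasoning text from the final answer in one assembled message."""
    thoughts = THINK_RE.findall(content)
    answer = THINK_RE.sub("", content).strip()
    return thoughts, answer
```

While streaming, the same idea applies incrementally: buffer text until the closing tag is seen, showing a "Thinking..." indicator in the meantime.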
Q5: What does it mean if I encounter finish_reason: "length"?
A5: This indicates that the model's output has reached the max_tokens limit you set, or it has hit the maximum context window limit of the model itself, causing the response to be truncated. If the answer is incomplete, you can try increasing the max_tokens value.
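A small sketch of the retry heuristic described above (the doubling factor and the cap are arbitrary illustrative choices, not part of the API):

```python
def next_max_tokens(finish_reason: str, current_max: int, cap: int = 8192):
    """Return an increased max_tokens to retry with if output was truncated.

    Returns None when the completion finished normally, so no retry is needed.
    The cap stands in for the model's context-window limit, which truncation
    can also hit regardless of max_tokens.
    """
    if finish_reason != "length":
        return None
    return min(current_max * 2, cap)
```

For example, a request truncated at max_tokens=1000 would be retried with 2000, never exceeding the cap.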