Create Chat Completion (Streaming)
Official Documentation Reference: OpenAI Chat Create
Endpoint Description
Given a prompt, the model will return one or more predicted chat completions. This endpoint supports both standard and streaming (SSE) responses, and can also return the log probabilities of alternative tokens at each position.
Request Details
- Method: POST
- Endpoint: https://api.codingplanx.ai/v1/chat/completions
Headers
| Parameter | Required | Type | Example | Description |
|---|---|---|---|---|
| Content-Type | Yes | string | application/json | Data format |
| Accept | Yes | string | application/json | Accepted data format |
| Authorization | No* | string | Bearer {{YOUR_API_KEY}} | Authentication token (typically required for actual API calls) |
| X-Forwarded-Host | No | string | localhost:5173 | Proxy hostname (typically does not need to be passed manually) |
Request Body
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | ID of the model to use. See the model endpoint compatibility table for details on which models work with the Chat API. |
| messages | array | Yes | A list of messages comprising the conversation so far. Each message contains role and content fields. |
| tools | array | No | A list of tools the model may call. Currently, only functions are supported as a tool. |
| tool_choice | string/object | No | Controls which (if any) function is called by the model. none means the model will not call a function, auto means the model chooses automatically, or you can force a specific function. |
| extra_body | object | No | Additional parameters. Contains enable_thinking (boolean) to set whether to enable thinking mode (requires model support). |
| temperature | number | No | Sampling temperature, between 0 and 2. Higher values like 0.8 make the output more random, while lower values like 0.2 make it more focused and deterministic. We generally recommend altering this or top_p but not both. |
| top_p | number | No | Nucleus sampling parameter. 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both. |
| n | integer | No | How many chat completion choices to generate for each input message. Defaults to 1. |
| stream | boolean | No | Whether to enable streaming output. If set to true, partial message deltas are sent as Server-Sent Events (SSE) and the stream terminates with a data: [DONE] message. |
| stream_options | object | No | Options for streaming responses. Setting include_usage to true adds a final chunk containing token usage statistics. Only set this when stream is true. |
| stop | string/array | No | Up to 4 sequences where the API will stop generating further tokens. Defaults to null. |
| max_tokens | integer | No | The maximum number of tokens to generate in the chat completion. The total length of input tokens and generated tokens is limited by the model's context length. |
| presence_penalty | number | No | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
| frequency_penalty | number | No | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
| logit_bias | object | No | Modify the likelihood of specified tokens appearing in the completion. Accepts a JSON object that maps tokens (specified by their token ID) to an associated bias value from -100 to 100. |
| user | string | No | A unique identifier representing your end-user, which can help monitor and detect abuse. |
| response_format | object | No | Specifies the format that the model must output. Setting to { "type": "json_object" } enables JSON mode. Note: when using JSON mode, you must also instruct the model to produce JSON via a system or user message, otherwise it may generate an endless stream of whitespace. |
| seed | integer | No | Sets the seed for deterministic sampling (Beta feature). |
Request Example
```json
{
  "model": "gpt-5-mini",
  "max_tokens": 1000,
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Hello"
    }
  ],
  "temperature": 1.0,
  "stream": true,
  "stream_options": {
    "include_usage": true
  }
}
```
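As a rough sketch, the headers and body above could be assembled in Python's standard library before sending; `build_chat_request` is a hypothetical helper and `YOUR_API_KEY` a placeholder, not part of the API itself:

```python
import json

# Hypothetical helper that builds the headers and JSON body documented
# above for POST /v1/chat/completions. It does not perform the HTTP call.
def build_chat_request(api_key, user_text, stream=True):
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "Authorization": f"Bearer {api_key}",  # see Headers table above
    }
    body = {
        "model": "gpt-5-mini",
        "max_tokens": 1000,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_text},
        ],
        "temperature": 1.0,
        "stream": stream,
        "stream_options": {"include_usage": True},
    }
    return headers, json.dumps(body)

headers, payload = build_chat_request("YOUR_API_KEY", "Hello")
```

The returned `payload` can then be posted to the endpoint with any HTTP client; with `stream: true` the response must be consumed as an SSE stream rather than a single JSON document.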
Response Details
- Content-Type: application/json (non-streaming) or text/event-stream (when stream: true)
Response Parameters (Non-streaming Structure)
| Parameter | Type | Description |
|---|---|---|
| id | string | A unique identifier for the chat completion. |
| object | string | The object type: chat.completion for non-streaming responses, or chat.completion.chunk for streaming chunks. |
| created | integer | The Unix timestamp (in seconds) of when the chat completion was created. |
| choices | array | A list of chat completion choices. |
| choices[].index | integer | The index of the choice in the list of choices. |
| choices[].message | object | A chat completion message generated by the model, containing role and content. In streaming responses, this field is named delta. |
| choices[].finish_reason | string | The reason the model stopped generating tokens (e.g., stop, length). |
| usage | object | Usage statistics for the completion request. |
| usage.prompt_tokens | integer | Number of tokens in the prompt. |
| usage.completion_tokens | integer | Number of tokens generated by the model. |
| usage.total_tokens | integer | Total number of tokens used in the request (prompt + completion). |
Response Example (Non-streaming / Aggregated Streaming Result)
```json
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1677652288,
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "\n\nHello there, how may I assist you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 12,
    "total_tokens": 21
  }
}
```
(Note: If stream: true is specified in the request, the response body will be a stream of Server-Sent Events (SSE) separated by newlines, with each line starting with data: {...} and finally terminating with data: [DONE].)
FAQs (Frequently Asked Questions)
Q1: How do I parse a streaming response when stream: true is enabled?
A: When stream: true is enabled, the API returns data via the Server-Sent Events (SSE) protocol. The client must read the response stream line by line, extract the lines beginning with the data: prefix, and parse the remaining text as JSON. Please note that the final message of the stream is always data: [DONE], which signals the end of the stream; ensure your parser filters out this termination marker rather than attempting to parse it as JSON.
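The parsing steps above can be sketched as follows, assuming each event arrives as a `data: {...}` line and the delta/chunk shape shown in the response tables; the sample lines are illustrative, not real API output:

```python
import json

# Sketch of an SSE parser for this endpoint: yields each JSON chunk and
# stops at the 'data: [DONE]' terminator instead of parsing it.
def parse_sse_lines(lines):
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker, not JSON
        yield json.loads(data)

# Aggregate streamed delta fragments into the full assistant message.
sample = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    '',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    'data: [DONE]',
]
text = "".join(
    chunk["choices"][0]["delta"].get("content", "")
    for chunk in parse_sse_lines(sample)
)
# text == "Hello"
```

In a real client, `lines` would be the HTTP response body iterated line by line (e.g., `resp.iter_lines()` in many HTTP libraries); note that the `delta` of the first chunk may contain only a `role` and no `content`, which is why `.get("content", "")` is used.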
Q2: Why does my request hang or return an infinite blank stream after setting response_format to JSON?
A: When you enable JSON mode by using { "type": "json_object" }, you must explicitly instruct the model to "output in JSON format" within the messages array (typically in the system prompt). Without this explicit text instruction, the model may get stuck in a loop generating endless whitespace characters until it triggers the max_tokens limit.
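A minimal request body following this advice might look like the sketch below; the model name is reused from the example above, and the message wording is only one way to phrase the required instruction:

```python
import json

# JSON-mode request body: response_format enables JSON mode, and the
# system message explicitly asks for JSON, as required above.
body = {
    "model": "gpt-5-mini",
    "response_format": {"type": "json_object"},
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant. Respond only with valid JSON.",
        },
        {
            "role": "user",
            "content": "List three primary colors as a JSON array under the key 'colors'.",
        },
    ],
}
payload = json.dumps(body)
```

Setting a reasonable max_tokens alongside this is also prudent, so that even a misbehaving generation terminates quickly.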
Q3: Can I adjust both temperature and top_p parameters at the same time?
A: The official documentation strongly advises against adjusting both parameters simultaneously. Both are mechanisms designed to control the randomness of the model's output. If you want a more stable and predictable output, lower the temperature or decrease top_p; if you desire a more creative response, increase one of them. Modifying both at the same time can lead to unpredictable and suboptimal results.
Q4: What is the purpose of the extra_body.enable_thinking parameter?
A: This is a specialized extension parameter for this endpoint. When passing {"enable_thinking": true}, if the backend model being used supports "Thinking Mode" or "Chain of Thought (CoT)" (such as certain reasoning-enhanced LLMs), the model will initiate a deep thinking process before outputting the final response. If the current model does not support this feature, the parameter is typically ignored.
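A request enabling this flag might be shaped as follows; this is a sketch of the documented parameter, with placeholder model and prompt:

```python
# Sketch: passing extra_body.enable_thinking as described above.
# If the backend model does not support thinking mode, the flag is
# typically ignored rather than rejected.
body = {
    "model": "gpt-5-mini",  # placeholder; use a reasoning-capable model
    "messages": [
        {"role": "user", "content": "Prove that 17 is a prime number."},
    ],
    "extra_body": {"enable_thinking": True},
}
```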
Q5: Why is the finish_reason in the response length instead of stop?
A: A finish_reason of length indicates that the model was forcefully truncated before it could generate a complete response. This usually occurs because your max_tokens parameter is set too low, or the combined total of input tokens and generated tokens has exceeded the maximum context window limit allowed by the current model. To resolve this, try increasing the max_tokens value.
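One possible retry strategy is sketched below; `call_api` is a hypothetical wrapper around the HTTP request, and the doubling/cap policy is an illustrative choice, not part of the API:

```python
# Sketch: retry with a larger token budget when the response was
# truncated (finish_reason == "length"), up to a fixed cap.
def complete_with_retry(call_api, body, max_budget=4096):
    while True:
        resp = call_api(body)
        if resp["choices"][0]["finish_reason"] != "length":
            return resp  # completed normally (e.g., "stop")
        new_limit = min(body.get("max_tokens", 256) * 2, max_budget)
        if new_limit == body.get("max_tokens"):
            return resp  # already at the cap; return the truncated result
        body["max_tokens"] = new_limit
```

Doubling only helps when the truncation was caused by max_tokens; if the context window itself is exhausted, the input must be shortened instead.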