Create Chat Completion (Streaming)
Official Documentation Reference: OpenAI Chat Create
Endpoint Description
Given a prompt, the model will return one or more predicted chat completions. This endpoint supports both standard and streaming (SSE) responses, and can also return the log probabilities of alternative tokens at each position.
Request Details
- Method: POST
- Endpoint: https://api.codingplanx.ai/v1/chat/completions
Headers
| Parameter | Required | Type | Example | Description |
|---|---|---|---|---|
| Content-Type | Yes | string | application/json | Data format |
| Accept | Yes | string | application/json | Accepted data format |
| Authorization | No* | string | Bearer {{YOUR_API_KEY}} | Authentication token (typically required for actual API calls) |
| X-Forwarded-Host | No | string | localhost:5173 | Proxy hostname (typically does not need to be passed manually) |
Request Body
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | ID of the model to use. See the model endpoint compatibility table for details on which models work with the Chat API. |
| messages | array | Yes | A list of messages comprising the conversation so far. Each message contains role and content fields. |
| tools | array | No | A list of tools the model may call. Currently, only functions are supported as a tool. |
| tool_choice | string/object | No | Controls which (if any) function is called by the model. none means the model will not call a function, auto means the model chooses automatically, or you can force a specific function. |
| extra_body | object | No | Additional parameters. Contains enable_thinking (boolean) to set whether to enable thinking mode (requires model support). |
| temperature | number | No | Sampling temperature, between 0 and 2. Higher values like 0.8 make the output more random, while lower values like 0.2 make it more focused and deterministic. We generally recommend altering this or top_p but not both. |
| top_p | number | No | Nucleus sampling parameter. 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both. |
| n | integer | No | How many chat completion choices to generate for each input message. Defaults to 1. |
| stream | boolean | No | Whether to enable streaming output. If set to true, partial message deltas are sent as Server-Sent Events (SSE) and the stream terminates with a data: [DONE] message. |
| stream_options | object | No | Options for streaming responses. Setting include_usage to true adds a final chunk containing token usage statistics. Only set this when stream is true. |
| stop | string/array | No | Up to 4 sequences where the API will stop generating further tokens. Defaults to null. |
| max_tokens | integer | No | The maximum number of tokens to generate in the chat completion. The total length of input tokens and generated tokens is limited by the model's context length. |
| presence_penalty | number | No | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
| frequency_penalty | number | No | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
| logit_bias | object | No | Modify the likelihood of specified tokens appearing in the completion. Accepts a JSON object that maps tokens (specified by their token ID) to an associated bias value from -100 to 100. |
| user | string | No | A unique identifier representing your end-user, which can help monitor and detect abuse. |
| response_format | object | No | Specifies the format that the model must output. Setting to { "type": "json_object" } enables JSON mode. Note: when using JSON mode, you must also instruct the model to produce JSON via a system or user message, otherwise it may generate an endless stream of whitespace. |
| seed | integer | No | Sets the seed for deterministic sampling (Beta feature). |
Request Example
```json
{
  "model": "gpt-5-mini",
  "max_tokens": 1000,
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Hello"
    }
  ],
  "temperature": 1.0,
  "stream": true,
  "stream_options": {
    "include_usage": true
  }
}
```
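As a rough sketch, the headers and body above could be assembled in Python's standard library before sending; `build_chat_request` is a hypothetical helper and `YOUR_API_KEY` a placeholder, not part of the API itself:

```python
import json

# Hypothetical helper that builds the headers and JSON body documented
# above for POST /v1/chat/completions. It does not perform the HTTP call.
def build_chat_request(api_key, user_text, stream=True):
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "Authorization": f"Bearer {api_key}",  # see Headers table above
    }
    body = {
        "model": "gpt-5-mini",
        "max_tokens": 1000,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_text},
        ],
        "temperature": 1.0,
        "stream": stream,
        "stream_options": {"include_usage": True},
    }
    return headers, json.dumps(body)

headers, payload = build_chat_request("YOUR_API_KEY", "Hello")
```

The returned `payload` can then be posted to the endpoint with any HTTP client; with `stream: true` the response must be consumed as an SSE stream rather than a single JSON document.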
Response Details
- Content-Type: application/json (non-streaming) or text/event-stream (when stream: true)
Response Parameters (Non-streaming Structure)
| Parameter | Type | Description |
|---|---|---|
| id | string | A unique identifier for the chat completion. |
| object | string | The object type: chat.completion for non-streaming responses, or chat.completion.chunk for streaming chunks. |
| created | integer | The Unix timestamp (in seconds) of when the chat completion was created. |
| choices | array | A list of chat completion choices. |
| choices[].index | integer | The index of the choice in the list of choices. |
| choices[].message | object | A chat completion message generated by the model, containing role and content. In streaming responses, this field is named delta. |
| choices[].finish_reason | string | The reason the model stopped generating tokens (e.g., stop, length). |
| usage | object | Usage statistics for the completion request. |
| usage.prompt_tokens | integer | Number of tokens in the prompt. |
| usage.completion_tokens | integer | Number of tokens generated by the model. |
| usage.total_tokens | integer | Total number of tokens used in the request (prompt + completion). |
Response Example (Non-streaming / Aggregated Streaming Result)
```json
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1677652288,
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "\n\nHello there, how may I assist you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 12,
    "total_tokens": 21
  }
}
```
(Note: If stream: true is specified in the request, the response body will be a stream of Server-Sent Events (SSE) separated by newlines, with each line starting with data: {...} and finally terminating with data: [DONE].)
FAQs (Frequently Asked Questions)
Q1: How do I parse a streaming response when stream: true is enabled?
A: When stream: true is enabled, the API returns data via the Server-Sent Events (SSE) protocol. The client must read the response stream line by line, extract the lines beginning with the data: prefix, and parse the remaining text as JSON. Please note that the final message of the stream is always data: [DONE], which signals the end of the stream; ensure your parser filters out this termination marker rather than attempting to parse it as JSON.
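The parsing steps above can be sketched as follows, assuming each event arrives as a `data: {...}` line and the delta/chunk shape shown in the response tables; the sample lines are illustrative, not real API output:

```python
import json

# Sketch of an SSE parser for this endpoint: yields each JSON chunk and
# stops at the 'data: [DONE]' terminator instead of parsing it.
def parse_sse_lines(lines):
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker, not JSON
        yield json.loads(data)

# Aggregate streamed delta fragments into the full assistant message.
sample = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    '',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    'data: [DONE]',
]
text = "".join(
    chunk["choices"][0]["delta"].get("content", "")
    for chunk in parse_sse_lines(sample)
)
# text == "Hello"
```

In a real client, `lines` would be the HTTP response body iterated line by line (e.g., `resp.iter_lines()` in many HTTP libraries); note that the `delta` of the first chunk may contain only a `role` and no `content`, which is why `.get("content", "")` is used.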
Q2: Why does my request hang or return an infinite blank stream after setting response_format to JSON?
A: When you enable JSON mode by using { "type": "json_object" }, you must explicitly instruct the model to "output in JSON format" within the messages array (typically in the system prompt). Without this explicit text instruction, the model may get stuck in a loop generating endless whitespace characters until it triggers the max_tokens limit.
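A minimal request body following this advice might look like the sketch below; the model name is reused from the example above, and the message wording is only one way to phrase the required instruction:

```python
import json

# JSON-mode request body: response_format enables JSON mode, and the
# system message explicitly asks for JSON, as required above.
body = {
    "model": "gpt-5-mini",
    "response_format": {"type": "json_object"},
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant. Respond only with valid JSON.",
        },
        {
            "role": "user",
            "content": "List three primary colors as a JSON array under the key 'colors'.",
        },
    ],
}
payload = json.dumps(body)
```

Setting a reasonable max_tokens alongside this is also prudent, so that even a misbehaving generation terminates quickly.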
Q3: Can I adjust both temperature and top_p parameters at the same time?
A: The official documentation strongly advises against adjusting both parameters simultaneously. Both are mechanisms designed to control the randomness of the model's output. If you want a more stable and predictable output, lower the temperature or decrease top_p; if you desire a more creative response, increase one of them. Modifying both at the same time can lead to unpredictable and suboptimal results.
Q4: What is the purpose of the extra_body.enable_thinking parameter?
A: This is a specialized extension parameter for this endpoint. When passing {"enable_thinking": true}, if the backend model being used supports "Thinking Mode" or "Chain of Thought (CoT)" (such as certain reasoning-enhanced LLMs), the model will initiate a deep thinking process before outputting the final response. If the current model does not support this feature, the parameter is typically ignored.
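A request enabling this flag might be shaped as follows; this is a sketch of the documented parameter, with placeholder model and prompt:

```python
# Sketch: passing extra_body.enable_thinking as described above.
# If the backend model does not support thinking mode, the flag is
# typically ignored rather than rejected.
body = {
    "model": "gpt-5-mini",  # placeholder; use a reasoning-capable model
    "messages": [
        {"role": "user", "content": "Prove that 17 is a prime number."},
    ],
    "extra_body": {"enable_thinking": True},
}
```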
Q5: Why is the finish_reason in the response length instead of stop?
A: A finish_reason of length indicates that the model was forcefully truncated before it could generate a complete response. This usually occurs because your max_tokens parameter is set too low, or the combined total of input tokens and generated tokens has exceeded the maximum context window limit allowed by the current model. To resolve this, try increasing the max_tokens value.
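One possible retry strategy is sketched below; `call_api` is a hypothetical wrapper around the HTTP request, and the doubling/cap policy is an illustrative choice, not part of the API:

```python
# Sketch: retry with a larger token budget when the response was
# truncated (finish_reason == "length"), up to a fixed cap.
def complete_with_retry(call_api, body, max_budget=4096):
    while True:
        resp = call_api(body)
        if resp["choices"][0]["finish_reason"] != "length":
            return resp  # completed normally (e.g., "stop")
        new_limit = min(body.get("max_tokens", 256) * 2, max_budget)
        if new_limit == body.get("max_tokens"):
            return resp  # already at the cap; return the truncated result
        body["max_tokens"] = new_limit
```

Doubling only helps when the truncation was caused by max_tokens; if the context window itself is exhausted, the input must be shortened instead.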