Create Chat Completion (Streaming)

Official Documentation Reference: OpenAI Chat Create

Endpoint Description

Given a prompt, the model will return one or more predicted chat completions. This endpoint supports both standard and streaming (SSE) responses, and can also return the log probabilities of alternative tokens at each position.

Request Details

  • Method: POST
  • Endpoint: https://api.codingplanx.ai/v1/chat/completions

Headers

| Parameter | Required | Type | Example | Description |
| --- | --- | --- | --- | --- |
| Content-Type | Yes | string | application/json | Data format |
| Accept | Yes | string | application/json | Accepted data format |
| Authorization | No* | string | Bearer {{YOUR_API_KEY}} | Authentication token (typically required for actual API calls) |
| X-Forwarded-Host | No | string | localhost:5173 | Proxy hostname (typically does not need to be set manually) |

Request Body

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | ID of the model to use. See the model endpoint compatibility table for details on which models work with the Chat API. |
| messages | array | Yes | A list of messages comprising the conversation so far. Each message contains role and content fields. |
| tools | array | No | A list of tools the model may call. Currently, only functions are supported as a tool. |
| tool_choice | string/object | No | Controls which (if any) function is called by the model. none means the model will not call a function, auto means the model will automatically choose, or you can force a specific function. |
| extra_body | object | No | Additional parameters. Contains enable_thinking (boolean) to set whether to enable thinking mode (requires model support). |
| temperature | number | No | Sampling temperature, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both. |
| top_p | number | No | Nucleus sampling parameter. 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both. |
| n | integer | No | How many chat completion choices to generate for each input message. Defaults to 1. |
| stream | boolean | No | Whether to enable streaming output. If set to true, partial message deltas will be sent as Server-Sent Events (SSE) and the stream will terminate with a data: [DONE] message. |
| stop | string/array | No | Up to 4 sequences where the API will stop generating further tokens. Defaults to null. |
| max_tokens | integer | No | The maximum number of tokens to generate in the chat completion. The total length of input tokens and generated tokens is limited by the model's context length. |
| presence_penalty | number | No | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
| frequency_penalty | number | No | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
| logit_bias | object | No | Modify the likelihood of specified tokens appearing in the completion. Accepts a JSON object that maps tokens (specified by their token ID) to an associated bias value from -100 to 100. |
| user | string | No | A unique identifier representing your end-user, which can help monitor and detect abuse. |
| response_format | object | No | Specifies the format that the model must output. Setting to { "type": "json_object" } enables JSON mode. Note: When using JSON mode, you must also instruct the model to produce JSON yourself via a system or user message, otherwise it may generate an endless stream of whitespace. |
| seed | integer | No | Sets the seed for deterministic sampling (Beta feature). |

Request Example

{
  "model": "gpt-5-mini",
  "max_tokens": 1000,
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Hello"
    }
  ],
  "temperature": 1.0,
  "stream": true,
  "stream_options": {
    "include_usage": true
  }
}
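As a sketch, the request above can be assembled and sent from Python. The endpoint URL, header set, and body fields are taken from this page; the API key is a placeholder, and any HTTP client can be used to send the result.

```python
import json

# Endpoint documented on this page
API_URL = "https://api.codingplanx.ai/v1/chat/completions"

def build_chat_request(api_key, messages, model="gpt-5-mini", stream=True):
    """Assemble the headers and JSON body matching the request example above."""
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    body = {
        "model": model,
        "max_tokens": 1000,
        "messages": messages,
        "temperature": 1.0,
        "stream": stream,
        # include_usage asks the server to append a final usage chunk to the stream
        "stream_options": {"include_usage": True},
    }
    return headers, json.dumps(body)

headers, payload = build_chat_request(
    "YOUR_API_KEY",
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello"},
    ],
)
# The pair can then be sent with any HTTP client, e.g.:
# urllib.request.Request(API_URL, data=payload.encode(), headers=headers, method="POST")
```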

Response Details

  • Content-Type: application/json (Non-streaming) or text/event-stream (When stream: true)

Response Parameters (Non-streaming Structure)

| Parameter | Type | Description |
| --- | --- | --- |
| id | string | A unique identifier for the chat completion. |
| object | string | The object type: chat.completion for non-streaming responses, or chat.completion.chunk for streamed chunks. |
| created | integer | The Unix timestamp (in seconds) of when the chat completion was created. |
| choices | array | A list of chat completion choices. |
| choices[].index | integer | The index of the choice in the list of choices. |
| choices[].message | object | A chat completion message generated by the model, containing role and content. In streaming responses this field is named delta. |
| choices[].finish_reason | string | The reason the model stopped generating tokens (e.g., stop, length). |
| usage | object | Usage statistics for the completion request. |
| usage.prompt_tokens | integer | Number of tokens in the prompt. |
| usage.completion_tokens | integer | Number of tokens generated by the model. |
| usage.total_tokens | integer | Total number of tokens used in the request (prompt + completion). |

Response Example (Non-streaming / Aggregated Streaming Result)

{
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "created": 1677652288,
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "\n\nHello there, how may I assist you today?"
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 9,
        "completion_tokens": 12,
        "total_tokens": 21
    }
}

(Note: If stream: true is specified in the request, the response body will be a stream of Server-Sent Events (SSE): each event is a line starting with data: {...}, events are separated by blank lines, and the stream terminates with a final data: [DONE] message.)
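For illustration, a streamed response body has roughly the following shape (IDs and timestamps here are made up; each chunk carries a delta rather than a full message):

```
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1677652288,"choices":[{"index":0,"delta":{"role":"assistant","content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1677652288,"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
```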


FAQs (Frequently Asked Questions)

Q1: How do I parse a streaming response when stream: true is enabled? A: When stream: true is enabled, the API returns data via the Server-Sent Events (SSE) protocol. The client must read the response stream line by line, pick out the lines starting with the data: prefix, and parse the text after the prefix as JSON. Note that the final message of the stream is always data: [DONE], which signals the end of the stream; ensure your parser filters out this termination marker rather than attempting to parse it as JSON.
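A minimal parser sketch along these lines (the sample chunks below are hand-written for illustration; in practice `lines` would be an HTTP response iterated line by line):

```python
import json

def iter_sse_chunks(lines):
    """Yield parsed JSON chunks from an SSE body, skipping the [DONE] marker."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker, not JSON
        yield json.loads(data)

# Hand-written sample chunks in the delta shape described above
sample = [
    'data: {"choices":[{"index":0,"delta":{"content":"Hel"}}]}',
    '',
    'data: {"choices":[{"index":0,"delta":{"content":"lo"}}]}',
    '',
    'data: [DONE]',
]
text = "".join(
    chunk["choices"][0]["delta"].get("content", "")
    for chunk in iter_sse_chunks(sample)
)
print(text)  # Hello
```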

Q2: Why does my request hang or return an infinite blank stream after setting response_format to JSON? A: When you enable JSON mode by using { "type": "json_object" }, you must explicitly instruct the model to "output in JSON format" within the messages array (typically in the system prompt). Without this explicit text instruction, the model may get stuck in a loop generating endless whitespace characters until it triggers the max_tokens limit.
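A minimal JSON-mode payload sketch (model name illustrative); the explicit "respond in JSON" instruction in the system message is what prevents the whitespace loop:

```python
import json

payload = {
    "model": "gpt-5-mini",  # illustrative model name
    "response_format": {"type": "json_object"},  # enables JSON mode
    "messages": [
        # Explicit JSON instruction required alongside response_format
        {"role": "system", "content": "You are a helpful assistant. Respond in JSON."},
        {"role": "user", "content": "List three primary colors."},
    ],
}
print(json.dumps(payload, indent=2))
```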

Q3: Can I adjust both temperature and top_p parameters at the same time? A: The official documentation strongly advises against adjusting both parameters simultaneously. Both are mechanisms designed to control the randomness of the model's output. If you want a more stable and predictable output, lower the temperature or decrease top_p; if you desire a more creative response, increase one of them. Modifying both at the same time can lead to unpredictable and suboptimal results.

Q4: What is the purpose of the extra_body.enable_thinking parameter? A: This is a specialized extension parameter for this endpoint. When passing {"enable_thinking": true}, if the backend model being used supports "Thinking Mode" or "Chain of Thought (CoT)" (such as certain reasoning-enhanced LLMs), the model will initiate a deep thinking process before outputting the final response. If the current model does not support this feature, the parameter is typically ignored.
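A payload sketch showing where the extension parameter goes (model name illustrative; the flag is ignored by models without thinking-mode support):

```python
payload = {
    "model": "gpt-5-mini",  # illustrative; must be a model that supports thinking mode
    "messages": [{"role": "user", "content": "Solve step by step: 17 * 24"}],
    # Extension parameter described above, nested under extra_body
    "extra_body": {"enable_thinking": True},
}
print(payload["extra_body"])
```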

Q5: Why is the finish_reason in the response length instead of stop? A: A finish_reason of length indicates that the model was forcefully truncated before it could generate a complete response. This usually occurs because your max_tokens parameter is set too low, or the combined total of input tokens and generated tokens has exceeded the maximum context window limit allowed by the current model. To resolve this, try increasing the max_tokens value.
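A small sketch of checking for truncation in client code, using the response shape documented above:

```python
def truncated_by_token_limit(response):
    """Return True when any choice stopped with finish_reason == "length"."""
    return any(
        choice.get("finish_reason") == "length"
        for choice in response.get("choices", [])
    )

truncated = {"choices": [{"index": 0, "finish_reason": "length"}]}
complete = {"choices": [{"index": 0, "finish_reason": "stop"}]}
print(truncated_by_token_limit(truncated), truncated_by_token_limit(complete))  # True False
```

When this check fires, retry with a larger max_tokens value or shorten the prompt.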