Create Chat Vision API (Streaming / Non-Streaming)

This API supports multimodal inputs, allowing users to send text and image URLs within a conversation. The model will generate appropriate responses based on the provided prompts and image content. It supports streaming output to provide a smoother, more interactive experience.


Request Parameters

Header Parameters

| Parameter | Required | Type | Example | Description |
| --- | --- | --- | --- | --- |
| Content-Type | Yes | string | application/json | Request body format |
| Accept | Yes | string | application/json | Response format |
| Authorization | Yes | string | Bearer {{YOUR_API_KEY}} | API key used to authenticate the request |

Body Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| model | string | Yes | - | ID of the model to use (e.g., gpt-4o, gpt-4-vision-preview). |
| messages | array | Yes | - | A list of messages comprising the conversation so far. See the messages structure below. |
| stream | boolean | No | false | Whether to enable streaming output. If enabled, partial message deltas are sent via Server-Sent Events (SSE). |
| temperature | number | No | 1 | Sampling temperature (0-2). Higher values make output more random; lower values make it more deterministic. |
| top_p | number | No | 1 | Nucleus sampling probability. It is recommended to alter this or temperature, but not both. |
| max_tokens | integer | No | inf | The maximum number of tokens to generate in the chat completion. |
| n | integer | No | 1 | How many chat completion choices to generate for each input message. |
| presence_penalty | number | No | 0 | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
| frequency_penalty | number | No | 0 | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
| response_format | object | No | - | Specifies the output format, e.g., {"type": "json_object"}. |
| stop | string/array | No | null | Up to 4 sequences where the API will stop generating further tokens. |
| user | string | No | - | A unique identifier representing your end user, which can help monitor and detect abuse. |

messages Object Structure

Each item in the messages array contains the following fields:

  • role: (string) The role of the message's author. One of system, user, assistant, or tool.
  • content: (string or array) The contents of the message.
    • In Vision mode, content is an array of objects, each with a type of either text or image_url.
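As a concrete sketch, a Vision-mode user message can be assembled in Python like this (`build_vision_message` is a hypothetical helper, not part of the API):

```python
def build_vision_message(text, image_urls):
    """Build a user message mixing a text prompt with image URL parts."""
    # The first content part carries the text prompt.
    content = [{"type": "text", "text": text}]
    # Each image URL becomes its own image_url content part.
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return {"role": "user", "content": content}

msg = build_vision_message(
    "What is in this image? Please describe it in detail.",
    ["https://example.com/image.png"],
)
```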

Request Example

Mixed Text and Vision Request (JSON)

```json
{
    "model": "gpt-4o",
    "messages": [
        {
            "role": "system",
            "content": "You are a professional image analysis assistant."
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this image? Please describe it in detail."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.png"
                    }
                }
            ]
        }
    ],
    "stream": true
}
```
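A request like the one above can be issued from Python with the standard library alone. The base URL and the `/v1/chat/completions` path are assumptions for an OpenAI-compatible endpoint; substitute your provider's actual values. This sketch builds a non-streaming variant and stops short of executing the network call:

```python
import json
import urllib.request

API_BASE = "https://api.example.com"  # assumption: replace with your provider's base URL
API_KEY = "YOUR_API_KEY"              # placeholder: use your real key

payload = {
    "model": "gpt-4o",
    "messages": [
        {"role": "system", "content": "You are a professional image analysis assistant."},
        {"role": "user", "content": [
            {"type": "text", "text": "What is in this image? Please describe it in detail."},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}},
        ]},
    ],
    "stream": False,  # set to True to receive an SSE stream instead
}

request = urllib.request.Request(
    API_BASE + "/v1/chat/completions",  # assumption: OpenAI-compatible path
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Accept": "application/json",
        "Authorization": "Bearer " + API_KEY,
    },
    method="POST",
)

# To actually send it (not executed here):
# with urllib.request.urlopen(request) as resp:
#     body = json.loads(resp.read())
```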

Response Explanation

Non-Streaming Response (stream: false)

| Parameter | Type | Description |
| --- | --- | --- |
| id | string | A unique identifier for the chat completion. |
| object | string | The object type, which is always chat.completion. |
| created | integer | The Unix timestamp (in seconds) of when the chat completion was created. |
| choices | array | A list of chat completion choices generated by the model. |
| choices[n].message | object | A chat completion message generated by the model (contains role and content). |
| choices[n].finish_reason | string | The reason the model stopped generating tokens (e.g., stop, length). |
| usage | object | Usage statistics for the completion request. |

Response Example

```json
{
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "created": 1677652288,
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "This image shows a tranquil lake with a backdrop of rolling mountains."
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 120,
        "completion_tokens": 35,
        "total_tokens": 155
    }
}
```
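As a sketch, the assistant's reply and the token usage can be pulled out of the parsed response body like this (`parse_completion` is a hypothetical helper; the example response from this section is reused as a plain dict):

```python
def parse_completion(response):
    """Extract the reply, stop reason, and token usage from a parsed response dict."""
    choice = response["choices"][0]
    return {
        "content": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "total_tokens": response.get("usage", {}).get("total_tokens"),
    }

example = {
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "created": 1677652288,
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "This image shows a tranquil lake with a backdrop of rolling mountains.",
            },
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 120, "completion_tokens": 35, "total_tokens": 155},
}

result = parse_completion(example)
```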

Streaming Response (stream: true)

When stream is set to true, the API returns a text/event-stream. Each line begins with data: followed by a JSON string. The stream is terminated by a data: [DONE] message.
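A minimal sketch of consuming such a stream in Python, assuming each chunk follows the OpenAI-compatible shape where partial text lives in `choices[0].delta.content` (`iter_stream_content` is a hypothetical helper; the sample lines below are illustrative, not real API output):

```python
import json

def iter_stream_content(lines):
    """Yield content deltas from 'data: ...' SSE lines, stopping at [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # stream terminator
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            yield delta["content"]

# Illustrative sample of what the wire format looks like:
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hello"}}]}',
    'data: {"choices":[{"delta":{"content":" world"}}]}',
    "data: [DONE]",
]
text = "".join(iter_stream_content(sample))  # → "Hello world"
```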


FAQs (Frequently Asked Questions)

Q1: How can I upload local images instead of using URLs?
A: This API primarily supports image URLs. If you only have local images, it is recommended to upload them to an image hosting service or cloud storage (e.g., OSS) first. Alternatively, you can encode the image as a Base64 string, format it as data:image/jpeg;base64,{base64_encoded_data}, and pass that string in the url field.
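The Base64 approach can be sketched as follows (`to_data_url` is a hypothetical helper; whether your provider accepts data URLs in the url field is an assumption to verify):

```python
import base64

def to_data_url(image_bytes, mime="image/jpeg"):
    """Encode raw image bytes as a data URL suitable for the image_url.url field."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"

# Usage with a file on disk (not executed here):
# with open("photo.png", "rb") as f:
#     url = to_data_url(f.read(), mime="image/png")
```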

Q2: Why are there no usage statistics in my streaming response?
A: In standard OpenAI-compatible streaming protocols, usage is typically returned only in the final data chunk, or when explicitly requested via the stream_options parameter (e.g., {"include_usage": true}). Check whether your model version supports returning token counts within a stream.

Q3: What are the size and format limits for image recognition?
A: Generally, PNG, JPEG, WEBP, and non-animated GIF formats are supported. It is recommended that image files not exceed 20 MB. For optimal recognition results, a resolution of 512x512 or higher is advised.

Q4: What should I do if my request returns a "401 Unauthorized" error?
A: Ensure that you are passing the Authorization header correctly: the value must be Bearer, followed by a space and your API key. Also verify that the key is valid and that your account balance is sufficient.

Q5: Does the temperature parameter affect image recognition?
A: Yes. The temperature affects the linguistic creativity of the model when describing an image. For highly objective, rigorous descriptions, use a lower value (e.g., 0.2); for more vivid and engaging descriptions, use a higher value (e.g., 0.8).

Q6: How do I process multiple images in a single request?
A: Place multiple objects of type image_url into the user message's content array. The model will attempt to understand the context and content of all provided images simultaneously.
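For example, a single user message carrying two hypothetical image URLs alongside a text prompt:

```python
# Two illustrative image URLs plus one text prompt in a single user message.
urls = ["https://example.com/before.png", "https://example.com/after.png"]

content = [{"type": "text", "text": "Compare these two images."}]
content += [{"type": "image_url", "image_url": {"url": u}} for u in urls]

message = {"role": "user", "content": content}
```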