API Documentation: Audio Transcription
This document provides the API specifications for transcribing audio files into text. Powered by advanced speech recognition models (such as gpt-4o-transcribe), this API supports multiple audio input formats and allows customization of output formats and recognition languages.
Official Reference Documentation: OpenAI Speech-to-Text Guides
1. Basic Information
- API Name: Audio Transcription (gpt-4o-transcribe)
- HTTP Method: POST
- Request URL: https://api.codingplanx.ai/v1/audio/transcriptions
2. Request Headers
| Parameter | Required | Type | Example Value | Description |
|---|---|---|---|---|
| Content-Type | No | string | multipart/form-data | Declares the request body format for form file uploads. Most HTTP clients set this header (including the multipart boundary) automatically when sending form data. |
| Authorization | Yes | string | Bearer YOUR_API_KEY | The API key used to authenticate the request. |
3. Request Body
The request body must be sent in multipart/form-data format:
| Parameter | Required | Type | Default | Description |
|---|---|---|---|---|
| file | Yes | file | - | The audio file object (not the file name) to transcribe.<br>Supported formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm. |
| model | Yes | string | - | ID of the model to use.<br>Available models: gpt-4o-transcribe, gpt-4o-mini-transcribe, whisper-1, etc. |
| language | No | string | - | The language of the input audio, supplied in ISO-639-1 format (e.g., zh for Chinese, en for English). Providing this parameter is highly recommended: it significantly improves recognition accuracy and reduces latency. |
| prompt | No | string | - | Optional text to guide the model's style or to continue a previous audio segment. The prompt should be in the same language as the audio. |
| response_format | No | string | json | The format of the transcription output.<br>Allowed values: json or text. |
| temperature | No | number | 0 | The sampling temperature, between 0 and 1.<br>Higher values (e.g., 0.8) make the output more random, while lower values (e.g., 0.2) make it more focused and deterministic. If set to 0, the model uses log probability to automatically increase the temperature until certain thresholds are hit. |
4. Response Specifications
4.1 Response Parameters (JSON Format)
When response_format is set to json (default), the HTTP status code is 200 OK, and the returned JSON structure is as follows:
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | The transcribed text content recognized by the model from the audio. |
4.2 Successful Response Examples
Example 1: English Transcription
{
  "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that."
}
Example 2: Chinese Transcription
{
  "text": "一二三四五六七八九十"
}
5. Code Snippets
cURL
# Do not set a Content-Type header manually: with --form, curl generates
# multipart/form-data with the correct boundary automatically.
curl --location --request POST 'https://api.codingplanx.ai/v1/audio/transcriptions' \
--header 'Authorization: Bearer YOUR_API_KEY' \
--form 'file=@"/path/to/your/audio/test.m4a"' \
--form 'model="gpt-4o-transcribe"' \
--form 'response_format="json"'
Python (Requests)
import requests

url = "https://api.codingplanx.ai/v1/audio/transcriptions"
headers = {
    "Authorization": "Bearer YOUR_API_KEY"
}
payload = {
    "model": "gpt-4o-transcribe",
    "response_format": "json",
    "language": "en"
}

# Open the audio file in binary mode; requests builds the multipart body
# and sets the Content-Type header (including the boundary) automatically.
with open("/path/to/your/audio/test.m4a", "rb") as audio_file:
    files = {
        "file": ("test.m4a", audio_file, "audio/mp4")
    }
    response = requests.post(url, headers=headers, data=payload, files=files)

response.raise_for_status()
print(response.json())
6. Frequently Asked Questions (FAQs)
Q1: What audio file formats does the API support?
A: Currently, supported audio formats include: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, and webm. Please ensure your file format matches the file extension.
Q2: Why is the transcription sometimes slow, or the first few words are inaccurate?
A: This usually happens because the model is trying to auto-detect the audio language. It is highly recommended to pass the language parameter in your request (using the ISO-639-1 standard, e.g., zh for Chinese, en for English). This not only reduces processing latency but also significantly improves transcription accuracy.
Q3: How can I make the model recognize specific proper nouns or industry jargon?
A: You can use the prompt parameter. Pass the proper nouns, names, or specific punctuation styles you want the model to accurately recognize as a text string into the prompt. The model will reference this context and style during transcription. Note: The language of the prompt must match the language of the audio.
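The prompt travels as an ordinary multipart form field alongside the audio file. The sketch below builds (but does not send) such a request with the requests library so the field layout is visible; the jargon terms and file bytes are placeholders, not real values.

```python
import requests

# Build (but do not send) a transcription request that includes a prompt.
# The jargon list is a placeholder; substitute terms from your own domain.
req = requests.Request(
    "POST",
    "https://api.codingplanx.ai/v1/audio/transcriptions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    data={
        "model": "gpt-4o-transcribe",
        "language": "en",
        "prompt": "Kubernetes, OAuth, gRPC, CodingPlanX",
    },
    files={"file": ("test.m4a", b"fake-audio-bytes", "audio/mp4")},
)
prepared = req.prepare()

# The prompt is sent as a plain multipart form field next to the file.
print(prepared.headers["Content-Type"])  # multipart/form-data; boundary=...
print(b'name="prompt"' in prepared.body)  # True
```

To actually send the request, replace the placeholder bytes with an open file handle and call `requests.post(...)` as shown in Section 5.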
Q4: What is the difference between choosing json and text for response_format?
A:
- json (default): The API returns a standard JSON object {"text": "transcribed text"}, which is the easiest to parse in most backend applications.
- text: The API returns the plain text string directly (Content-Type: text/plain) with no surrounding JSON structure. This is suitable for lightweight scripts that only need the raw text content.
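Client code that supports both formats only needs to branch on which one was requested; a minimal sketch:

```python
import json

def extract_text(body: str, response_format: str) -> str:
    # json format: body is a JSON object like {"text": "..."}.
    # text format: body is already the raw transcript.
    if response_format == "json":
        return json.loads(body)["text"]
    return body

print(extract_text('{"text": "hello world"}', "json"))  # hello world
print(extract_text("hello world", "text"))              # hello world
```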
Q5: How should I set the temperature parameter?
A: For most transcription scenarios, it is recommended to keep the default value of 0. Under this setting, the model prioritizes outputting the most logical and deterministic text. If your audio contains heavy background noise or non-standard pronunciation that causes "hallucinations" or missing words under default settings, you can try slightly increasing the temperature (e.g., 0.2 to 0.4) to give the model more flexibility in its predictions.
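One way to apply this advice programmatically is to start at temperature 0 and retry with slightly higher values only when the output looks degenerate. The sketch below is illustrative: `transcribe` stands for a hypothetical wrapper around the API call from Section 5, and the degeneracy heuristic is a simple placeholder you would tune for your own audio.

```python
def looks_degenerate(text: str) -> bool:
    # Placeholder heuristic: empty output, or one token repeated many times
    # (a common symptom of transcription "hallucination" loops).
    words = text.split()
    return len(words) == 0 or (len(words) >= 10 and len(set(words)) == 1)

def transcribe_with_fallback(transcribe, temperatures=(0, 0.2, 0.4)):
    # `transcribe` is a hypothetical callable that takes a temperature and
    # returns a transcript string (e.g., wrapping the requests.post call).
    text = ""
    for temp in temperatures:
        text = transcribe(temperature=temp)
        if not looks_degenerate(text):
            return text
    return text  # best effort: return the last attempt

# Demo with a stub that returns nothing at temperature 0:
def fake_transcribe(temperature):
    return "" if temperature == 0 else "hello from the demo"

print(transcribe_with_fallback(fake_transcribe))  # hello from the demo
```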