Audio-to-Text (Whisper-1) API Documentation
1. API Overview
This API provides speech recognition, converting audio files into accurate text transcripts. The underlying infrastructure supports whisper-1 and related models.
- Official Reference: OpenAI Speech-to-Text Guide
2. Request Information
- Base URL: https://api.codingplanx.ai
- Endpoint Path: /v1/audio/transcriptions
- HTTP Method: POST
- Content-Type: multipart/form-data
3. Request Parameters
3.1 Request Headers
| Parameter | Required | Type | Example Value | Description |
|---|---|---|---|---|
| Content-Type | No | string | multipart/form-data | Declares the request body data format. |
| Authorization | Yes | string | Bearer YOUR_API_KEY | Standard authentication header; include your API key. |
3.2 Request Body (multipart/form-data)
| Parameter | Required | Type | Example Value | Description |
|---|---|---|---|---|
| file | Yes | file | file://.../test.m4a | The audio file object (not the file name) to transcribe. Supported formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm. |
| model | Yes | string | whisper-1 | ID of the model to use. Currently available: whisper-1, gpt-4o-mini-transcribe. |
| language | No | string | zh | The language of the input audio. Supplying the input language in ISO-639-1 format (e.g., zh for Chinese, en for English) improves accuracy and reduces latency. |
| prompt | No | string | This is an English audio clip. | Optional text to guide the model's style or continue a previous audio segment. The prompt should match the audio language. |
| response_format | No | string | json | The format of the transcript output. Defaults to json.<br>Available options: json, text, srt, verbose_json, or vtt. |
| temperature | No | number | 0 | The sampling temperature, between 0 and 1. Defaults to 0.<br>Higher values like 0.8 make the output more random, while lower values like 0.2 make it more focused and deterministic. If set to 0, the model uses log probability to automatically increase the temperature until certain thresholds are hit. |
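The parameters above can be assembled client-side before sending. Below is a minimal Python sketch of that step; the helper name `build_transcription_form` is illustrative (not part of this API), and actual delivery would use an HTTP client such as the third-party `requests` library.

```python
API_URL = "https://api.codingplanx.ai/v1/audio/transcriptions"

def build_transcription_form(audio, model="whisper-1", language=None,
                             prompt=None, response_format="json",
                             temperature=None):
    """Assemble the multipart fields described in section 3.2.

    `audio` should be an open binary file object. Optional fields are
    included only when explicitly provided, so the API's own defaults
    apply otherwise.
    """
    files = {"file": audio}
    data = {"model": model, "response_format": response_format}
    if language is not None:
        data["language"] = language
    if prompt is not None:
        data["prompt"] = prompt
    if temperature is not None:
        data["temperature"] = str(temperature)
    return files, data
```

With `requests` installed, a call would then look roughly like `requests.post(API_URL, headers={"Authorization": "Bearer YOUR_API_KEY"}, files=files, data=data)`.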
4. Response Information
4.1 Response Data Structure
When response_format is set to the default json, the following JSON structure is returned:
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | The transcribed text generated from the audio. |
4.2 Response Example (HTTP 200 - Success)
{
"text": "12345678910"
}
(Note: If response_format is set to text, srt, or vtt, the API will directly return raw text or subtitle format text instead of a JSON object.)
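Because the response body is JSON for some formats and raw text for others, a client needs a small branch when reading it. A minimal sketch (the function name `extract_transcript` is illustrative, not part of the API):

```python
import json

def extract_transcript(body, response_format="json"):
    """Return the transcript text from a raw response body.

    json / verbose_json responses carry the text in a "text" field;
    text, srt, and vtt responses *are* the transcript and are
    returned as-is.
    """
    if response_format in ("json", "verbose_json"):
        return json.loads(body)["text"]
    return body
```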
5. Request Code Example (cURL)
curl --location --request POST 'https://api.codingplanx.ai/v1/audio/transcriptions' \
--header 'Authorization: Bearer <YOUR_API_KEY>' \
--header 'Content-Type: multipart/form-data' \
--form 'file=@"/C:/Users/Administrator/Desktop/test.m4a"' \
--form 'model="whisper-1"' \
--form 'language="zh"' \
--form 'response_format="json"'
6. Frequently Asked Questions (FAQs)
Q1: What is the maximum audio file size supported by this API?
A: Typically, the Whisper API limits a single audio file to a maximum of 25 MB. If your audio file is too large, it is recommended to compress the audio (e.g., convert to a lower bitrate mp3) or split the long audio into multiple smaller chunks for separate requests before calling the API.
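A quick way to plan the split is to compute how many pieces a file needs given the 25 MB limit. A minimal sketch (the helper name `chunks_needed` is illustrative; actual splitting of audio is format-dependent and not shown):

```python
import math

MAX_BYTES = 25 * 1024 * 1024  # documented per-file upload limit

def chunks_needed(size_bytes, limit=MAX_BYTES):
    """How many roughly equal pieces a file must be split into
    so that each piece fits under the upload limit."""
    return max(1, math.ceil(size_bytes / limit))
```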
Q2: Why does my audio transcription have typos or inaccurately recognized specific proper nouns?
A: Speech recognition models may not be sensitive enough to industry-specific jargon, names, or uncommon vocabulary. You can address this by passing the prompt parameter. For example, if you pre-enter these proper nouns in the prompt, the model will be more inclined to use your provided vocabulary during transcription.
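One simple way to seed the prompt with domain vocabulary is to join the proper nouns into a short sentence. A minimal sketch (the helper name `vocabulary_prompt` and the "Glossary:" phrasing are illustrative choices, and the prompt should be written in the same language as the audio):

```python
def vocabulary_prompt(terms):
    """Join proper nouns into a short prompt string that biases the
    model toward spelling these terms as provided."""
    return "Glossary: " + ", ".join(terms) + "."
```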
Q3: Can I directly generate video subtitle files through this API?
A: Yes. Simply set the response_format parameter to srt or vtt in your request. Upon successful processing, the API will directly return standard subtitle content with timestamps, which you can save directly as a .srt or .vtt file for use in video players.
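When saving the returned subtitle content, the file extension should match the requested format. A minimal sketch of that mapping (the helper name `subtitle_filename` is illustrative):

```python
def subtitle_filename(stem, response_format):
    """Map the API's subtitle formats to the conventional extension."""
    ext = {"srt": ".srt", "vtt": ".vtt"}.get(response_format)
    if ext is None:
        raise ValueError(f"not a subtitle format: {response_format}")
    return stem + ext
```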
Q4: What are the specific benefits of passing the language parameter?
A: Although the Whisper model can automatically detect the language, silence, noise, or an unclear opening in the audio may cause detection to take extra time or produce errors. Proactively providing the correct ISO-639-1 language code (e.g., zh for Chinese, ja for Japanese) will not only significantly reduce request latency but also improve transcription accuracy.
Q5: What causes the Unsupported file format error in the request?
A: This usually happens because the uploaded file's format is not in the supported list. Ensure your audio file is one of the following formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm. Also note that simply renaming the file extension (e.g., changing .avi to .mp4) will not work; the file must use a genuinely supported encoding.
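A client can pre-check the extension against the documented list before uploading. A minimal sketch (the helper name `is_supported` is illustrative; as noted above, a correct extension does not guarantee the underlying encoding is actually supported):

```python
SUPPORTED = {"flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}

def is_supported(filename):
    """Check the file extension against the documented format list.

    This catches obvious mistakes early, but only the server can
    verify the real encoding of the file contents.
    """
    return filename.rsplit(".", 1)[-1].lower() in SUPPORTED
```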