Audio-to-Text (Whisper-1) API Documentation

1. API Overview

This API provides speech recognition, converting audio files into accurate text transcripts. The underlying infrastructure supports whisper-1 and related models.


2. Request Information

  • Base URL: https://api.codingplanx.ai
  • Endpoint Path: /v1/audio/transcriptions
  • HTTP Method: POST
  • Content-Type: multipart/form-data

3. Request Parameters

3.1 Request Headers

| Parameter | Required | Type | Example Value | Description |
| --- | --- | --- | --- | --- |
| Content-Type | No | string | multipart/form-data | Declares the request body data format. |
| Authorization | Yes | string | Bearer YOUR_API_KEY | Standard authentication header. Include your API Key. |

3.2 Request Body (multipart/form-data)

| Parameter | Required | Type | Example Value | Description |
| --- | --- | --- | --- | --- |
| file | Yes | file | file://.../test.m4a | The audio file object (not the file name) to transcribe. Supported formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm. |
| model | Yes | string | whisper-1 | ID of the model to use. Currently available: whisper-1, gpt-4o-mini-transcribe. |
| language | No | string | zh | The language of the input audio. Supplying it in ISO-639-1 format (e.g., zh for Chinese, en for English) improves accuracy and decreases latency. |
| prompt | No | string | This is an English audio. | Optional text to guide the model's style or continue a previous audio segment. The prompt should match the audio language. |
| response_format | No | string | json | The format of the transcript output. Defaults to json. Available options: json, text, srt, verbose_json, or vtt. |
| temperature | No | number | 0 | The sampling temperature, between 0 and 1. Defaults to 0. Higher values like 0.8 make the output more random; lower values like 0.2 make it more focused and deterministic. If set to 0, the model uses log probability to automatically increase the temperature until certain thresholds are hit. |
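The constraints in the table above can be checked client-side before a request is sent. The sketch below is a minimal illustration; the function and constant names are our own, not part of the API:

```python
# Allowed values, per the request-body table above.
SUPPORTED_FORMATS = {"flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}
SUPPORTED_MODELS = {"whisper-1", "gpt-4o-mini-transcribe"}
SUPPORTED_RESPONSE_FORMATS = {"json", "text", "srt", "verbose_json", "vtt"}

def validate_transcription_params(filename, model, temperature=0.0, response_format="json"):
    """Validate request parameters locally; raise ValueError on the first problem."""
    ext = filename.rsplit(".", 1)[-1].lower()
    if ext not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported file format: .{ext}")
    if model not in SUPPORTED_MODELS:
        raise ValueError(f"Unknown model: {model}")
    if not 0 <= temperature <= 1:
        raise ValueError("temperature must be between 0 and 1")
    if response_format not in SUPPORTED_RESPONSE_FORMATS:
        raise ValueError(f"Unsupported response_format: {response_format}")
    return True
```

Running this check before upload avoids a round trip for requests that would be rejected anyway.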

4. Response Information

4.1 Response Data Structure

When response_format is set to the default json, the following JSON structure is returned:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | Yes | The transcribed text generated from the audio. |

4.2 Response Example (HTTP 200 - Success)

{
  "text": "12345678910"
}

(Note: If response_format is set to text, srt, or vtt, the API will directly return raw text or subtitle format text instead of a JSON object.)
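Because of this, a client must parse the body differently depending on the requested format. A minimal sketch (the helper name is ours, not part of the API):

```python
import json

def parse_transcription_response(body: str, response_format: str = "json") -> str:
    """Extract the transcript text from a raw response body."""
    if response_format in ("json", "verbose_json"):
        # JSON formats wrap the transcript in a "text" field.
        return json.loads(body)["text"]
    # text, srt, and vtt responses are raw text, not JSON.
    return body
```

For example, `parse_transcription_response('{"text": "12345678910"}')` returns the bare transcript string.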


5. Request Code Example (cURL)

curl --location --request POST 'https://api.codingplanx.ai/v1/audio/transcriptions' \
--header 'Authorization: Bearer <YOUR_API_KEY>' \
--form 'file=@"/C:/Users/Administrator/Desktop/test.m4a"' \
--form 'model="whisper-1"' \
--form 'language="zh"' \
--form 'response_format="json"'

(Note: do not set a Content-Type header manually when using --form. curl generates the multipart/form-data header itself, including the required boundary; overriding it without a boundary can cause the server to reject the body.)
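The same request can be issued from Python using only the standard library. The sketch below hand-builds the multipart/form-data body (the helper names are ours; the file path and API key are placeholders you must supply):

```python
import json
import urllib.request
import uuid

API_URL = "https://api.codingplanx.ai/v1/audio/transcriptions"

def build_multipart(fields: dict, file_field: str, filename: str, file_bytes: bytes):
    """Return (content_type, body) for a multipart/form-data request."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    # The file part carries the raw bytes plus a filename.
    parts.append(
        (f'--{boundary}\r\nContent-Disposition: form-data; '
         f'name="{file_field}"; filename="{filename}"\r\n'
         f'Content-Type: application/octet-stream\r\n\r\n').encode()
        + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return f"multipart/form-data; boundary={boundary}", b"".join(parts)

def transcribe(audio_path: str, api_key: str) -> str:
    """POST an audio file to the transcription endpoint and return the text."""
    with open(audio_path, "rb") as f:
        audio = f.read()
    content_type, body = build_multipart(
        {"model": "whisper-1", "language": "zh", "response_format": "json"},
        "file", audio_path.rsplit("/", 1)[-1], audio,
    )
    req = urllib.request.Request(
        API_URL, data=body, method="POST",
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": content_type},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text"]
```

A third-party client such as requests would set the boundary automatically; the manual version above shows what is actually on the wire.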

6. Frequently Asked Questions (FAQs)

Q1: What is the maximum audio file size supported by this API?

A: Typically, the Whisper API limits a single audio file to a maximum of 25 MB. If your audio file is too large, it is recommended to compress the audio (e.g., convert to a lower bitrate mp3) or split the long audio into multiple smaller chunks for separate requests before calling the API.
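A simple pre-flight size check can catch oversized files before upload (the 25 MB figure follows the answer above; the helper name is ours):

```python
import os

MAX_BYTES = 25 * 1024 * 1024  # 25 MB per-file limit, per the FAQ above

def check_audio_size(path: str) -> bool:
    """Return True if the file fits the limit; otherwise report and return False."""
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        print(f"{path} is {size / 1024 / 1024:.1f} MB; compress or split it first")
        # Splitting must respect the container format, e.g. with ffmpeg:
        #   ffmpeg -i long.mp3 -f segment -segment_time 600 -c copy out%03d.mp3
        return False
    return True
```

Note that splitting must be done with an audio tool (such as ffmpeg, as in the comment); cutting the raw bytes of an encoded file generally produces unplayable chunks.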

Q2: Why does my transcription contain typos or misrecognized proper nouns?

A: Speech recognition models may not be sensitive to industry jargon, names, or uncommon vocabulary. You can mitigate this with the prompt parameter: list the relevant proper nouns in the prompt, and the model will be more inclined to use that vocabulary during transcription.
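For instance, a prompt can be assembled from a glossary of expected terms (the glossary entries below are hypothetical examples, not part of the API):

```python
# Hypothetical proper nouns the audio is expected to contain.
JARGON = ["CodingPlanX", "Kubernetes", "LoRA"]

# Pass this string as the `prompt` form field alongside the audio file.
prompt = "The audio may mention: " + ", ".join(JARGON) + "."
```

Keep the prompt in the same language as the audio, as noted in the parameter table.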

Q3: Can I directly generate video subtitle files through this API?

A: Yes. You simply need to set the response_format parameter to srt or vtt in your request. Upon successful processing, the API will directly return standard subtitle file content with timestamps. You can save this directly as a .srt or .vtt file for use in video players.
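Saving such a response is then a direct write to disk. A minimal sketch (the helper name is ours):

```python
def save_subtitles(raw_body: str, out_path: str) -> str:
    """Write an srt/vtt response body straight to a subtitle file."""
    if not out_path.endswith((".srt", ".vtt")):
        raise ValueError("out_path should end with .srt or .vtt")
    # newline="" preserves the line endings exactly as returned by the API.
    with open(out_path, "w", encoding="utf-8", newline="") as f:
        f.write(raw_body)
    return out_path
```

The saved file can be loaded directly by most video players alongside the video.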

Q4: What are the specific benefits of passing the language parameter?

A: Although the Whisper model can automatically detect the language, if there is silence, noise, or an unclear language at the beginning of the audio, automatic detection might take extra time or result in errors. Proactively providing the correct ISO-639-1 language code (e.g., zh for Chinese, ja for Japanese) will not only significantly reduce request latency but also improve the accuracy of text transcription.

Q5: What causes the Unsupported file format error in the request?

A: This usually happens because the uploaded file format is not in the supported list. Please ensure your audio file format is one of the following: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm. Also, note that simply renaming the file extension (e.g., changing .avi to .mp4) will not work; it must be a genuinely supported encoding format.
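Because renaming the extension does not change the container, a quick magic-byte sniff can catch mismatches before upload. This is a partial sketch covering only a few common signatures (the helper name is ours):

```python
from typing import Optional

def sniff_audio_format(data: bytes) -> Optional[str]:
    """Guess the real container from its leading bytes; None if unrecognized."""
    if data.startswith(b"fLaC"):
        return "flac"
    if data.startswith(b"OggS"):
        return "ogg"
    if data.startswith(b"RIFF") and data[8:12] == b"WAVE":
        return "wav"
    if data[4:8] == b"ftyp":
        return "mp4"  # mp4 / m4a family share the ftyp box
    if data.startswith(b"ID3") or (len(data) > 1 and data[0] == 0xFF and data[1] & 0xE0 == 0xE0):
        return "mp3"  # ID3 tag or raw MPEG frame sync
    if data.startswith(b"\x1a\x45\xdf\xa3"):
        return "webm"  # EBML header (shared with mkv)
    return None
```

If the sniffed format disagrees with the file extension, re-encode the file with a real audio tool instead of renaming it.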