Audio-to-Text (Whisper-1) API Documentation
1. API Overview
This API provides speech recognition, converting audio files into accurate text transcripts. The underlying infrastructure supports whisper-1 and related models.
- Official Reference: OpenAI Speech-to-Text Guide
2. Request Information
- Base URL: https://api.codingplanx.ai
- Endpoint Path: /v1/audio/transcriptions
- HTTP Method: POST
- Content-Type: multipart/form-data
3. Request Parameters
3.1 Request Headers
| Parameter | Required | Type | Example Value | Description |
|---|---|---|---|---|
| Content-Type | No | string | multipart/form-data | Declares the request body data format. |
| Authorization | Yes | string | Bearer YOUR_API_KEY | Standard authentication header; include your API key. |
3.2 Request Body (multipart/form-data)
| Parameter | Required | Type | Example Value | Description |
|---|---|---|---|---|
| file | Yes | file | file://.../test.m4a | The audio file object (not the file name) to transcribe. Supported formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm. |
| model | Yes | string | whisper-1 | ID of the model to use. Currently available: whisper-1, gpt-4o-mini-transcribe. |
| language | No | string | zh | The language of the input audio. Supplying the input language in ISO-639-1 format (e.g., zh for Chinese, en for English) improves accuracy and reduces latency. |
| prompt | No | string | This is an English audio clip. | Optional text to guide the model's style or continue a previous audio segment. The prompt should match the audio language. |
| response_format | No | string | json | The format of the transcript output. Defaults to json.<br>Available options: json, text, srt, verbose_json, or vtt. |
| temperature | No | number | 0 | The sampling temperature, between 0 and 1. Defaults to 0.<br>Higher values like 0.8 make the output more random, while lower values like 0.2 make it more focused and deterministic. If set to 0, the model uses log probability to automatically increase the temperature until certain thresholds are hit. |
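The parameters above can be assembled client-side before sending. Below is a minimal Python sketch of that step; the helper name `build_transcription_form` is illustrative (not part of this API), and actual delivery would use an HTTP client such as the third-party `requests` library.

```python
API_URL = "https://api.codingplanx.ai/v1/audio/transcriptions"

def build_transcription_form(audio, model="whisper-1", language=None,
                             prompt=None, response_format="json",
                             temperature=None):
    """Assemble the multipart fields described in section 3.2.

    `audio` should be an open binary file object. Optional fields are
    included only when explicitly provided, so the API's own defaults
    apply otherwise.
    """
    files = {"file": audio}
    data = {"model": model, "response_format": response_format}
    if language is not None:
        data["language"] = language
    if prompt is not None:
        data["prompt"] = prompt
    if temperature is not None:
        data["temperature"] = str(temperature)
    return files, data
```

With `requests` installed, a call would then look roughly like `requests.post(API_URL, headers={"Authorization": "Bearer YOUR_API_KEY"}, files=files, data=data)`.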
4. Response Information
4.1 Response Data Structure
When response_format is set to the default json, the following JSON structure is returned:
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | The transcribed text generated from the audio. |
4.2 Response Example (HTTP 200 - Success)
{
"text": "12345678910"
}
(Note: If response_format is set to text, srt, or vtt, the API will directly return raw text or subtitle format text instead of a JSON object.)
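Because the response body is JSON for some formats and raw text for others, a client needs a small branch when reading it. A minimal sketch (the function name `extract_transcript` is illustrative, not part of the API):

```python
import json

def extract_transcript(body, response_format="json"):
    """Return the transcript text from a raw response body.

    json / verbose_json responses carry the text in a "text" field;
    text, srt, and vtt responses *are* the transcript and are
    returned as-is.
    """
    if response_format in ("json", "verbose_json"):
        return json.loads(body)["text"]
    return body
```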
5. Request Code Example (cURL)
curl --location --request POST 'https://api.codingplanx.ai/v1/audio/transcriptions' \
--header 'Authorization: Bearer <YOUR_API_KEY>' \
--header 'Content-Type: multipart/form-data' \
--form 'file=@"/C:/Users/Administrator/Desktop/test.m4a"' \
--form 'model="whisper-1"' \
--form 'language="zh"' \
--form 'response_format="json"'
6. Frequently Asked Questions (FAQs)
Q1: What is the maximum audio file size supported by this API?
A: Typically, the Whisper API limits a single audio file to a maximum of 25 MB. If your audio file is too large, it is recommended to compress the audio (e.g., convert to a lower bitrate mp3) or split the long audio into multiple smaller chunks for separate requests before calling the API.
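A quick way to plan the split is to compute how many pieces a file needs given the 25 MB limit. A minimal sketch (the helper name `chunks_needed` is illustrative; actual splitting of audio is format-dependent and not shown):

```python
import math

MAX_BYTES = 25 * 1024 * 1024  # documented per-file upload limit

def chunks_needed(size_bytes, limit=MAX_BYTES):
    """How many roughly equal pieces a file must be split into
    so that each piece fits under the upload limit."""
    return max(1, math.ceil(size_bytes / limit))
```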
Q2: Why does my audio transcription have typos or inaccurately recognized specific proper nouns?
A: Speech recognition models may not be sensitive enough to industry-specific jargon, names, or uncommon vocabulary. You can address this by passing the prompt parameter. For example, if you pre-enter these proper nouns in the prompt, the model will be more inclined to use your provided vocabulary during transcription.
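One simple way to seed the prompt with domain vocabulary is to join the proper nouns into a short sentence. A minimal sketch (the helper name `vocabulary_prompt` and the "Glossary:" phrasing are illustrative choices, and the prompt should be written in the same language as the audio):

```python
def vocabulary_prompt(terms):
    """Join proper nouns into a short prompt string that biases the
    model toward spelling these terms as provided."""
    return "Glossary: " + ", ".join(terms) + "."
```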
Q3: Can I directly generate video subtitle files through this API?
A: Yes. Simply set the response_format parameter to srt or vtt in your request. Upon successful processing, the API will directly return standard subtitle content with timestamps, which you can save directly as a .srt or .vtt file for use in video players.
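When saving the returned subtitle content, the file extension should match the requested format. A minimal sketch of that mapping (the helper name `subtitle_filename` is illustrative):

```python
def subtitle_filename(stem, response_format):
    """Map the API's subtitle formats to the conventional extension."""
    ext = {"srt": ".srt", "vtt": ".vtt"}.get(response_format)
    if ext is None:
        raise ValueError(f"not a subtitle format: {response_format}")
    return stem + ext
```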
Q4: What are the specific benefits of passing the language parameter?
A: Although the Whisper model can automatically detect the language, silence, noise, or an unclear opening in the audio may cause detection to take extra time or produce errors. Proactively providing the correct ISO-639-1 language code (e.g., zh for Chinese, ja for Japanese) will not only significantly reduce request latency but also improve transcription accuracy.
Q5: What causes the Unsupported file format error in the request?
A: This usually happens because the uploaded file's format is not in the supported list. Ensure your audio file is one of the following formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm. Also note that simply renaming the file extension (e.g., changing .avi to .mp4) will not work; the file must use a genuinely supported encoding.
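A client can pre-check the extension against the documented list before uploading. A minimal sketch (the helper name `is_supported` is illustrative; as noted above, a correct extension does not guarantee the underlying encoding is actually supported):

```python
SUPPORTED = {"flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}

def is_supported(filename):
    """Check the file extension against the documented format list.

    This catches obvious mistakes early, but only the server can
    verify the real encoding of the file contents.
    """
    return filename.rsplit(".", 1)[-1].lower() in SUPPORTED
```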