API Documentation: Audio Transcription

This document provides the API specifications for transcribing audio files into text. Powered by advanced speech recognition models (such as gpt-4o-transcribe), this API supports multiple audio input formats and allows customization of output formats and recognition languages.

Official Reference Documentation: OpenAI Speech-to-Text Guides


1. Basic Information

  • API Name: Audio Transcription (gpt-4o-transcribe)
  • HTTP Method: POST
  • Request URL: https://api.codingplanx.ai/v1/audio/transcriptions

2. Request Headers

| Parameter | Required | Type | Example Value | Description |
| --- | --- | --- | --- | --- |
| Content-Type | No | string | multipart/form-data | Declares the request body format for form file uploads. |
| Authorization | Yes | string | Bearer YOUR_API_KEY | The API Key used to authenticate requests, sent using the standard Bearer scheme. |

3. Request Body

The request body must be sent in multipart/form-data format:

| Parameter | Required | Type | Default | Description |
| --- | --- | --- | --- | --- |
| file | Yes | file | - | The audio file object (not the file name) to transcribe.<br>Supported formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm. |
| model | Yes | string | - | ID of the model to use.<br>Available models: gpt-4o-transcribe, gpt-4o-mini-transcribe, whisper-1, etc. |
| language | No | string | - | The language of the input audio, in ISO-639-1 format (e.g., zh for Chinese, en for English). Supplying this parameter is highly recommended: it significantly improves recognition accuracy and reduces latency. |
| prompt | No | string | - | Optional text to guide the model's style or continue a previous audio segment. The prompt should match the audio language. |
| response_format | No | string | json | The format of the transcription output.<br>Optional values: json or text. |
| temperature | No | number | 0 | The sampling temperature, between 0 and 1.<br>Higher values like 0.8 make the output more random, while lower values like 0.2 make it more focused and deterministic. If set to 0, the model uses log probability to automatically increase the temperature until certain thresholds are hit. |
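Before uploading, a client can pre-check a file's extension against the supported formats listed above and fail fast on unsupported input. A minimal sketch; `is_supported_audio` is a hypothetical helper, not part of the API:

```python
from pathlib import Path

# Formats accepted by the transcription endpoint, per the table above.
SUPPORTED_EXTENSIONS = {"flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}

def is_supported_audio(path: str) -> bool:
    """Check a file's extension against the supported-format list."""
    return Path(path).suffix.lstrip(".").lower() in SUPPORTED_EXTENSIONS

print(is_supported_audio("test.m4a"))   # True
print(is_supported_audio("notes.txt"))  # False
```

Note this only checks the extension; as the FAQ below points out, the actual file format must also match the extension.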

4. Response Specifications

4.1 Response Parameters (JSON Format)

When response_format is set to json (default), the HTTP status code is 200 OK, and the returned JSON structure is as follows:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | Yes | The transcribed text content recognized by the model from the audio. |
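Because the response is a single-field JSON object, it can be consumed with any standard JSON parser. A minimal sketch with an illustrative response body:

```python
import json

# An example response body as described above (illustrative value).
raw = '{"text": "Imagine the wildest idea that you\'ve ever had."}'

data = json.loads(raw)
assert "text" in data  # "text" is the only required field
print(data["text"])
```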

4.2 Successful Response Examples

Example 1: English Transcription

{
  "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that."
}

Example 2: Chinese Transcription

{
  "text": "一二三四五六七八九十"
}

5. Code Snippets

cURL

curl --location --request POST 'https://api.codingplanx.ai/v1/audio/transcriptions' \
--header 'Authorization: Bearer YOUR_API_KEY' \
--form 'file=@"/path/to/your/audio/test.m4a"' \
--form 'model="gpt-4o-transcribe"' \
--form 'response_format="json"'

Note: do not set the Content-Type header manually when using --form. curl generates it automatically, including the required multipart boundary; overriding it with a bare multipart/form-data value omits the boundary and can cause the server to reject the request.

Python (Requests)

import requests

url = "https://api.codingplanx.ai/v1/audio/transcriptions"
headers = {
    "Authorization": "Bearer YOUR_API_KEY"
}
payload = {
    "model": "gpt-4o-transcribe",
    "response_format": "json",
    "language": "en"
}

# Open the audio file in a context manager so the handle is closed after the upload.
with open("/path/to/your/audio/test.m4a", "rb") as audio_file:
    files = {
        "file": ("test.m4a", audio_file, "audio/mp4")
    }
    response = requests.post(url, headers=headers, data=payload, files=files)

response.raise_for_status()
print(response.json())

6. Frequently Asked Questions (FAQs)

Q1: What audio file formats does the API support? A: Currently, supported audio formats include: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, and webm. Please ensure your file format matches the file extension.

Q2: Why is the transcription sometimes slow, or the first few words are inaccurate? A: This usually happens because the model is trying to auto-detect the audio language. It is highly recommended to pass the language parameter in your request (using the ISO-639-1 standard, e.g., zh for Chinese, en for English). This not only reduces processing latency but also significantly improves transcription accuracy.

Q3: How can I make the model recognize specific proper nouns or industry jargon? A: You can use the prompt parameter. Pass the proper nouns, names, or specific punctuation styles you want the model to accurately recognize as a text string into the prompt. The model will reference this context and style during transcription. Note: The language of the prompt must match the language of the audio.
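As a concrete sketch of Q3, the prompt is just another form field alongside model and language. The field values below are illustrative, not prescribed by the API:

```python
# Form fields for a request whose audio contains technical jargon.
# The prompt lists terms the model should spell correctly (illustrative values).
payload = {
    "model": "gpt-4o-transcribe",
    "language": "en",
    "prompt": "The speakers discuss Kubernetes, gRPC, and OAuth 2.0.",
}

print(payload["prompt"])
```

This dictionary would be passed as the `data=` argument of the multipart request shown in Section 5.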

Q4: What is the difference between choosing json and text for response_format? A:

  • Choosing json (default): The API returns a standard JSON object {"text": "transcribed text"}, which is the easiest to parse in most backend applications.
  • Choosing text: The API returns a plain text string directly (Content-Type: text/plain) without an outer JSON structure. This is suitable for lightweight scripts that only need the raw text content.
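A client that supports both output formats can branch on the response's Content-Type, as described above. A minimal sketch; `parse_transcription` is a hypothetical helper, and the bodies are simulated rather than fetched from the API:

```python
import json

def parse_transcription(content_type: str, body: str) -> str:
    """Return the transcript from either response format."""
    if content_type.startswith("application/json"):
        # json format: body is {"text": "..."}
        return json.loads(body)["text"]
    # text format: body is already the raw transcript
    return body

# Simulated responses for each response_format setting:
print(parse_transcription("application/json", '{"text": "hello world"}'))  # hello world
print(parse_transcription("text/plain", "hello world"))                    # hello world
```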

Q5: How should I set the temperature parameter? A: For most transcription scenarios, it is recommended to keep the default value of 0. Under this setting, the model prioritizes outputting the most logical and deterministic text. If your audio contains heavy background noise or non-standard pronunciation that causes "hallucinations" or missing words under default settings, you can try slightly increasing the temperature (e.g., 0.2 to 0.4) to give the model more flexibility in its predictions.