Bug Description
When using the SpeechRecognition library with OpenAI Whisper API, only the first few seconds of audio files are transcribed, regardless of the actual file duration or size.
Steps to Reproduce
- Use a WAV audio file longer than ~30 seconds
- Run transcription using:
  python -m speech_recognition.recognizers.whisper_api.openai --model gpt-4o-transcribe audio_file.wav
- Observe that only the first portion is transcribed

For example, with uv:

% uv run --python 3.12 --with 'SpeechRecognition[openai]==3.14.2' -- python -m speech_recognition.recognizers.whisper_api.openai -l ja long_audio.wav
Here is an example long_audio.wav:
https://notebooklm.google.com/notebook/e7297b2e-e363-4e77-bff3-8d71e104d5a2
Expected Behavior
The entire audio file should be transcribed.
Actual Behavior
Only the first few seconds are transcribed (e.g., 18 characters from a 7.6-minute file).
Root Cause Analysis
The issue appears to be in the AudioData.get_wav_data() method. When processing audio files, the method converts only a small portion of the audio data:
- Original file: 21.89 MB, 456 seconds
- WAV conversion result: 0.08 MB (abnormally small)
- This suggests only ~2-3 seconds of audio are being processed
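The size mismatch above can be sanity-checked with a short stdlib-only sketch. The 24 kHz / 16-bit / mono figures come from the Environment section below; the byte-rate formula is standard uncompressed PCM, and `pcm_bytes` is a hypothetical helper name:

```python
def pcm_bytes(sample_rate, sample_width, channels, seconds):
    """Size of raw PCM audio: rate * bytes-per-sample * channels * duration."""
    return int(sample_rate * sample_width * channels * seconds)

# The original 456 s file at 24 kHz, 16-bit, mono:
full_size = pcm_bytes(24_000, 2, 1, 456)
print(full_size / 1e6)   # ≈ 21.89 MB, matching the reported file size

# Working backwards from the 0.08 MB conversion result:
seconds_kept = 0.08e6 / (24_000 * 2 * 1)
print(seconds_kept)      # ≈ 1.7 s, i.e. only a couple of seconds survive
```

So the reported 0.08 MB output is consistent with roughly two seconds of audio, not the full 7.6 minutes.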
Evidence
Testing the same audio file directly with OpenAI Python SDK works perfectly:
- SpeechRecognition library: 18 characters transcribed
- Direct OpenAI API: 2,829 characters (complete transcription)
Environment
- SpeechRecognition version: 3.14.2
- Python: 3.12
- Audio format: 24kHz, 16-bit, mono PCM WAV
- Models tested: whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe (all show same issue)
Workaround
Use the OpenAI Python SDK directly instead of the SpeechRecognition library.
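A minimal sketch of that workaround, assuming the `openai` package is installed and OPENAI_API_KEY is set in the environment (`transcribe_full` is a hypothetical helper name, not part of either library):

```python
def transcribe_full(path, model="gpt-4o-transcribe", language=None):
    """Upload the whole file to OpenAI directly, bypassing AudioData.get_wav_data()."""
    from openai import OpenAI  # requires the `openai` package and OPENAI_API_KEY

    client = OpenAI()
    with open(path, "rb") as f:
        kwargs = {"model": model, "file": f}
        if language:
            kwargs["language"] = language  # e.g. "ja", mirroring the -l flag above
        return client.audio.transcriptions.create(**kwargs).text
```

Because the file handle is passed straight to the API, no intermediate WAV re-encoding happens, which is why the direct call returned the complete 2,829-character transcription.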