Differences in Whisper Results When Executing via Code vs. Console Command with the Same Parameters #1893
Unanswered
Igarugueri asked this question in Q&A
-
See this discussion, and note the inclusion of |
-
NOTE: I originally posted this message in the Show and Tell section; I apologize for that, as this is the appropriate section.
Hello Whisper community,
I am encountering unexpected behaviour with Whisper, OpenAI's speech-to-text transcription model. I get different results when running the model through a Python script than when running the same command directly in the console, even though I use the same parameters in both cases.
-Parameters Used-
Execution Environment: Windows 11, Python 3.9.9
Whisper Version: 20231117
In the script: (file_path: str, model_size: str = "small", word_timestamps: bool = True, language: str = "Spanish", translate: bool = True)
In the console: > whisper file.wav --word_timestamps True --language es --task translate --model small
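One known source of this kind of divergence is that the `whisper` CLI passes decoding defaults that a bare `model.transcribe()` call does not: in particular, the CLI defaults to `--beam_size 5` and `--best_of 5`, while calling `transcribe()` without those arguments falls back to greedy decoding. Below is a sketch of a Python call that mirrors the console flags; it assumes the `openai-whisper` package is installed, and the parameter names should be checked against your installed version:

```python
# Decoding options the CLI applies by default but model.transcribe() does not.
# (Values taken from whisper's CLI argument defaults; verify for your version.)
CLI_DEFAULTS = dict(
    word_timestamps=True,   # same as --word_timestamps True
    language="es",          # same as --language es
    task="translate",       # same as --task translate
    beam_size=5,            # CLI default; transcribe() alone decodes greedily
    best_of=5,              # CLI default for the temperature-fallback passes
)

def transcribe_like_cli(path: str, model_size: str = "small") -> dict:
    """Run Whisper from Python with the same options the console command uses."""
    import whisper  # requires the openai-whisper package

    model = whisper.load_model(model_size)
    return model.transcribe(path, **CLI_DEFAULTS)
```

If the script currently calls `transcribe()` with only `language` and `task`, adding `beam_size=5` alone may already close most of the quality gap, since beam search is noticeably more accurate than greedy decoding on longer audio.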
-Issue-
When using Whisper, I am encountering differences in performance between running it through a Python script and executing the same command directly in the console:
Console Command Execution:
The transcription captures all words accurately and recognizes pauses correctly, aligning well with the audio.
Python Script Execution:
In the Python script, the primary issue lies in the 'segments' section of the output: the script fails to detect pauses accurately, missing several of them compared to the console execution.
Additionally, the overall transcription quality is somewhat worse than the console output.
This discrepancy is puzzling, especially since the parameters and environment are consistent across both methods of execution.
I would like to understand why this discrepancy exists. Could it be due to differences in the execution environment, or is there something else I might be overlooking?
I appreciate any guidance or suggestions you can provide to help me solve this mystery.
Thank you in advance for your time and help!
Best regards,
Igarugueri.
NOTE: Additional Audio File Details
Here are some key details of the audio file (file.wav) I am using for the transcription:
File Format: WAV / WAVE (Waveform Audio)
Audio Duration: Approximately 112.56 seconds (about 1 minute and 52 seconds)
Audio Codec: PCM signed 16-bit little-endian (pcm_s16le)
Sample Rate: 44,100 Hz
Audio Channels: 2 (stereo)
Bit Rate: Approximately 1,411,200 bits per second (1.41 Mbps)
File Size: Approximately 19.86 MB
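The reported bit rate and file size are internally consistent for uncompressed PCM, and a quick arithmetic check confirms it. Note also that, as far as I know, Whisper's `load_audio` resamples all input to 16 kHz mono via ffmpeg regardless of entry point, so the 44.1 kHz stereo source should be treated identically in both the script and the console, and is unlikely to be the cause of the discrepancy:

```python
# Sanity-check the reported audio stats for uncompressed PCM:
#   bytes/second = sample_rate * channels * bytes_per_sample
sample_rate = 44_100        # Hz
channels = 2                # stereo
bytes_per_sample = 2        # pcm_s16le -> 16-bit samples
duration_s = 112.56         # reported duration

bit_rate = sample_rate * channels * bytes_per_sample * 8
size_mb = sample_rate * channels * bytes_per_sample * duration_s / 1_000_000

print(bit_rate)             # 1_411_200 bits/s, matching the reported 1.41 Mbps
print(round(size_mb, 2))    # ~19.86 MB, matching the reported file size
```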