Issues with clip_timestamps: Slow Transcription and Nonsensical Results #2551

esphoenixc · 2025-03-18T22:58:13Z

When I pass the value below as clip_timestamps, transcribing a one-minute audio file takes far too long: roughly 60 to 120 seconds, which is absurd. The point of using clip_timestamps is to keep Whisper from hallucinating by transcribing only the speech regions detected with silero-vad. However, when the audio contains multiple languages, even reducing the clip_timestamps value by a factor of ten still results in a lengthy transcription, and the output turns into gibberish with nonsensical, repeated words and strange symbols. Has anyone else experienced this issue?

I am using the Whisper large-v3-turbo model.

(start=2.722, end=5.534, duration=2.812),
VADSegment(start=6.338, end=9.406, duration=3.068),
VADSegment(start=10.37, end=13.566, duration=3.196),
VADSegment(start=13.986, end=15.358, duration=1.372),
VADSegment(start=15.522, end=19.39, duration=3.868),
VADSegment(start=19.81, end=21.694, duration=1.884),
VADSegment(start=22.882, end=24.862, duration=1.98),
VADSegment(start=26.37, end=27.71, duration=1.34),
VADSegment(start=28.386, end=30.91, duration=2.524),
VADSegment(start=31.394, end=38.078, duration=6.684),
VADSegment(start=38.882, end=41.054, duration=2.172),
VADSegment(start=41.826, end=43.454, duration=1.628),
VADSegment(start=43.778, end=45.598, duration=1.82),
VADSegment(start=46.082, end=47.262, duration=1.18)])
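
For anyone trying to reproduce this, here is a minimal sketch of how VAD segments like these could be turned into a clip_timestamps value. The `VADSegment` dataclass is a hypothetical stand-in for the objects printed above, and `merge_close_segments`, `to_clip_timestamps`, and the 0.5 s `max_gap` are illustrative assumptions, not part of Whisper or silero-vad. The sketch relies on openai-whisper's `transcribe()` accepting clip_timestamps as a flat list of start,end,start,end,... values in seconds.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VADSegment:
    # Hypothetical stand-in for the VADSegment objects shown in the post.
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def merge_close_segments(segments: List[VADSegment], max_gap: float = 0.5) -> List[VADSegment]:
    """Merge neighbouring segments whose silence gap is at most max_gap seconds."""
    merged: List[VADSegment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if merged and seg.start - merged[-1].end <= max_gap:
            # Extend the previous segment instead of starting a new clip.
            merged[-1] = VADSegment(merged[-1].start, max(merged[-1].end, seg.end))
        else:
            merged.append(VADSegment(seg.start, seg.end))
    return merged


def to_clip_timestamps(segments: List[VADSegment]) -> List[float]:
    """Flatten segments into [start, end, start, end, ...] in seconds."""
    flat: List[float] = []
    for seg in segments:
        flat.extend([seg.start, seg.end])
    return flat


# A few of the segments from the post, used as sample input.
segments = [
    VADSegment(2.722, 5.534),
    VADSegment(6.338, 9.406),
    VADSegment(13.986, 15.358),
    VADSegment(15.522, 19.39),
]
clips = to_clip_timestamps(merge_close_segments(segments, max_gap=0.5))
print(clips)

# Assumed usage with openai-whisper (not executed here):
#   model = whisper.load_model("turbo")
#   result = model.transcribe("audio.wav", clip_timestamps=clips)
```

Merging segments separated by very short gaps reduces the number of clips handed to Whisper. Whether that helps with the slowdown or the gibberish output described above is untested here; it is only one thing to try.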


Replies: 0 comments
