How to obtain word-level segmentation timestamps? #1855

ZayneHuang · 2023-11-30T03:22:09Z

Hi,

I am having trouble with the transcription output from Whisper. The current timestamps include the pauses between words, but I need precise start and end times for each individual word, excluding any pauses. I think what I need is the segmentation timestamp of each word.
I am using Whisper version 20231117, installed from PyPI. Does anyone have any ideas on this issue? I would greatly appreciate your insights.

Answered by jongwook

jongwook · 2023-11-30T09:09:02Z

Maintainer

Passing --word_timestamps True on the command line, or word_timestamps=True as an argument to transcribe(), will give you word-level timestamps in the results.
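As a minimal sketch of the Python route, the snippet below enables word timestamps and flattens the result into (word, start, end) tuples. It assumes the openai-whisper package is installed; the helper names transcribe_with_words and iter_words and the "base" model default are illustrative choices, not part of Whisper itself. The sample dictionary only mimics the shape of a transcribe() result, using the values quoted later in this thread.

```python
def transcribe_with_words(audio_path, model_name="base"):
    """Run Whisper with word timestamps enabled (needs openai-whisper installed)."""
    import whisper  # imported lazily so iter_words() can be used without it

    model = whisper.load_model(model_name)
    return model.transcribe(audio_path, word_timestamps=True)


def iter_words(result):
    """Yield (word, start, end) for every word in a transcribe() result dict."""
    for segment in result.get("segments", []):
        # "words" is only present when word_timestamps=True was requested
        for w in segment.get("words", []):
            yield w["word"].strip(), w["start"], w["end"]


# Hand-built sample with the same shape as a real result (values from this thread)
sample = {"segments": [{"words": [
    {"word": " me", "start": 364.6, "end": 364.82, "probability": 0.9959644079208374},
    {"word": " around", "start": 364.82, "end": 365.18, "probability": 0.9976060390472412},
]}]}

words = list(iter_words(sample))
for word, start, end in words:
    print(f"{start:8.2f} -> {end:8.2f}  {word}")
```

With a real file, the call would look like iter_words(transcribe_with_words("audio.wav")), where "audio.wav" is a placeholder path.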

ZayneHuang Nov 30, 2023
Author

Thank you for your response. Yes, I did try --word_timestamps True, but the output doesn't seem to provide word segmentation timestamps as I expected. In this example transcription result:
{"word": " me", "start": 364.6, "end": 364.82, "probability": 0.9959644079208374},
{"word": " around", "start": 364.82, "end": 365.18, "probability": 0.9976060390472412},
the end time of the word "me" coincides exactly with the start time of the word "around".
However, other ASR tools such as MS Azure provide the exact start and end time of each word (there is an interval between the end of "me" and the Offset of "around"). For the same part of the audio, the result is:
{ "Duration": 1400000, "Offset": 3437300000, "Word": "me" },
{ "Duration": 5100000, "Offset": 3438800000, "Word": "around" },

It appears that the word-level timestamps from Whisper do not represent the segmentation of each word but instead include the pauses between words. I'm not sure whether I've misused the model and produced an inaccurate output as a result. I'd greatly appreciate further insights or guidance on how to obtain precise word-level segmentation timestamps.
Thank you once again for your assistance.
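To make the comparison above concrete, here is a small sketch that computes the silence between consecutive words for both outputs. It relies on Azure Speech reporting Offset and Duration in 100-nanosecond ticks; the helper names azure_span and gaps are made up for illustration and belong to neither tool.

```python
TICKS_PER_SECOND = 10_000_000  # Azure Offset/Duration use 100-nanosecond ticks


def azure_span(entry):
    """Convert an Azure word entry to (start, end) in seconds."""
    start = entry["Offset"] / TICKS_PER_SECOND
    end = (entry["Offset"] + entry["Duration"]) / TICKS_PER_SECOND
    return start, end


def gaps(spans):
    """Seconds between the end of each word and the start of the next one."""
    return [round(nxt[0] - cur[1], 3) for cur, nxt in zip(spans, spans[1:])]


# The two word pairs quoted in this post
azure = [
    {"Duration": 1400000, "Offset": 3437300000, "Word": "me"},
    {"Duration": 5100000, "Offset": 3438800000, "Word": "around"},
]
whisper_words = [
    {"word": " me", "start": 364.6, "end": 364.82},
    {"word": " around", "start": 364.82, "end": 365.18},
]

azure_gaps = gaps([azure_span(e) for e in azure])
whisper_gaps = gaps([(w["start"], w["end"]) for w in whisper_words])
print("Azure gaps (s):  ", azure_gaps)
print("Whisper gaps (s):", whisper_gaps)
```

For these two samples the Azure pair is separated by 0.01 s, while the Whisper pair has no gap at all, which is the contiguity described above.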

jongwook Nov 30, 2023
Maintainer

Your observation is correct; Whisper is not explicitly trained for word-level timestamps, and the current outputs are produced by an inference-time trick that does not give perfectly accurate timing, especially when dealing with pauses.