-
Hello @jongwook, thank you for your great work and for making it open-source! I am currently writing code to fine-tune a Whisper model "with timestamps" (https://github.com/jumon/whisper-finetuning) and have a few questions about it.
Thank you in advance for your time and help.
-
Have you checked and measured improvements after training the new model with timestamps? Does it produce better timestamps for segments? If you see a clear improvement, I can lend a hand with Spanish models. I think that better segment timestamps + ASR (PyTorch Wav2Vec2) for word-level timestamps could be a good improvement as well. This, together with reducing hallucinations via parameters like beam search and setting condition-on-previous-text to false, could boost Whisper's general performance.
-
Btw, the Hugging Face version just added a timestamp implementation: huggingface/transformers#20620
-
Hi!
Hope this helps!
-
Hi all,
-
Hi @jongwook, I have been trying to do timestamp-aware fine-tuning on Whisper. What should the data look like before passing it to the model? My dataset has no timestamps, so to add them I ran VAD on the audio files and mapped the resulting segments to the corresponding text. I formatted the text like this: "<|0.00|> The laptop was a gaming laptop, and we could play games on it too <|5.12|>", but after running the training loop the model does not seem to output timestamps, even after trying all the relevant hyperparameters.
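For reference, here is a minimal sketch of building a timestamp-annotated target string from VAD segments, as described above. The segment data and function names are hypothetical, not from the linked repo; the only assumption taken from Whisper itself is that its timestamp tokens have 20 ms granularity.

```python
# Hypothetical sketch: turn VAD segments (start/end in seconds, plus text)
# into a Whisper-style timestamp-annotated transcript string.

def to_timestamp_token(seconds: float) -> str:
    # Whisper's timestamp tokens are spaced 0.02 s apart, so snap to that grid.
    return f"<|{round(seconds / 0.02) * 0.02:.2f}|>"

def build_target(segments: list[dict]) -> str:
    parts = []
    for seg in segments:
        parts.append(to_timestamp_token(seg["start"]))
        parts.append(seg["text"].strip())
        parts.append(to_timestamp_token(seg["end"]))
    return "".join(parts)

# Example data, matching the format quoted in the question above.
segments = [
    {"start": 0.0, "end": 5.12,
     "text": "The laptop was a gaming laptop, and we could play games on it too"},
]
print(build_target(segments))
# <|0.00|>The laptop was a gaming laptop, and we could play games on it too<|5.12|>
```

How these strings are tokenized (as special timestamp tokens rather than literal text) depends on the tokenizer setup, which is where fine-tuning attempts often go wrong.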
Hi!
During training, <|notimestamps|> was included in the prompt for 50% of the samples (in which case the timestamp tokens were dropped), and not included for the other 50% of the time (in which case the timestamps were kept). In practice, the model will mostly behave as expected with or without the timestamp tokens in the prompt.
Also, many training segments began at the <|0.00|> timestamp, which resulted in a huge bias on that token as well as on the integer timestamps. I think some form of soft labels like you suggested would mitigate this issue.