-
Hello @jongwook, thank you for your great work and for making it open-source! I am currently writing code to fine-tune a Whisper model "with timestamps" (https://github.com/jumon/whisper-finetuning) and have a few questions about it.
Thank you in advance for your time and help.
-
Have you checked and measured improvements after training the new model with timestamps? Does it produce better timestamps for segments? If you see a clear improvement, I can lend a hand with Spanish models. I think that better segment timestamps + ASR (PyTorch Wav2Vec2) for word-level timestamps could be a good improvement as well. This, together with reducing hallucinations via parameters like beam search and setting condition-on-previous-text to false, could boost Whisper's general performance.
-
Btw, the Hugging Face version just added a timestamp implementation: huggingface/transformers#20620
-
Hi!
Hope this helps!
-
Hi all,
-
Hi @jongwook, I have been trying to do timestamp-aware fine-tuning on Whisper. What should the data look like before passing it to the model? My dataset has no timestamps, so to add them I ran VAD on the audio files and mapped the resulting segments to the corresponding text. I formatted the text like this: "<|0.00|> The laptop was a gaming laptop, and we could play games on it too <|5.12|>", but after running the training loop the model does not seem to output timestamps, even after trying all the relevant hyperparameters.
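For reference, here is a minimal sketch of building a timestamp-annotated target string from VAD segments, as described above. The segment data and function names are hypothetical, not from the linked repo; the only assumption taken from Whisper itself is that its timestamp tokens have 20 ms granularity.

```python
# Hypothetical sketch: turn VAD segments (start/end in seconds, plus text)
# into a Whisper-style timestamp-annotated transcript string.

def to_timestamp_token(seconds: float) -> str:
    # Whisper's timestamp tokens are spaced 0.02 s apart, so snap to that grid.
    return f"<|{round(seconds / 0.02) * 0.02:.2f}|>"

def build_target(segments: list[dict]) -> str:
    parts = []
    for seg in segments:
        parts.append(to_timestamp_token(seg["start"]))
        parts.append(seg["text"].strip())
        parts.append(to_timestamp_token(seg["end"]))
    return "".join(parts)

# Example data, matching the format quoted in the question above.
segments = [
    {"start": 0.0, "end": 5.12,
     "text": "The laptop was a gaming laptop, and we could play games on it too"},
]
print(build_target(segments))
# <|0.00|>The laptop was a gaming laptop, and we could play games on it too<|5.12|>
```

How these strings are tokenized (as special timestamp tokens rather than literal text) depends on the tokenizer setup, which is where fine-tuning attempts often go wrong.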
Hi!
During training, <|notimestamps|> was included in the prompt for 50% of the samples (in which case the timestamp tokens were dropped), and not included for the other 50% of the time (in which case the timestamps were kept). In practice, the model will mostly behave as expected with or without the timestamp tokens in the prompt.
Also, many training segments began at the <|0.00|> timestamp, which resulted in a huge bias on that token as well as on the integer timestamps. I think some form of soft labels like you suggested would mitigate this issue.