Does fine-tuning Whisper without timestamps affect long-form transcription performance? #1703
Unanswered
jeewooyoon-raondata asked this question in Q&A
Replies: 0 comments
Hi @jongwook
Thank you for sharing this great open-source project :)
I successfully fine-tuned the Whisper medium model on Korean speech data (10-15 s clips, zero-padded to 30 s).
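The padding step above can be sketched as follows. This is a minimal NumPy version of the fixed-window padding Whisper uses for its 30-second training inputs (the 16 kHz sample rate is Whisper's standard; the clip length here is illustrative):

```python
import numpy as np

SAMPLE_RATE = 16000          # Whisper's expected sample rate
CHUNK_LENGTH = 30            # seconds per training window
N_SAMPLES = SAMPLE_RATE * CHUNK_LENGTH

def pad_or_trim(audio: np.ndarray, length: int = N_SAMPLES) -> np.ndarray:
    """Zero-pad (or truncate) a waveform to exactly `length` samples,
    mirroring the fixed 30 s windows used during fine-tuning."""
    if len(audio) >= length:
        return audio[:length]
    return np.pad(audio, (0, length - len(audio)))

clip = np.random.randn(12 * SAMPLE_RATE).astype(np.float32)  # a 12 s clip
padded = pad_or_trim(clip)
print(padded.shape)  # (480000,)
```

The openai-whisper package ships an equivalent `whisper.pad_or_trim` helper that does the same thing on NumPy arrays or torch tensors.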
I found that the fine-tuned model performs better (in CER/WER terms) than the Whisper large-v2 model on short-form (under 30 s) speech: 20% CER for the fine-tuned model vs. 28% CER for large-v2.
However, when I run long-form transcription with the transcribe function, the model produces unstable outputs. Adjusting the decoding option parameters improves performance slightly, but the output is still unstable.
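For reference, these are the decoding-related arguments to `transcribe()` that are most often adjusted to stabilize long-form output; the parameter names are real `whisper.transcribe` arguments, but the values below are only illustrative defaults, not tuned for this model:

```python
# A sketch of decoding options commonly tweaked for long-form stability.
# All keys are real keyword arguments of whisper's transcribe();
# the values shown are illustrative, not recommendations.
stabilizing_options = dict(
    condition_on_previous_text=False,   # stop errors propagating across 30 s windows
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # fallback schedule on decode failure
    compression_ratio_threshold=2.4,    # retry if output looks repetitive
    logprob_threshold=-1.0,             # retry if average log-probability is too low
    no_speech_threshold=0.6,            # skip windows classified as silence
)
# result = model.transcribe("long_audio.wav", **stabilizing_options)
```

Disabling `condition_on_previous_text` in particular often helps fine-tuned models, since it prevents a bad window from contaminating the prompt for every subsequent window.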
So, my question is: what could cause this instability?
Maybe because I fine-tuned the model without timestamps?
Or because most of the fine-tuning data is shorter than the 30-second segments used in the original training?
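One detail worth noting for the first hypothesis: Whisper's `transcribe()` relies on the model's predicted timestamp tokens to decide where to advance the 30 s sliding window, so a model fine-tuned entirely without timestamps can mis-predict them and derail the windowing. A workaround is to do the chunking yourself and decode each chunk as short-form audio. A minimal fixed-window chunker (a real pipeline would cut on silence rather than at hard 30 s boundaries) might look like:

```python
import numpy as np

SAMPLE_RATE = 16000
WINDOW = 30 * SAMPLE_RATE   # 30 s of samples per window

def chunk_audio(audio: np.ndarray, window: int = WINDOW):
    """Split a long waveform into fixed 30 s windows (the last one may be
    shorter), so each chunk matches the length regime seen in fine-tuning.
    Cutting at hard boundaries can split words; a production pipeline
    would use a VAD or silence detection to pick the cut points."""
    return [audio[i:i + window] for i in range(0, len(audio), window)]

audio = np.zeros(75 * SAMPLE_RATE, dtype=np.float32)  # 75 s of audio
chunks = chunk_audio(audio)
print([len(c) // SAMPLE_RATE for c in chunks])  # [30, 30, 15]
```

Each chunk can then be decoded like a short clip (e.g. with `whisper.DecodingOptions(without_timestamps=True)`), sidestepping the model's timestamp predictions entirely.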