Whisper fine-tuning: Validation loss increases but WER is decreasing. #2575
-
Yes, I've seen the same thing. This behavior is actually not unusual when working with Whisper or other sequence-to-sequence models.

Situation Recap
At first glance it might seem contradictory: how can the model be getting worse according to the loss, but better according to WER?

Short Answer
Yes, this can be perfectly fine, especially if WER is the main metric you care about.

Why This Happens
1. The loss function is different from your evaluation metric: the validation loss is a per-token cross-entropy, while WER measures word-level errors in the final transcript.
2. Training (and the validation loss) uses teacher forcing, whereas WER is computed on autoregressively decoded output, so the two can move in opposite directions.
3. WER is what matters for real-world performance.

Should You Keep Such a Model?
Yes. If your downstream task prioritizes WER (or CER), it's reasonable to select the model checkpoint with the lowest WER, not necessarily the lowest validation loss. In most ASR use cases, sequence-level accuracy matters more than per-token log-likelihood. Optimize for the real world, not the math world.
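For concreteness, here is a minimal sketch of how checkpoint selection by WER can be wired up with the Hugging Face Seq2SeqTrainer, roughly following the setup in the fine-tune-whisper blog post. The checkpoint name, output directory, and evaluation intervals are placeholder assumptions, not the exact configuration from the question.

```python
import evaluate
from transformers import WhisperProcessor, Seq2SeqTrainingArguments

# Processor for decoding predicted token IDs back to text
# (placeholder checkpoint; use whichever Whisper size you are fine-tuning).
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")

# Word error rate from the `evaluate` library, computed on decoded text,
# not on the teacher-forced logits that the validation loss is based on.
wer_metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids
    # The loss ignores -100 labels; swap them for the pad token before decoding.
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.batch_decode(label_ids, skip_special_tokens=True)

    return {"wer": 100 * wer_metric.compute(predictions=pred_str, references=label_str)}

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-finetuned",  # placeholder path
    predict_with_generate=True,   # evaluate with autoregressive decoding
    evaluation_strategy="steps",
    eval_steps=1000,              # placeholder interval
    save_steps=1000,
    load_best_model_at_end=True,  # restore the best checkpoint after training
    metric_for_best_model="wer",  # select checkpoints by WER ...
    greater_is_better=False,      # ... where lower is better
)
```

With `predict_with_generate=True`, the reported WER comes from free-running autoregressive decoding, so it reflects the sequence-level behavior you actually care about and can keep improving even while the teacher-forced validation loss rises.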
-
Yes, with reference to this, the dataset size can also affect that.
-
Hello,
I followed https://huggingface.co/blog/fine-tune-whisper to fine-tune the Whisper large model for English. One of my observations was that the training loss is decreasing while the validation loss is increasing, which is a classic sign of overfitting.
But the WER on the validation set is decreasing.
The same behavior can also be seen in the blog post linked above. I'm a bit confused about this.
Is it okay to use such models? What could be the reasons behind this?
@sanchit-gandhi, please reply if possible.