Hi all,

I'm currently fine-tuning `openai/whisper-large-v3-turbo` and wanted to share some thoughts (and ask for feedback) on using `whisper.pad_or_trim()` vs. dynamic padding when preparing the training dataset.
**Context:**

The official Whisper preprocessing pipeline uses `whisper.pad_or_trim(audio)` to force input audio to exactly 30 seconds (480,000 samples).

I experimented with both approaches: `pad_or_trim()` for fixed-length input, and dynamic padding (to the max length per batch) with a proper `attention_mask`.
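To make the comparison concrete, here is a minimal sketch of the two preprocessing paths (file names are placeholders; I'm assuming 16 kHz mono audio, the 128-mel configuration used by the large-v3 family, and the standard `WhisperFeatureExtractor` options `padding="longest"` and `return_attention_mask=True`):

```python
import whisper
from transformers import WhisperFeatureExtractor

# --- Fixed-length path (official Whisper preprocessing) ---
audio = whisper.load_audio("sample.wav")              # 16 kHz float32 waveform
audio = whisper.pad_or_trim(audio)                    # pad/trim to 480,000 samples (30 s)
mel = whisper.log_mel_spectrogram(audio, n_mels=128)  # (128, 3000) for the large-v3 family

# --- Dynamic-padding path (pad only to the longest clip in the batch) ---
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3-turbo")
batch_audio = [whisper.load_audio("clip_a.wav"), whisper.load_audio("clip_b.wav")]
features = feature_extractor(
    batch_audio,
    sampling_rate=16000,
    padding="longest",            # per-batch padding instead of a fixed 30 s
    return_attention_mask=True,   # frame-level mask over real vs. padded positions
    return_tensors="pt",
)
# features.input_features -> (batch, 128, frames_of_longest_clip)
# features.attention_mask  -> 1 for real frames, 0 for padding
# Note: the stock Hugging Face Whisper encoder expects exactly 3000 frames,
# so shorter feature sequences require adapting its positional embeddings.
```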
**My observations:**

| Criteria | `whisper.pad_or_trim()` | Dynamic padding |
| --- | --- | --- |
| GPU memory usage | High and fixed | More efficient |
| Training speed | Slower due to zero-padding | Faster |
| Loss stability | Slightly more stable | Occasionally fluctuates |
| Alignment / hallucination | Slightly better | Acceptable, but may need filtering |
| Match to pretraining setup | ✅ Fully aligned | ❌ Slightly deviates |
| Suitability for short audios | ❌ Over-padding | ✅ Efficient |
**Conclusion so far:**

- From a training-data preparation perspective, `pad_or_trim()` seems inefficient, especially when most audio samples are under 30 seconds.
- Dynamic padding performs better in terms of speed and memory usage, and works well as long as the `attention_mask` and positional embeddings are handled correctly (see the collator sketch below).
- For inference, I might still use `pad_or_trim()` to match pretraining behavior and minimize alignment issues.
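To illustrate what I mean by handling the `attention_mask`, here is a minimal collator sketch (a simplified illustration, not my exact training code). It assumes each dataset example holds a raw 16 kHz waveform under `"audio"` and tokenized transcript ids under `"labels"`, and that the encoder has been adapted to accept feature sequences shorter than 3000 frames:

```python
from dataclasses import dataclass
from typing import Any, Dict, List

import torch
from transformers import WhisperFeatureExtractor, WhisperTokenizer


@dataclass
class DynamicPaddingCollator:
    """Pads audio to the longest clip in each batch instead of a fixed 30 s."""

    feature_extractor: WhisperFeatureExtractor
    tokenizer: WhisperTokenizer

    def __call__(self, examples: List[Dict[str, Any]]) -> Dict[str, torch.Tensor]:
        # Per-batch ("longest") padding of the log-mel features, plus a
        # frame-level attention mask over the padded positions.
        features = self.feature_extractor(
            [ex["audio"] for ex in examples],
            sampling_rate=16000,
            padding="longest",
            return_attention_mask=True,
            return_tensors="pt",
        )

        # Pad the label ids to the longest transcript in the batch and replace
        # padded positions with -100 so they are ignored by the loss.
        label_batch = self.tokenizer.pad(
            {"input_ids": [ex["labels"] for ex in examples]},
            padding=True,
            return_tensors="pt",
        )
        labels = label_batch["input_ids"].masked_fill(
            label_batch["attention_mask"].ne(1), -100
        )

        return {
            "input_features": features["input_features"],
            "attention_mask": features["attention_mask"],
            "labels": labels,
        }
```

For the fixed-length setup, the same collator would simply drop `padding="longest"`; the feature extractor then pads everything to 30 seconds by default, which is the behavior that matches pretraining.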
**❓ Questions for the community:**

- Has anyone else compared training outcomes using both methods?
- Are there any known issues with dynamic padding + LoRA in Whisper models?
- For models like `whisper-large-v3-turbo`, is there any strong reason to stick with `pad_or_trim()` for stability?
Thanks in advance for your thoughts!