How did Whisper manage previous context during training? #2476
Unanswered
joseluis-recog
asked this question in Q&A
I was wondering whether Whisper's developers made use of a padding token during batched training. Specifically, I've been experimenting with feeding the context (the tokens situated between <|startofprev|> and <|startoftranscript|>) during training. To facilitate batching, I need to pad these contexts so they match the length of the longest one in the batch.
However, I couldn’t find any documentation or references regarding the use of padding tokens during Whisper’s training process. I’ve tried various padding approaches, such as padding on the left, padding on the right, using -100, and even using token ID 50256 (which corresponds to the empty string, ""). In all cases, Whisper seems to output random, nonsensical tokens in response to the padded inputs.
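For reference, here is a minimal sketch of one of the approaches I tried: left-padding each context to the longest one in the batch and masking the prompt and padding positions out of the loss with -100. This is my own batching code, not Whisper's actual training setup; in particular, reusing <|endoftext|> as the pad token is my assumption, and I've omitted the language/task/timestamp tokens and the trailing <|endoftext|> for brevity.

```python
import torch
from whisper.tokenizer import get_tokenizer

tokenizer = get_tokenizer(multilingual=True)
SOP = tokenizer.sot_prev  # <|startofprev|>
SOT = tokenizer.sot       # <|startoftranscript|>
PAD = tokenizer.eot       # reusing <|endoftext|> as a pad token (my assumption)


def build_batch(contexts, targets):
    """contexts / targets: lists of token-ID lists, without special tokens."""
    max_ctx = max(len(c) for c in contexts)
    input_ids, labels = [], []
    for ctx, tgt in zip(contexts, targets):
        pad_len = max_ctx - len(ctx)
        # decoder input: [PAD ...] <|startofprev|> ctx <|startoftranscript|> tgt
        ids = [PAD] * pad_len + [SOP] + ctx + [SOT] + tgt
        # loss is computed only on the target tokens; everything else is -100
        lab = [-100] * (pad_len + 1 + len(ctx) + 1) + tgt
        input_ids.append(ids)
        labels.append(lab)
    # right-pad to the longest full sequence so the batch is rectangular
    max_len = max(len(x) for x in input_ids)
    input_ids = [x + [PAD] * (max_len - len(x)) for x in input_ids]
    labels = [x + [-100] * (max_len - len(x)) for x in labels]
    return torch.tensor(input_ids), torch.tensor(labels)
```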
This behavior leads me to suspect that instead of padding, Whisper's developers might have truncated the contexts to the shortest length in the batch during training. If that’s the case, it would explain why the model doesn’t recognize any specific padding pattern.
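If truncation was indeed the strategy, I imagine it could have looked something like the sketch below (again, just my guess, keeping the most recent tokens of each context so that every prompt in the batch shares one length and no padding token is ever needed):

```python
def truncate_to_shortest(contexts):
    # Truncate every context in the batch to the length of the shortest one,
    # keeping the most recent tokens, so no padding token is required.
    min_len = min(len(c) for c in contexts)
    return [c[len(c) - min_len:] for c in contexts]
```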
If anyone has insights into how this was handled during training or knows the correct approach, it would be incredibly helpful!