Hello,

I just started working with Whisper and want to extract the encoder output. For this I am using a fork that returns the encodings for each segment in the transcription results: #85

I am testing it on audio from the GTZAN dataset, roughly along the lines of the sketch below.
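Since the actual test script is not shown in the thread, here is a minimal sketch of such a test, assuming the stock openai-whisper package rather than the fork's API; the file path and model size are placeholders:

```python
# Hypothetical repro (not the fork's API): extract encoder features for
# one clip with the stock openai-whisper package and count the frames.
import torch
import whisper

model = whisper.load_model("base")                   # placeholder model size
audio = whisper.load_audio("gtzan/blues.00000.wav")  # hypothetical path; loads 16 kHz mono
audio = whisper.pad_or_trim(audio)                   # zero-pad or trim to exactly 30 s

# (80, 3000) log-mel spectrogram for the 30 s window
mel = whisper.log_mel_spectrogram(audio).to(model.device)

with torch.no_grad():
    features = model.encoder(mel.unsqueeze(0))       # add a batch dimension

print(features.shape)  # torch.Size([1, 1500, 512]) for the base model
```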
For most audios I do get 1500 as the total number of encoder frames, even when there is no speech in the audio, but for some samples I get fewer, e.g. only 950 total frames.

My questions are: why is the encoder output 1500 frames long, and why do some audios produce fewer?
Replies: 1 comment

I had this same question. Apparently Whisper uses a fixed-length encoder that always sees 30 s of audio, with anything shorter getting zero-padded [openai, supp-paper]. As for why it is 1500 frames in particular: the log-mel spectrogram has 3000 frames for 30 seconds of audio, and the convolution layers at the front of the encoder (the second of which has stride 2) reduce that to 1500 [code w/ params for mel spectrogram computation].
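The frame arithmetic can be checked directly against the constants defined in whisper/audio.py of the openai/whisper repo (a quick sketch):

```python
# 30 s of 16 kHz audio -> 3000 mel frames -> 1500 encoder frames.
from whisper.audio import SAMPLE_RATE, HOP_LENGTH, N_SAMPLES, N_FRAMES

print(SAMPLE_RATE)    # 16000 Hz
print(N_SAMPLES)      # 480000 samples = 30 s * 16000 Hz
print(N_FRAMES)       # 3000 mel frames = 480000 / 160 (hop length)
print(N_FRAMES // 2)  # 1500: the encoder's second conv has stride 2
```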