Hello,

I just started working with Whisper and want to extract the encoder output. For this I am using a fork that returns the encodings for each segment in the transcription results: #85

I am testing it on audio from the GTZAN dataset, roughly along the lines of the sketch below.
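Since the actual test script is not shown in the thread, here is a minimal sketch of such a test, assuming the stock openai-whisper package rather than the fork's API; the file path and model size are placeholders:

```python
# Hypothetical repro (not the fork's API): extract encoder features for
# one clip with the stock openai-whisper package and count the frames.
import torch
import whisper

model = whisper.load_model("base")                   # placeholder model size
audio = whisper.load_audio("gtzan/blues.00000.wav")  # hypothetical path; loads 16 kHz mono
audio = whisper.pad_or_trim(audio)                   # zero-pad or trim to exactly 30 s

# (80, 3000) log-mel spectrogram for the 30 s window
mel = whisper.log_mel_spectrogram(audio).to(model.device)

with torch.no_grad():
    features = model.encoder(mel.unsqueeze(0))       # add a batch dimension

print(features.shape)  # torch.Size([1, 1500, 512]) for the base model
```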
For most audios I do get 1500 as the total number of encoder frames, even when there is no speech in the audio, but for some samples I get fewer, e.g. only 950 total frames.

My questions are: why is the encoder output 1500 frames long, and why do some audios produce fewer?
Replies: 1 comment

I had this same question. Apparently Whisper uses a fixed-length encoder that always sees 30 s of audio, with anything shorter getting zero-padded [openai, supp-paper]. As for why it is 1500 frames in particular: the log-mel spectrogram has 3000 frames for 30 seconds of audio, and the convolution layers at the front of the encoder (the second of which has stride 2) reduce that to 1500 [code w/ params for mel spectrogram computation].
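The frame arithmetic can be checked directly against the constants defined in whisper/audio.py of the openai/whisper repo (a quick sketch):

```python
# 30 s of 16 kHz audio -> 3000 mel frames -> 1500 encoder frames.
from whisper.audio import SAMPLE_RATE, HOP_LENGTH, N_SAMPLES, N_FRAMES

print(SAMPLE_RATE)    # 16000 Hz
print(N_SAMPLES)      # 480000 samples = 30 s * 16000 Hz
print(N_FRAMES)       # 3000 mel frames = 480000 / 160 (hop length)
print(N_FRAMES // 2)  # 1500: the encoder's second conv has stride 2
```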