Audio Feature Extraction using Whisper #1246

pegahsalehi · 2023-04-17T08:24:12Z

Hi,
I want to use Whisper to extract logits from audio using speechbrain.lobes.models.huggingface_whisper:

from speechbrain.lobes.models.huggingface_whisper import HuggingFaceWhisper
import speechbrain as sb

model_hub_whisper = "openai/whisper-tiny"
model_whisper = HuggingFaceWhisper(model_hub_whisper, save_path='----')

# Load the waveform and add a batch dimension: (1, num_samples)
source = sb.dataio.dataio.read_audio('./intro.wav').squeeze()
source = source.unsqueeze(0)
print(source.shape)

# Run only the encoder; encoder_only can also be passed when instantiating the model
model_whisper.encoder_only = True
fea_whisper = model_whisper(source)
print(fea_whisper.shape)

How can I extract the logits from audio in the shape (X, Y), where X is a fixed number determined by the Whisper model and Y depends on the length of the audio?

Answered by jongwook


jongwook · 2023-05-05T09:36:45Z

jongwook
May 5, 2023
Maintainer

I presume you mean audio features rather than logits, since you're using only the encoder. The encoder always takes 30 seconds of audio as input, and we trim or pad the audio to match this length. As a result, the encoded features also span 30 seconds. You can slice the features if the input was shorter than 30 seconds.

More code-specific questions could be better answered by SpeechBrain's maintainers.
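
To make the slicing suggestion concrete, here is a minimal sketch of how the number of encoder frames that cover real audio could be computed. The constants match those in openai/whisper (16 kHz audio, a 160-sample mel hop, a 30-second window, and a stride-2 convolution in the encoder, giving 1500 output positions). The helper name `valid_encoder_frames` is hypothetical and not part of Whisper or SpeechBrain, and rounding up rather than down is an assumption you may want to change.

```python
# Constants as used by openai/whisper (audio front end and encoder).
SAMPLE_RATE = 16_000   # Hz
HOP_LENGTH = 160       # samples between mel frames (10 ms)
CHUNK_SECONDS = 30     # the encoder always sees a 30-second window
CONV_STRIDE = 2        # the second encoder convolution halves the time axis

MEL_FRAMES = CHUNK_SECONDS * SAMPLE_RATE // HOP_LENGTH  # 3000 mel frames
ENCODER_FRAMES = MEL_FRAMES // CONV_STRIDE              # 1500 encoder positions


def valid_encoder_frames(num_samples: int) -> int:
    """Hypothetical helper: encoder frames that overlap real (unpadded) audio.

    Assumes 16 kHz input. Each encoder frame spans
    HOP_LENGTH * CONV_STRIDE = 320 samples (20 ms).
    """
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    samples_per_frame = HOP_LENGTH * CONV_STRIDE
    # Ceiling division, so a partially filled last frame is kept (an assumption).
    frames = -(-num_samples // samples_per_frame)
    # Audio longer than 30 s is trimmed, so never exceed the fixed length.
    return min(frames, ENCODER_FRAMES)


# A 6-second clip at 16 kHz has 96,000 samples.
n_valid = valid_encoder_frames(6 * SAMPLE_RATE)
```

With encoder output of shape (batch, 1500, dim), the features for the real audio would then be `features[:, :n_valid, :]`.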

3 replies

pegahsalehi May 15, 2023
Author

Hi @jongwook, thanks a lot for your help.
How can I extract logits in the shape (X, Y)?

LiuRicky Jul 12, 2024

@jongwook, thanks for your reply. You mentioned that "You can slice the features if the input was shorter than 30 seconds."

I have a 6-second audio clip that I feed to the Whisper encoder. The clip is padded to 30 s, and the final feature shape is (1500, 1024). Can I just slice (:300, :) as my audio feature?

However, when I print the features at (610:620, :), they are non-zero, so I think the padded frames do take part in the encoder's transformer layers, since "mask is ignored in whisper encoder". Given these findings, can I still just slice (:300, :) as my audio feature?

LiuRicky Jul 12, 2024

Will slicing cause feature loss, since the padded frames are involved in the Whisper encoder?
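
The observation above can be illustrated with a toy example. This is not Whisper's code: it is a single-head self-attention with identity projections and no mask, written in plain Python, and the function name and input values are made up for illustration. It shows two things under those assumptions: padded positions get non-zero outputs, and the outputs at the real positions change when the padding content changes. Slicing therefore keeps the frames aligned with the real audio, but those frames were computed with the padding in their attention context. The sketch does not measure how much this matters for the real model.

```python
import math


def self_attention(x):
    """Toy single-head self-attention: identity projections, no attention mask."""
    dim = len(x[0])
    out = []
    for query in x:
        # Scaled dot-product scores against every position, padding included.
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim) for key in x]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, x)) / total for d in range(dim)])
    return out


real = [[1.0, 0.0], [0.0, 1.0]]   # two "real" frames (made-up values)
pad_zeros = [[0.0, 0.0]] * 3      # padding filled with zeros
pad_other = [[0.5, 0.5]] * 3      # padding filled with a different constant

out_zeros = self_attention(real + pad_zeros)
out_other = self_attention(real + pad_other)

# Keep only the positions that correspond to real frames.
sliced_zeros = out_zeros[:len(real)]
sliced_other = out_other[:len(real)]

padded_output_is_nonzero = any(abs(v) > 0 for v in out_zeros[len(real)])
real_frames_depend_on_padding = sliced_zeros != sliced_other
```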

pegahsalehi · 2024-08-05T13:25:05Z

pegahsalehi
Aug 5, 2024
Author

@LiuRicky, this is exactly my problem. Did you find a solution?

0 replies

jeeyung · 2024-08-07T17:49:36Z

I am also facing the same problem.

0 replies
