(Possible) redundant computation in timing.py #1664
Unanswered
shauncassini asked this question in Q&A
Hello,
I've been messing around with Whisper for some research in speech translation. I'm currently looking at Whisper's ability to infer word-level timestamps. From what I understand, each audio signal is split into 30s segments and a "winning" hypothesis transcription is found for each segment (i.e., decoded with beam search). The word-level timestamps are then found based on this hypothesis and the decoder's cross-attention weights (pretty cool stuff).
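For context, this is the code path I'm talking about; a minimal sketch of how the word timestamps get triggered (the model size and file name are just placeholders):

```python
import whisper

# Load any Whisper checkpoint; "base" is illustrative.
model = whisper.load_model("base")

# word_timestamps=True is what routes transcription through
# add_word_timestamps -> find_alignment in timing.py.
result = model.transcribe("audio.wav", word_timestamps=True)

for segment in result["segments"]:
    for word in segment["words"]:
        print(f'{word["word"]}  [{word["start"]:.2f}s -> {word["end"]:.2f}s]')
```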
I think I've spotted redundant computation when `find_alignment` is called (within `add_word_timestamps`). `find_alignment` seems to re-encode the audio segment in order to get the logits needed for alignment. This happens on line 195 of `find_alignment`: `logits = model(mel.unsqueeze(0), tokens.unsqueeze(0))[0]`. Instead, why not pass in the already-computed audio representations? This change would look like `logits = model.logits(tokens.unsqueeze(0), audio_features.unsqueeze(0))[0]`, with the audio features coming from `result.audio_features`, computed on line 234 of `result.py`.

I may be missing something, though: is there a reason the representations are being recomputed? A sketch of the two paths is below.
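To make the suggestion concrete, here is a rough comparison of the two paths (treat this as a sketch; the plumbing needed to get `audio_features` into `find_alignment` isn't shown):

```python
# Inside find_alignment (timing.py, line 195), roughly:

# Current: model(mel, tokens) runs the encoder *and* the decoder,
# so the 30s segment is encoded a second time just to recover logits.
logits = model(mel.unsqueeze(0), tokens.unsqueeze(0))[0]

# Suggested: if the encoder output from decoding were passed down
# (e.g. from result.audio_features), model.logits() would run only
# the decoder on the cached features, skipping the re-encode.
logits = model.logits(tokens.unsqueeze(0), audio_features.unsqueeze(0))[0]
```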
Thanks!