(Possible) redundant computation in timing.py #1664
Unanswered
shauncassini asked this question in Q&A
Hello,
I've been messing around with Whisper for some research in speech translation. I'm currently looking at Whisper's ability to infer word-level timestamps. From what I understand, each audio signal is split into 30s segments and a "winning" hypothesis transcription is found for each segment (i.e., decoded with beam search). The word-level timestamps are then found based on this hypothesis and the decoder's cross-attention weights (pretty cool stuff).
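For context, this is the code path I'm talking about; a minimal sketch of how the word timestamps get triggered (the model size and file name are just placeholders):

```python
import whisper

# Load any Whisper checkpoint; "base" is illustrative.
model = whisper.load_model("base")

# word_timestamps=True is what routes transcription through
# add_word_timestamps -> find_alignment in timing.py.
result = model.transcribe("audio.wav", word_timestamps=True)

for segment in result["segments"]:
    for word in segment["words"]:
        print(f'{word["word"]}  [{word["start"]:.2f}s -> {word["end"]:.2f}s]')
```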
I think I've spotted redundant computation when `find_alignment` is called (within `add_word_timestamps`). `find_alignment` seems to re-encode the audio segment in order to get the logits needed for alignment. This happens on line 195 of `find_alignment`: `logits = model(mel.unsqueeze(0), tokens.unsqueeze(0))[0]`. Instead, why not pass in the already-computed audio representations? This change would look like `logits = model.logits(tokens.unsqueeze(0), audio_features.unsqueeze(0))[0]`, with the audio features coming from `result.audio_features`, computed on line 234 of `result.py`.

I may be missing something, though: is there a reason the representations are being recomputed? A sketch of the two paths is below.
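To make the suggestion concrete, here is a rough comparison of the two paths (treat this as a sketch; the plumbing needed to get `audio_features` into `find_alignment` isn't shown):

```python
# Inside find_alignment (timing.py, line 195), roughly:

# Current: model(mel, tokens) runs the encoder *and* the decoder,
# so the 30s segment is encoded a second time just to recover logits.
logits = model(mel.unsqueeze(0), tokens.unsqueeze(0))[0]

# Suggested: if the encoder output from decoding were passed down
# (e.g. from result.audio_features), model.logits() would run only
# the decoder on the cached features, skipping the re-encode.
logits = model.logits(tokens.unsqueeze(0), audio_features.unsqueeze(0))[0]
```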
Thanks!