Replies: 2 comments
-
Not really; that limit is built into the model. The 448-token context is split in half: 224 tokens for the prompt and 224 for the output. The input window is also fixed at 30 seconds of audio, so if you're talking about long sequences of audio, as opposed to long sequences of tokens, that is likewise limited to 30 seconds. How to deal with these limits:
  - If your 30 seconds of audio contains speech rapid enough to generate more than 224 tokens of output (or you're dealing with a language that uses more tokens for the same duration), try cutting the audio into smaller pieces so each fits within the token limit.
  - If you're trying to squeeze more than 224 tokens into the prompt, crop it to 224.
  - If you want to handle audio longer than 30 seconds, split it into 30-second segments and stitch the results together (which is what the Whisper code does).
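A minimal sketch of the two workarounds above, assuming 16 kHz mono audio as a NumPy array; the function and parameter names here are illustrative, not part of Whisper's API:

```python
import numpy as np

def split_into_segments(audio: np.ndarray, sample_rate: int = 16000,
                        segment_seconds: int = 30) -> list:
    """Split a 1-D audio array into fixed-length segments.

    Whisper's encoder consumes 30 seconds of audio per window, so longer
    recordings must be transcribed segment by segment and the resulting
    text stitched back together.
    """
    segment_len = segment_seconds * sample_rate
    return [audio[i:i + segment_len] for i in range(0, len(audio), segment_len)]

def trim_prompt(prompt_tokens: list, max_prompt_tokens: int = 224) -> list:
    """Crop a prompt to Whisper's 224-token prompt budget (half of the
    448-token decoder context), keeping the most recent tokens."""
    return prompt_tokens[-max_prompt_tokens:]
```

For example, a 75-second recording yields three segments (30 s, 30 s, and 15 s), each transcribed independently.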
-
For the Whisper model from Hugging Face, you can set `model.config.max_target_positions = 512` or any other number.
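A sketch of that config change, assuming the Hugging Face `transformers` library. Note the caveat: the pretrained checkpoint only contains learned positional embeddings for 448 decoder positions, so a larger `max_target_positions` mainly makes sense when instantiating or retraining a model, not as a free extension of a pretrained one:

```python
from transformers import WhisperConfig

# Build a Whisper config with a longer decoder context than the
# default 448 positions. Positions beyond 448 would be randomly
# initialized in a model created from this config.
config = WhisperConfig(max_target_positions=512)
print(config.max_target_positions)  # 512
```

To apply this to a pretrained checkpoint you would pass the override to `from_pretrained`, but be aware the extra positional embeddings start untrained.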
-
Hi. I am dealing with long audio sequences that produce between 400 and 700 decoder ids. Whisper by default only supports 448, but is there any way to change that?