Problem with Excessive Tokenization in ASR Model for Non-Latin Languages #1346
kikozi2000 started this conversation in General
Hello,
I'm currently running into a challenge related to tokenization. When the model transcribes audio into non-Latin text, it generates a significantly larger number of tokens, even for less than 30 seconds of speech.
This leads to a situation where the number of output tokens for a segment exceeds the model's maximum token limit. With the current settings, the model can handle at most 224 tokens per segment (448 / 2), but the tokenized output exceeds this limit, resulting in incomplete output.
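For reference, this is roughly how I've been checking the token counts against the per-segment limit. It's a minimal sketch assuming the openai-whisper package's multilingual tokenizer; if your setup uses a different backend or tokenizer, substitute that one. The transcript string is just a placeholder, not my actual output.

```python
# Minimal sketch: compare a transcript's token count with the per-segment limit.
# Assumes the openai-whisper package's multilingual tokenizer; swap in your
# backend's tokenizer if it differs.
from whisper.tokenizer import get_tokenizer

MAX_TOKENS_PER_SEGMENT = 448 // 2  # 224, the per-segment limit mentioned above

tokenizer = get_tokenizer(multilingual=True)

transcript = "PUT A TRANSCRIBED SEGMENT IN YOUR TARGET LANGUAGE HERE"
n_tokens = len(tokenizer.encode(transcript))

print(f"{n_tokens} tokens (limit per segment: {MAX_TOKENS_PER_SEGMENT})")
if n_tokens > MAX_TOKENS_PER_SEGMENT:
    print("This segment would be truncated at the token limit.")
```

In my case, segments transcribed into non-Latin script regularly come out well above 224 tokens, while comparable Latin-script segments stay under the limit.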
If anyone has faced a similar problem, or has suggestions on how to work around it, I'd appreciate your input.
Thanks in advance!