Problem with Excessive Tokenization in ASR Model for Non-Latin Languages #1346
kikozi2000 started this conversation in General
Hello,
I'm currently running into a challenge related to tokenization. When the model transcribes audio into non-Latin text, it generates a significantly larger number of tokens, even for less than 30 seconds of speech.
This leads to a situation where the number of output tokens for a segment exceeds the model's maximum token limit. With the current settings, the model can handle at most 224 tokens per segment (448 / 2), but the tokenized output exceeds this limit, resulting in incomplete output.
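For reference, this is roughly how I've been checking the token counts against the per-segment limit. It's a minimal sketch assuming the openai-whisper package's multilingual tokenizer; if your setup uses a different backend or tokenizer, substitute that one. The transcript string is just a placeholder, not my actual output.

```python
# Minimal sketch: compare a transcript's token count with the per-segment limit.
# Assumes the openai-whisper package's multilingual tokenizer; swap in your
# backend's tokenizer if it differs.
from whisper.tokenizer import get_tokenizer

MAX_TOKENS_PER_SEGMENT = 448 // 2  # 224, the per-segment limit mentioned above

tokenizer = get_tokenizer(multilingual=True)

transcript = "PUT A TRANSCRIBED SEGMENT IN YOUR TARGET LANGUAGE HERE"
n_tokens = len(tokenizer.encode(transcript))

print(f"{n_tokens} tokens (limit per segment: {MAX_TOKENS_PER_SEGMENT})")
if n_tokens > MAX_TOKENS_PER_SEGMENT:
    print("This segment would be truncated at the token limit.")
```

In my case, segments transcribed into non-Latin script regularly come out well above 224 tokens, while comparable Latin-script segments stay under the limit.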
If anyone has faced a similar problem, or has suggestions on how to work around it, I'd appreciate your input.
Thanks in advance!