-
I noticed that `n_vocab` is 51865, but the tokenizer only has 50364 unique tokens:

```python
len(whisper.tokenizer.get_tokenizer('vi').tokenizer)
# 50364

checkpoint['dims']
# {'n_mels': 80,
#  'n_vocab': 51865,
#  'n_audio_ctx': 1500,
#  'n_audio_state': 384,
#  'n_audio_head': 6,
#  'n_audio_layer': 4,
#  'n_text_ctx': 448,
#  'n_text_state': 384,
#  'n_text_head': 6,
#  'n_text_layer': 4}
```

Can you explain the output size of the model for me? It looks like the model sometimes generates empty tokens whose ids fall outside the tokenizer's vocabulary size.
-
Tokens 50364 and above are all timestamp tokens. 50364 is time 0 s.
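To make this concrete, here is a minimal sketch (assuming a recent openai-whisper install, where the returned `Tokenizer` exposes `timestamp_begin`, `decode_with_timestamps`, and a `decode` that ignores timestamp ids) that verifies the timestamp range and shows why those ids can look like "empty" tokens:

```python
import whisper

# Multilingual tokenizer, same family as get_tokenizer('vi') in the question.
tokenizer = whisper.tokenizer.get_tokenizer(multilingual=True, language="vi", task="transcribe")

print(tokenizer.timestamp_begin)                  # 50364 -> id of the first timestamp token
print(tokenizer.decode_with_timestamps([50364]))  # '<|0.00|>'
print(tokenizer.decode_with_timestamps([51864]))  # '<|30.00|>'

# The 1501 timestamp tokens fill the gap between the text vocabulary and n_vocab:
print(50364 + 1501)                               # 51865 == n_vocab

# Plain decode() ignores timestamp ids in recent versions, which is why they
# can show up as empty strings when decoded naively.
print(repr(tokenizer.decode([50364])))            # ''
```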
-
Is there anywhere we can view the entire lexicon map, i.e. which integer maps to which English token?
-
Referring to line 344 in b38a1f2: `*[f"<|{i * 0.02:.2f}|>" for i in range(1501)]`. Ids 50364 through 51864 are each a timestamp token, from <|0.00|> to <|30.00|> in 0.02 s intervals.
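If you want to browse the whole id-to-token map asked about above, one option (a sketch, assuming the openai-whisper package; `dump_vocab` is just a hypothetical helper name) is to decode each id with `decode_with_timestamps`, which also renders the timestamp tokens instead of dropping them:

```python
import whisper

tokenizer = whisper.tokenizer.get_tokenizer(multilingual=True)

def dump_vocab(ids):
    # Print id -> token text; decode_with_timestamps keeps <|x.xx|> tokens visible.
    for i in ids:
        print(i, repr(tokenizer.decode_with_timestamps([i])))

dump_vocab(range(0, 5))          # ordinary BPE text tokens
dump_vocab(range(50257, 50261))  # special tokens: <|endoftext|>, <|startoftranscript|>, language tags
dump_vocab(range(50363, 50367))  # <|notimestamps|>, then <|0.00|>, <|0.02|>, <|0.04|>
dump_vocab(range(51862, 51865))  # <|29.96|>, <|29.98|>, <|30.00|>
```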