-
I noticed that `n_vocab` is 51865, but the tokenizer only has 50364 unique tokens:

```python
len(whisper.tokenizer.get_tokenizer('vi').tokenizer)
# 50364

checkpoint['dims']
# {'n_mels': 80,
#  'n_vocab': 51865,
#  'n_audio_ctx': 1500,
#  'n_audio_state': 384,
#  'n_audio_head': 6,
#  'n_audio_layer': 4,
#  'n_text_ctx': 448,
#  'n_text_state': 384,
#  'n_text_head': 6,
#  'n_text_layer': 4}
```

Can you explain the output size of the model for me? It looks like the model sometimes generates empty tokens whose ids fall outside the tokenizer's vocabulary size.
-
Tokens 50364 and above are all timestamp tokens. 50364 is time 0 s.
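To make this concrete, here is a minimal sketch (assuming a recent openai-whisper install, where the returned `Tokenizer` exposes `timestamp_begin`, `decode_with_timestamps`, and a `decode` that ignores timestamp ids) that verifies the timestamp range and shows why those ids can look like "empty" tokens:

```python
import whisper

# Multilingual tokenizer, same family as get_tokenizer('vi') in the question.
tokenizer = whisper.tokenizer.get_tokenizer(multilingual=True, language="vi", task="transcribe")

print(tokenizer.timestamp_begin)                  # 50364 -> id of the first timestamp token
print(tokenizer.decode_with_timestamps([50364]))  # '<|0.00|>'
print(tokenizer.decode_with_timestamps([51864]))  # '<|30.00|>'

# The 1501 timestamp tokens fill the gap between the text vocabulary and n_vocab:
print(50364 + 1501)                               # 51865 == n_vocab

# Plain decode() ignores timestamp ids in recent versions, which is why they
# can show up as empty strings when decoded naively.
print(repr(tokenizer.decode([50364])))            # ''
```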
-
Is there anywhere we can view the entire lexicon map, i.e. which integer maps to which English token?
-
Referring to line 344 in b38a1f2: `*[f"<|{i * 0.02:.2f}|>" for i in range(1501)]`. Ids 50364 through 51864 are each a timestamp token, from <|0.00|> to <|30.00|> in 0.02 s intervals.
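If you want to browse the whole id-to-token map asked about above, one option (a sketch, assuming the openai-whisper package; `dump_vocab` is just a hypothetical helper name) is to decode each id with `decode_with_timestamps`, which also renders the timestamp tokens instead of dropping them:

```python
import whisper

tokenizer = whisper.tokenizer.get_tokenizer(multilingual=True)

def dump_vocab(ids):
    # Print id -> token text; decode_with_timestamps keeps <|x.xx|> tokens visible.
    for i in ids:
        print(i, repr(tokenizer.decode_with_timestamps([i])))

dump_vocab(range(0, 5))          # ordinary BPE text tokens
dump_vocab(range(50257, 50261))  # special tokens: <|endoftext|>, <|startoftranscript|>, language tags
dump_vocab(range(50363, 50367))  # <|notimestamps|>, then <|0.00|>, <|0.02|>, <|0.04|>
dump_vocab(range(51862, 51865))  # <|29.96|>, <|29.98|>, <|30.00|>
```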