To fine-tune Whisper for a new task, I want to add a non-text token that Whisper should learn to insert at the proper places in its output (adding one token to the tokenizer's 51865). As @jongwook explained in #620, I can't just add it to the special tokens, because it would overrun the timestamp tokens.
If you need just one more token, you could re-purpose `<|startoflm|>`, which wasn't used during training (more context on this token in #414 (comment)):

whisper/whisper/tokenizer.py, Lines 279 to 288 in 0b5dcfd

If there are multiple special tokens, you can add them to the list above and resize the token embedding tensor to account for the new vocab size. You would also need to edit a few places where the vocab size is hard-coded, like:

Lines 229 to 231 in 0b5dcfd
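A rough sketch of what "resize the token embedding tensor" could look like, assuming the current openai/whisper model layout (an `nn.Embedding` at `model.decoder.token_embedding` whose weight is reused as the output projection, and the vocab size recorded in `model.dims.n_vocab`). Initializing the new rows from the mean of the existing embeddings is just one reasonable choice, not something prescribed by the repo:

```python
import torch
import whisper

n_new_tokens = 1  # e.g. one added special token

model = whisper.load_model("small")
old_emb = model.decoder.token_embedding            # nn.Embedding(n_vocab, n_state)
old_vocab, n_state = old_emb.weight.shape

# Build a larger embedding and copy the pretrained rows into it.
new_emb = torch.nn.Embedding(old_vocab + n_new_tokens, n_state).to(
    device=old_emb.weight.device, dtype=old_emb.weight.dtype
)
with torch.no_grad():
    new_emb.weight[:old_vocab] = old_emb.weight
    # Initialize the new row(s) from the mean of the existing embeddings
    # (an arbitrary but common choice).
    new_emb.weight[old_vocab:] = old_emb.weight.mean(dim=0, keepdim=True)

model.decoder.token_embedding = new_emb

# The decoder ties the output projection to token_embedding.weight, so the
# logits grow with the embedding; only the recorded vocab size is left to fix.
model.dims.n_vocab = old_vocab + n_new_tokens
```

The new token string itself still has to be appended to the special-token list in tokenizer.py (the lines linked above) so that the tokenizer maps it to the new id.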
@jongwook Can you shed some more light on resizing the token embedding tensor? Because I checked the …
@jongwook another question: since the vocab size is hard-coded, if I add two special tokens do I just increase the current value by 2? Sorry, I'm a bit new to the Whisper architecture.
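For illustration, the bookkeeping that question implies, with made-up token names (the 51865 figure is the multilingual vocab size mentioned at the top of the thread):

```python
# Hypothetical example: appending two special tokens to the multilingual
# tokenizer, whose 51865 tokens occupy ids 0..51864.
ORIGINAL_VOCAB_SIZE = 51865
new_special_tokens = ["<|mytask|>", "<|myothertask|>"]  # made-up names

new_vocab_size = ORIGINAL_VOCAB_SIZE + len(new_special_tokens)    # 51867
new_token_ids = list(range(ORIGINAL_VOCAB_SIZE, new_vocab_size))  # [51865, 51866]

# Any place that hard-codes 51865, or compares n_vocab against it, would need
# to use new_vocab_size after the change.
print(new_vocab_size, new_token_ids)
```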