To fine-tune Whisper for a new task, I want to add a non-text token that Whisper should learn to insert at the proper places in its output (adding one token to the tokenizer's 51865). As @jongwook explained in #620, I can't just add it to the special tokens, because it would overrun the timestamp tokens.
If you need just one more token, you could re-purpose `<|startoflm|>`, which wasn't used during training (more context on this token in #414 (comment)):

whisper/whisper/tokenizer.py, Lines 279 to 288 in 0b5dcfd

If there are multiple special tokens, you can add them to the list above and resize the token embedding tensor to account for the new vocab size. You would also need to edit a few places where the vocab size is hard-coded, like:

Lines 229 to 231 in 0b5dcfd
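A rough sketch of what "resize the token embedding tensor" could look like, assuming the current openai/whisper model layout (an `nn.Embedding` at `model.decoder.token_embedding` whose weight is reused as the output projection, and the vocab size recorded in `model.dims.n_vocab`). Initializing the new rows from the mean of the existing embeddings is just one reasonable choice, not something prescribed by the repo:

```python
import torch
import whisper

n_new_tokens = 1  # e.g. one added special token

model = whisper.load_model("small")
old_emb = model.decoder.token_embedding            # nn.Embedding(n_vocab, n_state)
old_vocab, n_state = old_emb.weight.shape

# Build a larger embedding and copy the pretrained rows into it.
new_emb = torch.nn.Embedding(old_vocab + n_new_tokens, n_state).to(
    device=old_emb.weight.device, dtype=old_emb.weight.dtype
)
with torch.no_grad():
    new_emb.weight[:old_vocab] = old_emb.weight
    # Initialize the new row(s) from the mean of the existing embeddings
    # (an arbitrary but common choice).
    new_emb.weight[old_vocab:] = old_emb.weight.mean(dim=0, keepdim=True)

model.decoder.token_embedding = new_emb

# The decoder ties the output projection to token_embedding.weight, so the
# logits grow with the embedding; only the recorded vocab size is left to fix.
model.dims.n_vocab = old_vocab + n_new_tokens
```

The new token string itself still has to be appended to the special-token list in tokenizer.py (the lines linked above) so that the tokenizer maps it to the new id.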
@jongwook Can you shed some more light on resizing the token embedding tensor? Because I checked the …
@jongwook another question: since the vocab size is hard-coded, if I add two special tokens do I just increase the current value by 2? Sorry, I'm a bit new to the Whisper architecture.
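For illustration, the bookkeeping that question implies, with made-up token names (the 51865 figure is the multilingual vocab size mentioned at the top of the thread):

```python
# Hypothetical example: appending two special tokens to the multilingual
# tokenizer, whose 51865 tokens occupy ids 0..51864.
ORIGINAL_VOCAB_SIZE = 51865
new_special_tokens = ["<|mytask|>", "<|myothertask|>"]  # made-up names

new_vocab_size = ORIGINAL_VOCAB_SIZE + len(new_special_tokens)    # 51867
new_token_ids = list(range(ORIGINAL_VOCAB_SIZE, new_vocab_size))  # [51865, 51866]

# Any place that hard-codes 51865, or compares n_vocab against it, would need
# to use new_vocab_size after the change.
print(new_vocab_size, new_token_ids)
```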