If you need just one more token, you could re-purpose `<|startoflm|>`, which wasn't used during training (more context on this token in #414 (comment)):

```python
# from whisper/tokenizer.py; LANGUAGES maps supported language codes to names
specials = [
    "<|startoftranscript|>",
    *[f"<|{lang}|>" for lang in LANGUAGES.keys()],
    "<|translate|>",
    "<|transcribe|>",
    "<|startoflm|>",
    "<|startofprev|>",
    "<|nospeech|>",
    "<|notimestamps|>",
]
```
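
For the single-token case, here is a minimal sketch of looking up the existing `<|startoflm|>` id and reusing it, assuming whisper's `get_tokenizer` helper and the `sot_lm` property on the returned tokenizer:

```python
from whisper.tokenizer import get_tokenizer

tokenizer = get_tokenizer(multilingual=True)
# <|startoflm|> is already in the vocabulary, so no resizing is needed;
# just reuse its id as your extra control token during fine-tuning.
extra_token_id = tokenizer.sot_lm
```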

If there are multiple special tokens, you can add them to the list above and resize the token embedding tensor to account for the new vocab size. You would also need to edit the few places where the vocab size is hard-coded.
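
A rough sketch of the resizing step is below. The attribute names (`decoder.token_embedding`, `dims.n_vocab`) are assumptions based on the current model definition in `whisper/model.py`; adjust for your checkpoint. Since the decoder computes logits by multiplying with the (tied) token embedding weights, growing the embedding also grows the output layer:

```python
import torch
import whisper

model = whisper.load_model("small")
n_new = 2  # number of extra special tokens to add

old_emb = model.decoder.token_embedding  # nn.Embedding(n_vocab, n_state), assumed attribute
old_vocab, n_state = old_emb.weight.shape

new_emb = torch.nn.Embedding(old_vocab + n_new, n_state).to(
    old_emb.weight.device, dtype=old_emb.weight.dtype
)
with torch.no_grad():
    new_emb.weight[:old_vocab] = old_emb.weight              # keep pretrained rows
    new_emb.weight[old_vocab:] = old_emb.weight.mean(dim=0)  # init new rows near the mean

model.decoder.token_embedding = new_emb
model.dims.n_vocab = old_vocab + n_new  # keep the stored dims in sync
```

The positional embeddings and the rest of the model are unaffected; only the vocabulary dimension changes.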
