fix: convert_hf_to_gguf - change Jamba non-sentencepiece mode (tokenizer.json) vocab construction #16470
base: master
Conversation
I don't think this works correctly; you're basically recreating an SPM vocab without scores. Have you checked that tokenization is identical to AutoTokenizer's? (You can use convert_hf_to_gguf_update.py to generate test files and test with test-tokenizer-0.)
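For example, a minimal parity check along those lines (a sketch, assuming llama-cpp-python is installed; "jamba.gguf" and "model_dir" are placeholder paths) could look like:

from llama_cpp import Llama
from transformers import AutoTokenizer

# Reference tokenizer from the original HF checkpoint (placeholder path).
hf_tok = AutoTokenizer.from_pretrained("model_dir")
# Load only the tokenizer/vocab from the converted GGUF (placeholder path).
gguf_tok = Llama(model_path="jamba.gguf", vocab_only=True)

for text in ["Hello world", " Hello World!", "\n\ntest", "ied 4 ½ months"]:
    hf_ids = hf_tok.encode(text, add_special_tokens=False)
    gguf_ids = gguf_tok.tokenize(text.encode("utf-8"), add_bos=False, special=False)
    assert hf_ids == gguf_ids, f"mismatch on {text!r}: {hf_ids} vs {gguf_ids}"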
def get_vocab_base_pre(self, tokenizer) -> str:
    del tokenizer  # unused

-   return "gpt-2"
+   return "default"
Remove this method, only for BPE.
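(In convert_hf_to_gguf.py, get_vocab_base_pre exists to pick a BPE pre-tokenizer variant; an SPM-style "llama" vocab doesn't need one, so the override can go.)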
assert max(vocab.values()) < vocab_size

tokpre = self.get_vocab_base_pre(tokenizer)
Suggested change:
    assert max(vocab.values()) < vocab_size
-   tokpre = self.get_vocab_base_pre(tokenizer)
tokens.append(token)

self.gguf_writer.add_tokenizer_model("llama")
self.gguf_writer.add_tokenizer_pre(tokpre)
Suggested change:
-   self.gguf_writer.add_tokenizer_pre(tokpre)
+   self.gguf_writer.add_tokenizer_pre("default")
self.gguf_writer.add_token_list(tokens)
self.gguf_writer.add_token_types(toktypes)
No scores?
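For reference, SPM-style ("llama") vocabs are normally written with a parallel scores array. A minimal sketch, assuming a scores list is collected alongside tokens and toktypes (with 0.0 as a placeholder where tokenizer.json provides no real SentencePiece score):

# Placeholder scores; tokenizer.json carries no SentencePiece scores.
scores: list[float] = [0.0] * len(tokens)

self.gguf_writer.add_tokenizer_model("llama")
self.gguf_writer.add_tokenizer_pre("default")
self.gguf_writer.add_token_list(tokens)
self.gguf_writer.add_token_scores(scores)
self.gguf_writer.add_token_types(toktypes)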
self.gguf_writer.add_token_list(tokens)
self.gguf_writer.add_token_types(toktypes)

special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=True)
Suggested change:
-   special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=True)
+   special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=False)
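(Merges are a BPE artifact; an SPM-style "llama" vocab has none to load.)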
token = tokenizer.decode(
    tokenizer.encode(token, add_special_tokens=False)
)
Suggested change:
-   token = tokenizer.decode(
-       tokenizer.encode(token, add_special_tokens=False)
-   )
+   token = tokenizer.decode(tokenizer.encode(token, add_special_tokens=False))
logger.info(
    f"{repr(previous_token)} is encoded and decoded back to {repr(token)} using AutoTokenizer"
)
Suggested change:
-   logger.info(
-       f"{repr(previous_token)} is encoded and decoded back to {repr(token)} using AutoTokenizer"
-   )
+   logger.info(f"{repr(previous_token)} is encoded and decoded back to {repr(token)} using AutoTokenizer")
if added_tokens_decoder[i].special or self.does_token_look_special(
    token
):
Suggested change:
-   if added_tokens_decoder[i].special or self.does_token_look_special(
-       token
-   ):
+   if added_tokens_decoder[i].special or self.does_token_look_special(token):
tokenizer = AutoTokenizer.from_pretrained(
    self.dir_model, trust_remote_code=True
)
Suggested change:
-   tokenizer = AutoTokenizer.from_pretrained(
-       self.dir_model, trust_remote_code=True
-   )
+   tokenizer = AutoTokenizer.from_pretrained(self.dir_model)
The JambaModel implementation in convert_hf_to_gguf.py was incorrectly constructing its vocab with the gpt-2 tokenizer logic when no SentencePiece model was present (i.e., the tokenizer.json path). Jamba actually uses a llama tokenizer, not gpt-2. This change updates the vocab build path to use the correct llama tokenizer for non-SentencePiece Jamba models, and includes several small adjustments within the Jamba llama-based tokenizer construction.
No changes are expected for other model types.
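For illustration, the general shape of the corrected path (a sketch, not the exact patch; dir_model stands in for the model directory) is:

from transformers import AutoTokenizer
import gguf

tokenizer = AutoTokenizer.from_pretrained(dir_model)
vocab = tokenizer.get_vocab()
vocab_size = max(vocab.values()) + 1
reverse_vocab = {tok_id: tok for tok, tok_id in vocab.items()}

tokens: list[str] = []
toktypes: list[int] = []
for i in range(vocab_size):
    if i not in reverse_vocab:
        # Fill any id gaps with unused padding entries.
        tokens.append(f"[PAD{i}]")
        toktypes.append(gguf.TokenType.UNUSED)
    elif i in tokenizer.all_special_ids:
        tokens.append(reverse_vocab[i])
        toktypes.append(gguf.TokenType.CONTROL)
    else:
        tokens.append(reverse_vocab[i])
        toktypes.append(gguf.TokenType.NORMAL)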
Testing
Verified with a local conversion of a Jamba model to GGUF (tokenizer.json mode) and confirmed the generated vocab matches the llama tokenizer layout. A SentencePiece-mode GGUF was also verified and remains unaffected.
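A vocab-layout check along these lines (a sketch, assuming gguf-py is installed; "jamba.gguf" and "model_dir" are placeholder paths) could be:

from gguf import GGUFReader
from transformers import AutoTokenizer

reader = GGUFReader("jamba.gguf")
field = reader.get_field("tokenizer.ggml.tokens")
# Each entry of field.data indexes a raw-bytes part holding one token string.
gguf_tokens = [bytes(field.parts[i]).decode("utf-8") for i in field.data]

hf_vocab = AutoTokenizer.from_pretrained("model_dir").get_vocab()
reverse = {tok_id: tok for tok, tok_id in hf_vocab.items()}
for tok_id, tok in enumerate(gguf_tokens):
    # Assumes the converter pads id gaps with [PAD{id}] entries.
    assert tok == reverse.get(tok_id, f"[PAD{tok_id}]"), f"mismatch at id {tok_id}"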