Conversation


@CISC CISC commented Mar 31, 2025

The possessive quantifiers are causing weird issues, and atomic grouping does not seem to be supported either, so revert to greedy quantifiers.

See the following reports:

@bartowski1182 @nicoboss
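For context, the difference between greedy and possessive quantifiers can be sketched in plain Python (a minimal illustration of the concept, not the actual pretokenizer regex):

```python
import re

# A greedy quantifier matches as much as possible, then backtracks
# when the rest of the pattern still needs characters.
assert re.fullmatch(r"a*a", "aaa") is not None  # "a*" gives back one 'a'

# A possessive quantifier ("a*+", not used here) keeps everything it
# matched and refuses to backtrack, so "a*+a" could never match "aaa".
# Reverting the pretokenizer regex from possessive to greedy restores
# this backtracking behavior in engines that lack possessive support.
```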

@CISC CISC requested a review from ggerganov March 31, 2025 20:43
@ggerganov

Did you do tokenizer tests to make sure the results match with the reference tokenizer?


CISC commented Apr 1, 2025

Did you do tokenizer tests to make sure the results match with the reference tokenizer?

I did some basic tests, but I will admit that I'm not entirely sure how to set up a proper test, any pointers?

@ggerganov

You need to run:

python convert_hf_to_gguf_update.py <hf_token>

This will download reference tokenizers for all models to models/tokenizers and will generate test files in models/ggml-vocab-...inp/out.
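A rough sketch of how the generated .inp/.out pairs line up (the separator string and the token-id layout below are assumptions; the authoritative format is whatever convert_hf_to_gguf_update.py actually writes):

```python
# Hypothetical parser for the generated test files.
# Assumption: the .inp file contains test strings joined by a separator,
# and the .out file contains one line of space-separated token ids per
# test string.
SEP = "__ggml_vocab_test__"  # assumed separator; verify against the script

def split_cases(inp_text: str) -> list[str]:
    return inp_text.split(SEP)

def parse_expected(out_text: str) -> list[list[int]]:
    return [[int(tok) for tok in line.split()] for line in out_text.splitlines()]

# Toy example (token ids are made up):
inp_text = f"Hello world{SEP} multiple   spaces"
out_text = "15043 3186\n2999 9059"

cases = split_cases(inp_text)
expected = parse_expected(out_text)
assert len(cases) == len(expected) == 2
```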

After that, create a "vocab-only" GGUF model:

# this is for llama - update to create one for the Bailing model
python3 convert_hf_to_gguf.py models/tokenizers/llama-spm/ --outfile models/ggml-vocab-llama-spm.gguf --vocab-only

Run the test-tokenizer-0 tool using the GGUF vocab and generated test files.


CISC commented Apr 1, 2025

Run the test-tokenizer-0 tool using the GGUF vocab and generated test files.

Ah, I missed the relationship between these files; I see now. Added them for future tests.

main : tokenizing: 'ggml-vocab-ling-plus.gguf.inp'
main : text size: 1944
main : tokenized in 1.466 ms (cpp)
main : tokens: 814
main : tokens written to 'ggml-vocab-ling-plus.gguf.inp.tokcpp'

Tests passed


CISC commented Apr 1, 2025

@ggerganov gentle ping :)

@ggerganov

Nice, but I didn't mean to commit the generated test files. At some point we stopped source-controlling them because they add non-negligible data (in this case 5 MB). The idea is to just generate and test them locally.


@ggerganov ggerganov left a comment


Merge after removing the vocab files.

@CISC CISC merged commit 83a88bd into ggml-org:master Apr 2, 2025
47 checks passed
@CISC CISC deleted the fix-bailing-vocab-regex branch April 2, 2025 09:21

bartowski1182 commented Apr 3, 2025

huh, seems to be CUDA-related? If I switch to CPU only it's able to tokenize no problem...

False alarms all around, was accidentally using an older build on my GPU... ignore me :) thanks so much for the fix!
