Skip to content

Conversation

@CISC
Copy link
Collaborator

@CISC CISC commented May 25, 2025

Adds nomic-bert-moe vocab for improved test coverage of UGM tokenizer.

@github-actions github-actions bot added the testing Everything test related label May 25, 2025
@CISC CISC requested a review from ggerganov May 25, 2025 11:39
Copy link
Member

@ggerganov ggerganov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should avoid adding more vocab files because they increase the size of the repository significantly. Ideally, the tests should download these files from HF and we can avoid source-controlling them here.

We can merge this one for now.

@CISC CISC merged commit aa50ba4 into master May 25, 2025
46 checks passed
@CISC CISC deleted the cisc/test-coverage-ugm-tokenizer branch May 25, 2025 14:22
@CISC
Copy link
Collaborator Author

CISC commented May 25, 2025

In that case it would make sense to have a vocab repo in ggml-org on HF?

@ggerganov
Copy link
Member

Yes. If you (or anyone else) is interested in implementing this, I will invite you to the HF org so you can set the repo. LMK

@CISC
Copy link
Collaborator Author

CISC commented May 25, 2025

Sure, I can do that, I sent a request.

@CISC
Copy link
Collaborator Author

CISC commented May 26, 2025

@ngxson I need the common_download_file_multiple function to be able to download and cache all the files in a repo, any thoughts around exposing this (and common_download_file_single I guess)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

testing Everything test related

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants