Tokenizer optimization #917
tllmmaster
started this conversation in General
Replies: 0 comments
I am training a BPE tokenizer (32k vocab) on a 1 GB corpus for an agglutinative language (Turkmen). I noticed significant redundancy in the vocabulary (e.g., separate tokens for 'Turkmenistan', 'türkmenistan', and 'TURKMENISTAN'), which wastes valuable vocabulary slots.

I was tempted to manually prune or merge these entries in the tokenizer JSON file post-training, but I understand this could break the BPE merge rules. What is the best practice here?
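
For reference, here is a minimal sketch of the kind of setup I mean, assuming the HuggingFace `tokenizers` library (the corpus path, vocabulary size, and special tokens are placeholders). It shows the pre-training alternative I am weighing against post-hoc pruning: applying case/Unicode normalization before BPE training so case variants collapse to one surface form.

```python
# Sketch only: assumes the HuggingFace `tokenizers` library.
# "corpus.txt" and the special tokens are placeholders.
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Byte-pair-encoding model with an explicit unknown token.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

# Normalize Unicode and lowercase the text before BPE sees it,
# so 'Turkmenistan' / 'türkmenistan' / 'TURKMENISTAN' map to one form.
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFKC(),
    normalizers.Lowercase(),
])
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=["[UNK]", "[PAD]"],
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("turkmen_bpe.json")
```

Whether lowercasing is acceptable obviously depends on the downstream task, since it trades case information for vocabulary efficiency, which is part of what I am unsure about.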