Tokenizer optimization #917
tllmmaster
started this conversation in General
Replies: 0 comments
I am training a BPE tokenizer (32k vocab) on a 1 GB corpus for an agglutinative language (Turkmen). I noticed significant redundancy in the vocabulary (e.g., separate tokens for 'Turkmenistan', 'türkmenistan', and 'TURKMENISTAN'), which wastes valuable vocabulary slots.

I was tempted to manually prune or merge these entries in the tokenizer JSON file post-training, but I understand this could break the BPE merge rules. What is the best practice here?
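
For reference, here is a minimal sketch of the kind of setup I mean, assuming the HuggingFace `tokenizers` library (the corpus path, vocabulary size, and special tokens are placeholders). It shows the pre-training alternative I am weighing against post-hoc pruning: applying case/Unicode normalization before BPE training so case variants collapse to one surface form.

```python
# Sketch only: assumes the HuggingFace `tokenizers` library.
# "corpus.txt" and the special tokens are placeholders.
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Byte-pair-encoding model with an explicit unknown token.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

# Normalize Unicode and lowercase the text before BPE sees it,
# so 'Turkmenistan' / 'türkmenistan' / 'TURKMENISTAN' map to one form.
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFKC(),
    normalizers.Lowercase(),
])
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=["[UNK]", "[PAD]"],
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("turkmen_bpe.json")
```

Whether lowercasing is acceptable obviously depends on the downstream task, since it trades case information for vocabulary efficiency, which is part of what I am unsure about.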