Open
SatishWG
reviewed
Feb 12, 2026
pmgarg
approved these changes
Feb 25, 2026
rraghu214
approved these changes
Feb 25, 2026
2) Selected tokenizer metrics evaluation code has been added. 3) README.md has been updated
- Modified `build_clean_tokenizer.py` targeting 131,046 base tokens to accommodate 26 post-processing special tokens, hitting exactly 131,072.
- Updated `README.md` with new token counts, structure table, correct special token IDs (131046+), and added reproduction instructions.
- Fixed path separator issue in `build_clean_tokenizer.py` for Windows compatibility.
- Regenerated tokenizer artifacts.
- Renamed `gptoss_pruning` to `tsai_131k_tokenizer`
- Restructured layout: gptoss tokens first (IDs 0-130715), special tokens at end (IDs 130716-131071)
- Merged base (80) + additional (26) + reserved (250) = 356 special tokens
- Updated `build_clean_tokenizer.py` with relative paths
- Added `special_tokens.py` for token definitions
- Removed `add_special_tokens.py` (merged into build script)
- Updated README with new structure
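The ID ranges in this commit follow directly from the stated counts. A minimal sketch of that arithmetic is below; the constant and function names are hypothetical and are not taken from `special_tokens.py` or the build script.

```python
# Hypothetical constants mirroring the counts stated in the commit message.
VOCAB_SIZE = 2 ** 17  # 131,072 total vocabulary slots
SPECIAL_COUNTS = {"base": 80, "additional": 26, "reserved": 250}


def special_token_range(vocab_size, special_counts):
    """Return (first_id, last_id) of the special-token block placed at the end."""
    n_special = sum(special_counts.values())
    if n_special > vocab_size:
        raise ValueError("more special tokens than vocabulary slots")
    first_id = vocab_size - n_special
    return first_id, vocab_size - 1


first_id, last_id = special_token_range(VOCAB_SIZE, SPECIAL_COUNTS)
n_regular = first_id  # regular tokens occupy IDs 0 .. first_id - 1
print(f"regular: 0-{first_id - 1}, special: {first_id}-{last_id}")
```

With 356 special tokens at the end of a 131,072-slot vocabulary, the special block starts at ID 130716 and the regular tokens fill IDs 0-130715, which matches the layout described above.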
2) Kronecker embeddings logic is added
- Add clear reproduction steps for generating the tokenizer and embeddings.
…um script
- Replaced tail truncation with automatic PAD token backfilling across all target scripts.
- Added explicit `--drop-remainder` CLI flag with rigid dropped-token logging.
- Isolated `tokenize_curriculum.py` into a dedicated `scripts/curriculum_tokenizer/` nested directory.
- Added comprehensive EC2 execution and SSH instructions to a localized `README.md`.
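The difference between PAD backfilling and dropping the remainder can be illustrated with a small packing function. This is only a sketch under assumptions: `pack_tokens`, its parameters, and the pad ID used in the example are hypothetical and do not reproduce the actual logic in `tokenize_curriculum.py`.

```python
from typing import List, Tuple


def pack_tokens(
    token_ids: List[int],
    seq_len: int,
    pad_id: int,
    drop_remainder: bool = False,
) -> Tuple[List[List[int]], int]:
    """Split token_ids into rows of exactly seq_len tokens.

    Returns (rows, n_dropped). By default the short final row is backfilled
    with pad_id; with drop_remainder=True it is discarded and counted.
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    n_full = len(token_ids) // seq_len
    rows = [token_ids[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]
    tail = token_ids[n_full * seq_len:]
    n_dropped = 0
    if tail:
        if drop_remainder:
            n_dropped = len(tail)  # caller can log this count
        else:
            rows.append(tail + [pad_id] * (seq_len - len(tail)))
    return rows, n_dropped


padded_rows, _ = pack_tokens([1, 2, 3, 4, 5], seq_len=2, pad_id=0)
kept_rows, dropped = pack_tokens([1, 2, 3, 4, 5], seq_len=2, pad_id=0, drop_remainder=True)
```

Returning the dropped count alongside the rows keeps the logging decision in the caller, so a CLI wrapper could report it whenever the flag is set.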
Tokenizer Audit and Fixes (8 changes):
1. Fix bos_token in tokenizer_config.json: stale startoftext from base GPT-OSS, now begin_of_text
2. Fix eos_token in tokenizer_config.json: stale return from base GPT-OSS, now end_of_text
3. Fix pad_token in tokenizer_config.json: stale endoftext, now dedicated pad token
4. Fix pad_token in special_tokens_map.json: was same as EOS, now dedicated pad token
5. Add unk_token to special_tokens_map.json: was missing entirely
6. Add unk_token to tokenizer_config.json: was missing entirely
7. Fix build_clean_tokenizer.py special_tokens_map output: pad_token and unk_token
8. Fix build_clean_tokenizer.py tokenizer_config output: add bos/eos/pad/unk overrides

Build Script Hardening:
- build_clean_tokenizer.py now explicitly sets bos/eos/pad/unk in tokenizer_config.json
- Prevents regression if tokenizer is rebuilt from base GPT-OSS source

Audit Infrastructure:
- Add tokenizer_audit/ folder with audit_tokenizer.py and audit_deep.py
- Main audit: 87 PASS, 0 FAIL, 0 WARN
- Deep audit: 0 issues (cross-file consistency, ID integrity, build script checks)
- Rebuilt tokenizer from fixed build script to confirm clean output
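A cross-file consistency check of the kind listed above might look like the following sketch. It is not the code in `audit_tokenizer.py` or `audit_deep.py`; the function names and the placeholder token strings (`<bos>`, `<eos>`, `<pad>`, `<unk>`) are hypothetical. It assumes each special-token entry is either a plain string or a dict with a `content` field, and that both JSON files have already been loaded into dicts.

```python
from typing import Dict, List

REQUIRED_KEYS = ("bos_token", "eos_token", "pad_token", "unk_token")


def token_text(value) -> str:
    """Normalize an entry that may be a string or a dict with a 'content' field."""
    if isinstance(value, dict):
        return str(value.get("content", ""))
    return str(value)


def check_special_tokens(tokenizer_config: Dict, special_tokens_map: Dict) -> List[str]:
    """Return a list of human-readable issues; an empty list means the check passed."""
    files = {
        "tokenizer_config.json": tokenizer_config,
        "special_tokens_map.json": special_tokens_map,
    }
    issues = []
    for key in REQUIRED_KEYS:
        seen = {}
        for name, data in files.items():
            if data.get(key) is None:
                issues.append(f"{name}: {key} is missing")
            else:
                seen[name] = token_text(data[key])
        if len(set(seen.values())) > 1:
            issues.append(f"{key} differs between files: {seen}")
    for name, data in files.items():
        pad, eos = data.get("pad_token"), data.get("eos_token")
        if pad is not None and eos is not None and token_text(pad) == token_text(eos):
            issues.append(f"{name}: pad_token is the same as eos_token")
    return issues


# Placeholder token strings for illustration only.
good = {"bos_token": "<bos>", "eos_token": "<eos>", "pad_token": "<pad>", "unk_token": "<unk>"}
stale_map = {"bos_token": "<bos>", "eos_token": "<eos>", "pad_token": {"content": "<eos>"}}
clean_issues = check_special_tokens(good, dict(good))
stale_issues = check_special_tokens(good, stale_map)
```

In this example `stale_map` reproduces two of the problems from the list: a pad token that reuses EOS and a missing `unk_token`.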
e3aecae to 37caa87
- Added explicit filtering in vocabulary construction to remove tokens containing broken UTF-8 replacement characters (U+FFFD).
- Blocked Private Use Area (PUA) codepoints (U+E000-U+F8FF) from being included in the tokenizer.
- Reclaimed 25 parameter slots to preserve higher-quality general tokens while strictly maintaining the 131,072 (2^17) vocabulary budget.
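The two filters described above can be sketched as a single predicate over decoded token strings. The names below are hypothetical, and the sketch assumes candidate tokens are already available as Python strings; how the build script actually decodes and ranks candidates is not shown in this commit message.

```python
from typing import Iterable, List

REPLACEMENT_CHAR = "\ufffd"          # U+FFFD, emitted when bytes fail to decode as UTF-8
PUA_START, PUA_END = 0xE000, 0xF8FF  # Private Use Area in the Basic Multilingual Plane


def is_clean_token(text: str) -> bool:
    """True if the token has no replacement character and no PUA codepoint."""
    if REPLACEMENT_CHAR in text:
        return False
    return not any(PUA_START <= ord(ch) <= PUA_END for ch in text)


def filter_vocab(tokens: Iterable[str]) -> List[str]:
    """Keep only clean tokens, preserving their original order."""
    return [tok for tok in tokens if is_clean_token(tok)]


candidates = ["hello", "bad\ufffd", "\ue000x", "नमस्ते", "\uf8ff"]
kept = filter_vocab(candidates)
```

Slots freed by rejected candidates can then be refilled from the next-best tokens, so the total stays at the fixed vocabulary budget.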
…_131k_tokenizer_hybrid