P06 tokenizer design lab #461

Open

AsangCode wants to merge 37 commits into staging from P06_Tokenizer_Design_Lab

Conversation

@AsangCode
Contributor

@AsangCode AsangCode commented Feb 11, 2026

Pull Request Template

Description

Please include a summary of the changes and the related issue, and highlight any key points reviewers should focus on.

Checklist

  • I have added tests that prove my fix is effective or that my feature works.
  • I have added necessary documentation (if applicable).
  • My code follows the style guidelines, gitflow branching strategy, and naming conventions of this project: [Contribution Guidelines](https://github.com/The-School-of-AI/LLM/tree/main/experiments/)

Reviewers

  • Reviewer 1: A member from your own team.
  • Reviewer 2: A member from the repo owners team (@The-School-of-AI/llm-repo-owners).

Note: Every pull request requires at least 2 reviewers/approvers before it can be merged.

@SatishWG SatishWG left a comment

Code looks OK. Check the pre-commit failures.

@AsangCode AsangCode requested review from cydal and rraghu214 and removed request for chethan180, gzagarwal and roulupen February 24, 2026 16:28
@AsangCode AsangCode self-assigned this Feb 24, 2026
@AsangCode AsangCode added the P06 Team 6 label Feb 24, 2026
@AsangCode AsangCode requested a review from pmgarg February 24, 2026 19:22
AsangCode and others added 14 commits February 26, 2026 01:26
2) Selected tokenizer metrics evaluation code has been added.
3) README.md has been updated
- Modified `build_clean_tokenizer.py`, targeting 131,046 base tokens to accommodate 26 post-processing special tokens, hitting exactly 131,072.
- Updated `README.md` with new token counts, a structure table, correct special token IDs (131046+), and reproduction instructions.
- Fixed path separator issue in `build_clean_tokenizer.py` for Windows compatibility.
- Regenerated tokenizer artifacts.
- Renamed gptoss_pruning to tsai_131k_tokenizer
- Restructured layout: gptoss tokens first (IDs 0-130715), special tokens at end (IDs 130716-131071)
- Merged base (80) + additional (26) + reserved (250) = 356 special tokens
- Updated build_clean_tokenizer.py with relative paths
- Added special_tokens.py for token definitions
- Removed add_special_tokens.py (merged into build script)
- Updated README with new structure
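The layout and budget described above can be sanity-checked with a few lines of arithmetic. This is a hedged sketch based only on the numbers quoted in this thread (base tokens at IDs 0-130715, special tokens at IDs 130716-131071, merged from 80 base + 26 additional + 250 reserved), not on the PR's actual build script:

```python
# Vocabulary budget check for the tsai_131k_tokenizer layout described above.
# All constants are taken from the commit messages, not from the real code.
GPTOSS_TOKENS = 130_716          # base gptoss tokens occupy IDs 0..130715
SPECIAL_BASE = 80
SPECIAL_ADDITIONAL = 26
SPECIAL_RESERVED = 250

special_total = SPECIAL_BASE + SPECIAL_ADDITIONAL + SPECIAL_RESERVED
vocab_size = GPTOSS_TOKENS + special_total

assert special_total == 356                  # merged special-token count
assert vocab_size == 131_072 == 2 ** 17      # strict power-of-two budget

first_special_id = GPTOSS_TOKENS             # 130716
last_special_id = vocab_size - 1             # 131071
```

The two commits quote slightly different splits (131,046 base + 26 special earlier, 130,716 base + 356 special here); both sum to the same 2^17 budget, which is the invariant the build script enforces.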
2) Kronecker embeddings logic is added
- Add clear reproduction steps for generating the tokenizer and embeddings.
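The "Kronecker embeddings logic" is not shown in this thread. As a hedged illustration of the general technique, an embedding matrix can be factored as a Kronecker product of two much smaller matrices, and any single row reconstructed on the fly without materializing the full table. All names and shapes below are assumptions for illustration, not the PR's actual code:

```python
import numpy as np

# Factor a (vocab, dim) embedding as kron(A, B) with A of shape (v1, d1)
# and B of shape (v2, d2), where v1 * v2 >= vocab and d1 * d2 == dim.
# Parameter count drops from vocab * dim to v1 * d1 + v2 * d2.
v1, d1 = 512, 16
v2, d2 = 256, 48                 # v1 * v2 = 131072 rows, d1 * d2 = 768 dims

rng = np.random.default_rng(0)
A = rng.standard_normal((v1, d1)).astype(np.float32)
B = rng.standard_normal((v2, d2)).astype(np.float32)

def embed(token_id: int) -> np.ndarray:
    """Row token_id of kron(A, B), computed lazily from one row of each factor."""
    i, j = divmod(token_id, v2)          # which rows of A and B this id maps to
    return np.kron(A[i], B[j])           # shape (d1 * d2,)

vec = embed(130_715)
assert vec.shape == (d1 * d2,)
```

Row `i * v2 + j` of `kron(A, B)` equals `kron(A[i], B[j])`, which is what makes the lazy per-token lookup valid.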
…um script

- Replaced tail truncation with automatic PAD-token backfilling across all target scripts.
- Added an explicit `--drop-remainder` CLI flag with strict dropped-token logging.
- Isolated `tokenize_curriculum.py` into a dedicated `scripts/curriculum_tokenizer/` directory.
- Added comprehensive EC2 execution and SSH instructions to a local `README.md`.
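The backfill-versus-drop behavior described above can be sketched in a few lines. This is a minimal illustration of the idea, assuming a hypothetical `PAD_ID` and function name; the real pad ID and CLI wiring live in the tokenizer config and target scripts:

```python
from typing import List

PAD_ID = 131_046  # hypothetical pad token id for illustration only

def chunk_tokens(ids: List[int], block: int, drop_remainder: bool = False) -> List[List[int]]:
    """Split a token stream into fixed-size blocks.

    By default the final partial block is backfilled with PAD tokens
    (replacing the old tail-truncation behavior); with drop_remainder=True
    the partial block is dropped and the loss is logged.
    """
    chunks = [ids[i:i + block] for i in range(0, len(ids), block)]
    if chunks and len(chunks[-1]) < block:
        if drop_remainder:
            dropped = chunks.pop()
            print(f"dropped {len(dropped)} trailing tokens")
        else:
            chunks[-1] = chunks[-1] + [PAD_ID] * (block - len(chunks[-1]))
    return chunks
```

For a 10-token stream and block size 4, the default yields three blocks with the last one padded to length 4, while `--drop-remainder` yields two full blocks and logs the two dropped tokens.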
Tokenizer Audit and Fixes (8 changes):
1. Fix `bos_token` in tokenizer_config.json: stale `startoftext` from base GPT-OSS, now `begin_of_text`
2. Fix `eos_token` in tokenizer_config.json: stale `return` from base GPT-OSS, now `end_of_text`
3. Fix `pad_token` in tokenizer_config.json: stale `endoftext`, now a dedicated pad token
4. Fix `pad_token` in special_tokens_map.json: was the same as EOS, now a dedicated pad token
5. Add `unk_token` to special_tokens_map.json: was missing entirely
6. Add `unk_token` to tokenizer_config.json: was missing entirely
7. Fix build_clean_tokenizer.py special_tokens_map output: `pad_token` and `unk_token`
8. Fix build_clean_tokenizer.py tokenizer_config output: add bos/eos/pad/unk overrides

Build Script Hardening:
- build_clean_tokenizer.py now explicitly sets bos/eos/pad/unk in tokenizer_config.json
- Prevents regression if tokenizer is rebuilt from base GPT-OSS source
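The hardening step amounts to writing the four special-token fields explicitly rather than inheriting stale values from the base GPT-OSS config. A minimal sketch, assuming the token strings below (the real strings and any surrounding delimiters come from the project's `special_tokens.py`):

```python
import json

# Explicit special-token overrides, written unconditionally on every rebuild
# so a rebuild from the base GPT-OSS source cannot reintroduce stale values.
# Token strings here are assumptions for illustration.
overrides = {
    "bos_token": "begin_of_text",
    "eos_token": "end_of_text",
    "pad_token": "pad",   # dedicated pad token, no longer aliased to EOS
    "unk_token": "unk",   # previously missing entirely
}

config = {"tokenizer_class": "PreTrainedTokenizerFast"}
config.update(overrides)
print(json.dumps(config, indent=2))
```

Writing the overrides last, after copying the base config, is what makes the fix regression-proof: the rebuilt `tokenizer_config.json` no longer depends on what the base source happened to contain.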

Audit Infrastructure:
- Add tokenizer_audit/ folder with audit_tokenizer.py and audit_deep.py
- Main audit: 87 PASS, 0 FAIL, 0 WARN
- Deep audit: 0 issues (cross-file consistency, ID integrity, build script checks)
- Rebuilt tokenizer from fixed build script to confirm clean output
@AsangCode AsangCode force-pushed the P06_Tokenizer_Design_Lab branch from e3aecae to 37caa87 Compare February 25, 2026 20:00
@pankaj1311 pankaj1311 mentioned this pull request Feb 27, 2026
5 tasks
- Added explicit filtering in vocabulary construction to remove tokens
  containing broken UTF-8 replacement characters (U+FFFD).
- Blocked Private Use Area (PUA) codepoints (U+E000-U+F8FF) from being
  included in the tokenizer.
- Reclaimed 25 parameter slots to preserve higher-quality general tokens
  while strictly maintaining the 131,072 (2^17) vocabulary budget.
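The filtering described above can be expressed as a simple per-token predicate over codepoints. A hedged sketch (the function name and the sample vocabulary are illustrative, not the PR's actual code):

```python
REPLACEMENT_CHAR = "\ufffd"              # U+FFFD, marks broken UTF-8 decodes
PUA_START, PUA_END = 0xE000, 0xF8FF      # BMP Private Use Area range

def is_clean(token: str) -> bool:
    """Reject tokens containing U+FFFD or any Private Use Area codepoint."""
    for ch in token:
        if ch == REPLACEMENT_CHAR:
            return False
        if PUA_START <= ord(ch) <= PUA_END:
            return False
    return True

# Each rejected token frees one slot in the fixed 131,072-entry budget,
# which is then reclaimed for a higher-quality general token.
vocab = ["hello", "wor\ufffdld", "\ue000glyph", "token"]
clean = [t for t in vocab if is_clean(t)]
```

Here `clean` keeps only `"hello"` and `"token"`; the two rejected entries correspond to the 25 reclaimed slots mentioned above.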

Labels

P06 Team 6

5 participants