P06 tokenizer design lab #461

Open

AsangCode wants to merge 37 commits into staging from P06_Tokenizer_Design_Lab

Conversation

@AsangCode
Contributor

@AsangCode AsangCode commented Feb 11, 2026

Pull Request Template

Description

Please include a summary of the changes and the related issue, and highlight any key points reviewers should focus on.

Checklist

  • I have added tests that prove my fix is effective or that my feature works.
  • I have added necessary documentation (if applicable).
  • My code follows the style guidelines, gitflow branching strategy, and naming conventions of this project: [Contribution Guidelines](https://github.com/The-School-of-AI/LLM/tree/main/experiments/)

Reviewers

  • Reviewer 1: A member from your own team.
  • Reviewer 2: A member from the repo owners team (@The-School-of-AI/llm-repo-owners).

Note: Every pull request requires at least 2 reviewers/approvers before it can be merged.

@SatishWG SatishWG left a comment

Code looks OK. Check the pre-commit failures.

@AsangCode AsangCode requested review from cydal and rraghu214 and removed request for chethan180, gzagarwal and roulupen February 24, 2026 16:28
@AsangCode AsangCode self-assigned this Feb 24, 2026
@AsangCode AsangCode added the P06 Team 6 label Feb 24, 2026
@AsangCode AsangCode requested a review from pmgarg February 24, 2026 19:22
AsangCode and others added 14 commits February 26, 2026 01:26
2) Selected tokenizer metrics evaluation code has been added.
3) README.md has been updated
- Modified `build_clean_tokenizer.py`, targeting 131,046 base tokens to accommodate 26 post-processing special tokens, hitting exactly 131,072.
- Updated `README.md` with new token counts, a structure table, correct special token IDs (131046+), and reproduction instructions.
- Fixed path separator issue in `build_clean_tokenizer.py` for Windows compatibility.
- Regenerated tokenizer artifacts.
- Renamed gptoss_pruning to tsai_131k_tokenizer
- Restructured layout: gptoss tokens first (IDs 0-130715), special tokens at end (IDs 130716-131071)
- Merged base (80) + additional (26) + reserved (250) = 356 special tokens
- Updated build_clean_tokenizer.py with relative paths
- Added special_tokens.py for token definitions
- Removed add_special_tokens.py (merged into build script)
- Updated README with new structure
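The layout and budget described above can be sanity-checked with a few lines of arithmetic. This is a hedged sketch based only on the numbers quoted in this thread (base tokens at IDs 0-130715, special tokens at IDs 130716-131071, merged from 80 base + 26 additional + 250 reserved), not on the PR's actual build script:

```python
# Vocabulary budget check for the tsai_131k_tokenizer layout described above.
# All constants are taken from the commit messages, not from the real code.
GPTOSS_TOKENS = 130_716          # base gptoss tokens occupy IDs 0..130715
SPECIAL_BASE = 80
SPECIAL_ADDITIONAL = 26
SPECIAL_RESERVED = 250

special_total = SPECIAL_BASE + SPECIAL_ADDITIONAL + SPECIAL_RESERVED
vocab_size = GPTOSS_TOKENS + special_total

assert special_total == 356                  # merged special-token count
assert vocab_size == 131_072 == 2 ** 17      # strict power-of-two budget

first_special_id = GPTOSS_TOKENS             # 130716
last_special_id = vocab_size - 1             # 131071
```

The two commits quote slightly different splits (131,046 base + 26 special earlier, 130,716 base + 356 special here); both sum to the same 2^17 budget, which is the invariant the build script enforces.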
2) Kronecker embeddings logic is added
- Add clear reproduction steps for generating the tokenizer and embeddings.
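The "Kronecker embeddings logic" is not shown in this thread. As a hedged illustration of the general technique, an embedding matrix can be factored as a Kronecker product of two much smaller matrices, and any single row reconstructed on the fly without materializing the full table. All names and shapes below are assumptions for illustration, not the PR's actual code:

```python
import numpy as np

# Factor a (vocab, dim) embedding as kron(A, B) with A of shape (v1, d1)
# and B of shape (v2, d2), where v1 * v2 >= vocab and d1 * d2 == dim.
# Parameter count drops from vocab * dim to v1 * d1 + v2 * d2.
v1, d1 = 512, 16
v2, d2 = 256, 48                 # v1 * v2 = 131072 rows, d1 * d2 = 768 dims

rng = np.random.default_rng(0)
A = rng.standard_normal((v1, d1)).astype(np.float32)
B = rng.standard_normal((v2, d2)).astype(np.float32)

def embed(token_id: int) -> np.ndarray:
    """Row token_id of kron(A, B), computed lazily from one row of each factor."""
    i, j = divmod(token_id, v2)          # which rows of A and B this id maps to
    return np.kron(A[i], B[j])           # shape (d1 * d2,)

vec = embed(130_715)
assert vec.shape == (d1 * d2,)
```

Row `i * v2 + j` of `kron(A, B)` equals `kron(A[i], B[j])`, which is what makes the lazy per-token lookup valid.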
…um script

- Replaced tail truncation with automatic PAD-token backfilling across all target scripts.
- Added an explicit `--drop-remainder` CLI flag with strict dropped-token logging.
- Isolated `tokenize_curriculum.py` into a dedicated `scripts/curriculum_tokenizer/` directory.
- Added comprehensive EC2 execution and SSH instructions to a local `README.md`.
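The backfill-versus-drop behavior described above can be sketched in a few lines. This is a minimal illustration of the idea, assuming a hypothetical `PAD_ID` and function name; the real pad ID and CLI wiring live in the tokenizer config and target scripts:

```python
from typing import List

PAD_ID = 131_046  # hypothetical pad token id for illustration only

def chunk_tokens(ids: List[int], block: int, drop_remainder: bool = False) -> List[List[int]]:
    """Split a token stream into fixed-size blocks.

    By default the final partial block is backfilled with PAD tokens
    (replacing the old tail-truncation behavior); with drop_remainder=True
    the partial block is dropped and the loss is logged.
    """
    chunks = [ids[i:i + block] for i in range(0, len(ids), block)]
    if chunks and len(chunks[-1]) < block:
        if drop_remainder:
            dropped = chunks.pop()
            print(f"dropped {len(dropped)} trailing tokens")
        else:
            chunks[-1] = chunks[-1] + [PAD_ID] * (block - len(chunks[-1]))
    return chunks
```

For a 10-token stream and block size 4, the default yields three blocks with the last one padded to length 4, while `--drop-remainder` yields two full blocks and logs the two dropped tokens.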
Tokenizer Audit and Fixes (8 changes):
1. Fix `bos_token` in tokenizer_config.json: stale `startoftext` from base GPT-OSS, now `begin_of_text`
2. Fix `eos_token` in tokenizer_config.json: stale `return` from base GPT-OSS, now `end_of_text`
3. Fix `pad_token` in tokenizer_config.json: stale `endoftext`, now a dedicated pad token
4. Fix `pad_token` in special_tokens_map.json: was the same as EOS, now a dedicated pad token
5. Add `unk_token` to special_tokens_map.json: was missing entirely
6. Add `unk_token` to tokenizer_config.json: was missing entirely
7. Fix build_clean_tokenizer.py special_tokens_map output: `pad_token` and `unk_token`
8. Fix build_clean_tokenizer.py tokenizer_config output: add bos/eos/pad/unk overrides

Build Script Hardening:
- build_clean_tokenizer.py now explicitly sets bos/eos/pad/unk in tokenizer_config.json
- Prevents regression if tokenizer is rebuilt from base GPT-OSS source
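The hardening step amounts to writing the four special-token fields explicitly rather than inheriting stale values from the base GPT-OSS config. A minimal sketch, assuming the token strings below (the real strings and any surrounding delimiters come from the project's `special_tokens.py`):

```python
import json

# Explicit special-token overrides, written unconditionally on every rebuild
# so a rebuild from the base GPT-OSS source cannot reintroduce stale values.
# Token strings here are assumptions for illustration.
overrides = {
    "bos_token": "begin_of_text",
    "eos_token": "end_of_text",
    "pad_token": "pad",   # dedicated pad token, no longer aliased to EOS
    "unk_token": "unk",   # previously missing entirely
}

config = {"tokenizer_class": "PreTrainedTokenizerFast"}
config.update(overrides)
print(json.dumps(config, indent=2))
```

Writing the overrides last, after copying the base config, is what makes the fix regression-proof: the rebuilt `tokenizer_config.json` no longer depends on what the base source happened to contain.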

Audit Infrastructure:
- Add tokenizer_audit/ folder with audit_tokenizer.py and audit_deep.py
- Main audit: 87 PASS, 0 FAIL, 0 WARN
- Deep audit: 0 issues (cross-file consistency, ID integrity, build script checks)
- Rebuilt tokenizer from fixed build script to confirm clean output
@AsangCode AsangCode force-pushed the P06_Tokenizer_Design_Lab branch from e3aecae to 37caa87 Compare February 25, 2026 20:00
@pankaj1311 pankaj1311 mentioned this pull request Feb 27, 2026
5 tasks
- Added explicit filtering in vocabulary construction to remove tokens
  containing broken UTF-8 replacement characters (U+FFFD).
- Blocked Private Use Area (PUA) codepoints (U+E000-U+F8FF) from being
  included in the tokenizer.
- Reclaimed 25 parameter slots to preserve higher-quality general tokens
  while strictly maintaining the 131,072 (2^17) vocabulary budget.
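The filtering described above can be expressed as a simple per-token predicate over codepoints. A hedged sketch (the function name and the sample vocabulary are illustrative, not the PR's actual code):

```python
REPLACEMENT_CHAR = "\ufffd"              # U+FFFD, marks broken UTF-8 decodes
PUA_START, PUA_END = 0xE000, 0xF8FF      # BMP Private Use Area range

def is_clean(token: str) -> bool:
    """Reject tokens containing U+FFFD or any Private Use Area codepoint."""
    for ch in token:
        if ch == REPLACEMENT_CHAR:
            return False
        if PUA_START <= ord(ch) <= PUA_END:
            return False
    return True

# Each rejected token frees one slot in the fixed 131,072-entry budget,
# which is then reclaimed for a higher-quality general token.
vocab = ["hello", "wor\ufffdld", "\ue000glyph", "token"]
clean = [t for t in vocab if is_clean(t)]
```

Here `clean` keeps only `"hello"` and `"token"`; the two rejected entries correspond to the 25 reclaimed slots mentioned above.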

Labels

P06 Team 6

5 participants