Changes from all commits (37 commits)
0aeb9a7
1) Tokenizer selection code has been added.
AsangCode Feb 7, 2026
7ae86b2
README.md has been updated.
AsangCode Feb 7, 2026
7cfd546
ncert is put under datasets folder
AsangCode Feb 7, 2026
d34be9e
ncert outside datasets has been removed
AsangCode Feb 7, 2026
43c4157
feat(tokenizer): expand vocabulary to 131,072 (2^17) tokens
AsangCode Feb 9, 2026
dc6860b
Refactor tokenizer: rename to tsai_131k_tokenizer, special tokens at end
chethan180 Feb 10, 2026
bc50056
1) Code clean up added
AsangCode Feb 11, 2026
3786c0d
Rename Tokenizer_metrics to tokenizer_metrics (case fix)
AsangCode Feb 11, 2026
207e694
- Update [README.md]
AsangCode Feb 11, 2026
1c7d236
added requirements.txt and installation instructions
AsangCode Feb 11, 2026
020757a
add directory structure to README
AsangCode Feb 11, 2026
72b94af
KRONECKER_TEST_REPORT.md is converted to KRONECKER_TEST_REPORT.txt
AsangCode Feb 11, 2026
9680425
Fix typo in README.md for tokenizer file description
AsangCode Feb 11, 2026
991478b
Fix formatting in README.md file structure
AsangCode Feb 11, 2026
e1f9564
gptoss_kronecker_config.json has been removed
AsangCode Feb 11, 2026
7e46e34
feat(tokenizer): enforce pure row-to-eos mapping and isolate curricul…
AsangCode Feb 21, 2026
d3c7c9c
feat(local): support local dataset processing without S3/AWS dependen…
AsangCode Feb 21, 2026
c3662e1
fix: tokenizer config audit fixes and build script hardening
AsangCode Feb 24, 2026
9559c8b
README.md has been added for tokenizer_audit
AsangCode Feb 24, 2026
3b928c9
fix: resolve pre-commit lint errors (F401, F541, E402, E722, E741, F841)
AsangCode Feb 24, 2026
870c585
fix: remove remaining f-string without placeholder (F541)
AsangCode Feb 24, 2026
44cb338
fix: remove duplicate vocab keys in gemma_tokenizer.json
AsangCode Feb 24, 2026
af3bac7
style: apply black and isort formatting
AsangCode Feb 24, 2026
d979c86
style: apply black==24.8.0 and isort==5.13.2 --profile=black formatting
AsangCode Feb 24, 2026
37caa87
docs: add setup scripts for git pre-commit hooks
AsangCode Feb 24, 2026
42f1013
Enhance tokenizer quality by pruning garbage U+FFFD and PUA tokens
AsangCode Mar 1, 2026
b268ceb
Tokenization script fixed and tested locally
rraghu214 Mar 4, 2026
9815318
Updated readme to run the tokenization script
rraghu214 Mar 4, 2026
777712e
Observations and analysis
rraghu214 Mar 5, 2026
6efda31
Updated analysis doc
rraghu214 Mar 5, 2026
735bd10
added new shards folder format
pmgarg Mar 5, 2026
1c98a7d
added proposed method parallelism with old directory than auto to new…
pmgarg Mar 5, 2026
240e476
resolved interruption issue
pmgarg Mar 5, 2026
1f3b71e
Fixed issues with parallel processing & halt-resume scenario
rraghu214 Mar 6, 2026
f6c2868
adding tests and updating AWS analysis in Tokenization_Pipeline_Revie…
rraghu214 Mar 6, 2026
4b32a6e
Merging tokenizer_validation branch tests and Rohan tokenizer as tsai…
rraghu214 Mar 7, 2026
98bacbf
fixed garbage tokens verbiage in comparison report
rraghu214 Mar 7, 2026
2 changes: 2 additions & 0 deletions .gitattributes
@@ -0,0 +1,2 @@
# Force LF for tokenizer files to ensure consistent hashing across platforms
experiments/**/tsai_131k_tokenizer/*.json text eol=lf
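
Forcing LF matters because a CRLF checkout would change the raw bytes of `tokenizer.json` and therefore its hash. A minimal sketch of the kind of cross-platform check this enables (the `hashlib` snippet is illustrative, not part of this PR):

```python
import hashlib

# With eol=lf enforced by .gitattributes, this digest should be identical
# on Windows, Linux, and macOS checkouts of the same commit.
with open("experiments/6_tokenizer_design_lab/tsai_131k_tokenizer/tokenizer.json", "rb") as f:
    print(hashlib.sha256(f.read()).hexdigest())
```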
1 change: 1 addition & 0 deletions .gitignore
@@ -141,6 +141,7 @@ logs/

*.pt
*.pth
*.parquet

# macOS specific files
*.DS_Store
4 changes: 4 additions & 0 deletions experiments/6_tokenizer_design_lab/.gitignore
@@ -0,0 +1,4 @@
venv/
venv
.venv/
__pycache__/
115 changes: 115 additions & 0 deletions experiments/6_tokenizer_design_lab/README.md
@@ -0,0 +1,115 @@
# TSAI 131K Tokenizer

## Overview

This directory contains the **TSAI 131K Tokenizer**, a GPToss tokenizer pruned to a 131,072-token (2^17) vocabulary while retaining Indic language support.
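
A quick sanity check of the vocabulary size (a minimal sketch, assuming the files in `tsai_131k_tokenizer/` load via Hugging Face `transformers` and that special tokens are counted within the 131,072):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./tsai_131k_tokenizer")

# The pruned vocabulary should contain exactly 2**17 entries.
assert len(tokenizer) == 131_072 == 2**17
```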

## Directory Structure

```text
.
├── tsai_131k_tokenizer/          # Generated tokenizer files
│   ├── tokenizer.json            # The pruned 131K tokenizer
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
├── kronecker_embeddings/         # Kronecker embedding scripts & docs
│   ├── convert_tokenizer_to_kronecker.py
│   └── README.md
├── tokenizer_metrics/            # Evaluation metrics and graphs
├── build_clean_tokenizer.py      # Script to build the tokenizer
├── special_tokens.py             # Special token definitions
├── requirements.txt              # Python dependencies
└── README.md
```

## Installation

To set up the environment, use a virtual environment:

```bash
# Create virtual environment
python -m venv venv

# Activate virtual environment
# Windows:
.\venv\Scripts\activate
# Linux/Mac:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Install Git Pre-commit Hooks (Required)
# This ensures all code is formatted properly before committing.
# Windows:
.\setup_dev.bat
# Linux/Mac:
bash setup_dev.sh
```

## Reproduction

To regenerate the tokenizer and embeddings from scratch:

1. **Build the tokenizer**:

   ```bash
   python build_clean_tokenizer.py
   ```

   This will generate the tokenizer in `tsai_131k_tokenizer/`.

2. **Generate Kronecker Embeddings**:

   ```bash
   cd kronecker_embeddings
   python convert_tokenizer_to_kronecker.py --tokenizer ../tsai_131k_tokenizer/tokenizer.json --output-dir .
   ```

   This will generate `gptoss_kronecker_embeddings.pt` (and `.npy`) in the `kronecker_embeddings/` directory; the sketch below illustrates the underlying Kronecker idea.
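
The conversion script is the source of truth for how the embeddings are built; as a rough illustration of the Kronecker idea only (the factor shapes below are assumptions, not the script's actual configuration), a large embedding table can be represented as the Kronecker product of two small factor matrices:

```python
import torch

# Illustrative factors: (512 x 128) kron (256 x 64) -> (131072 x 8192).
# Storing A and B costs ~82K parameters instead of ~1.07B for the full table.
A = torch.randn(512, 128)
B = torch.randn(256, 64)

def kron_row(token_id: int) -> torch.Tensor:
    """Row `token_id` of kron(A, B), computed without materializing the full table."""
    p = B.shape[0]
    return torch.kron(A[token_id // p], B[token_id % p])

print(kron_row(42).shape)  # torch.Size([8192])
```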

## Usage

The following example demonstrates how to load the tokenizer and the corresponding Kronecker embeddings together:

```python
import os

import torch
from transformers import AutoTokenizer

# 1. Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained("./tsai_131k_tokenizer")

# Test encoding
text = "Hello, यह एक परीक्षण है"
tokens = tokenizer.encode(text)
print(f"Tokens: {tokens}")

# 2. Load Kronecker Embeddings
# These embeddings map each token ID directly to a vector (e.g., 8192-dim)
embeddings_path = "kronecker_embeddings/gptoss_kronecker_embeddings.pt"
if os.path.exists(embeddings_path):
    embeddings = torch.load(embeddings_path)
    print(f"Loaded embeddings: {embeddings.shape}")

    # Get embedding for a specific token
    token_id = tokens[0]
    token_emb = embeddings[token_id]
    print(f"Embedding for token {token_id}: {token_emb.shape}")
else:
    print(f"Embeddings file not found at {embeddings_path}. Please run the reproduction steps.")
```

## Metrics & Performance

The following graphs summarize the performance of the tokenizer across different domains:

![Bytes per Token](tokenizer_metrics/graphs/Summary_Bytes_Token.png)
*Bytes per Token (Lower is Better)*

![Fertility](tokenizer_metrics/graphs/Summary_Fertility.png)
*Fertility (Tokens per Word, Lower is Better)*

![Speed](tokenizer_metrics/graphs/Summary_Speed.png)
*Speed in Tokens/sec (Higher is Better)*

![Fallback Rate](tokenizer_metrics/graphs/Summary_Fallback.png)
*Byte Fallback Rate (Lower is Better)*

![Vocab Gini](tokenizer_metrics/graphs/Summary_Vocab.png)
*Vocabulary Inequality (Higher = Less Balanced)*
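
For reference, the headline metrics are easy to recompute by hand. A minimal sketch (assuming whitespace word splitting; the scripts under `tokenizer_metrics/` may define the metrics differently):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./tsai_131k_tokenizer")
text = "Hello, यह एक परीक्षण है"
ids = tokenizer.encode(text)

bytes_per_token = len(text.encode("utf-8")) / len(ids)  # lower is better
fertility = len(ids) / len(text.split())                # tokens per word
print(f"{bytes_per_token:.2f} bytes/token, {fertility:.2f} tokens/word")
```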