
Crayon Logo

πŸ–οΈ XERV Crayon v5.0.1

The Omni-Backend Tokenizer for Specialized AI

PyPI version License: MIT Python 3.12+ CUDA ROCm AVX2

Why force a single bloated vocabulary on every problem?
Crayon is a next-generation tokenizer designed for specialization. Hot-swap vocabulary profiles ("Cartridges") optimized for your domain: Quantum Physics, Rust Programming, Financial Law, or anything in between.


🚀 Key Features

| Feature | Description |
| --- | --- |
| 💾 Cartridge System | Instantly hot-swap specialized vocabularies (science, code, multilingual) |
| 🚀 Omni-Backend | Auto-detects & runs on CPU (AVX2), NVIDIA (CUDA), or AMD (ROCm) |
| ⚡ Hyper-Fast Trainer | C++17 linked-list BPE trains vocabularies in seconds (100x faster) |
| ⚡ Native GPU Kernels | "Bare metal" C++/CUDA/HIP kernels (no wrappers) for >10M tokens/sec |
| 🗺️ Zero-Copy Mapping | DAT files loaded via mmap for instant startup & minimal RAM |
| 🌊 Zero-Disk Streaming | Build profiles directly from Hugging Face, with no multi-GB downloads |
| 🛡️ Offline Resilience | Seamless local bootstrap fallback; works offline out of the box |

📊 Benchmarks: Production Results

DATA-DRIVEN. NO HYPE. 100% VERIFIED.

🔥 CPU Performance (Intel i3-7020U, AVX2)

Even on modest consumer hardware, Crayon's SIMD-accelerated engine outperforms industry-standard tokenizers by 67x to 209x.

| Tokenizer | Tokens/Sec | Relative to CRAYON (Science) |
| --- | --- | --- |
| CRAYON (Science) | 40,808,299 | 1.0x (baseline) |
| CRAYON (Code) | 34,742,588 | 1.2x slower |
| Tiktoken (GPT-4) | 608,610 | 67.0x slower |
| HF LLaMA | 343,282 | 118.8x slower |
| HF GPT-2 | 307,563 | 132.6x slower |
| HF BERT | 195,108 | 209.1x slower |
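Numbers like these come from simple wall-clock measurement. Below is a minimal sketch of such a harness; the whitespace-split "tokenizer" is a stand-in for illustration, not Crayon's API:

```python
import time

def measure_tokens_per_sec(tokenize, docs, repeats=3):
    """Best-of-N throughput for a tokenize(text) -> list-of-tokens callable."""
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        total = sum(len(tokenize(doc)) for doc in docs)
        elapsed = time.perf_counter() - start
        best = max(best, total / elapsed)
    return best

# Stand-in tokenizer: naive whitespace split (illustration only)
docs = ["the quick brown fox jumps over the lazy dog"] * 1_000
print(f"{measure_tokens_per_sec(str.split, docs):,.0f} tokens/sec")
```

Taking the best of several repeats reduces noise from OS scheduling; real comparisons should also pin the batch sizes and document mix across tokenizers.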

⚡ GPU Performance (Tesla T4)

⚡ Installation Summary (T4 GPU Environment)

```
======================================================================
XERV CRAYON V4.1.9 INSTALLATION AND BENCHMARKS
======================================================================
[1/7] Checking environment...
      PyTorch: 2.9.0+cu126
      CUDA: 12.6 (Tesla T4)
      * Smart Build: Will compile ONLY for this GPU architecture
      NVCC: /usr/local/cuda/bin/nvcc

[2/7] Installing build dependencies...
      Done (ninja, packaging, wheel)

[3/7] Cleaning previous installations...

[4/7] Cloning source code...
      __version__ = "4.1.9"

[5/7] Compiling and Installing (Streaming Logs)...
----------------------------------------------------------------------
[CRAYON-BUILD] Detected GPU: SM 7.5 -> Compiling for sm_75 ONLY
[CRAYON-BUILD] Configuring CUDA extension (max_jobs=1)

building 'crayon.c_ext.crayon_cpu' extension
[1/1] c++ -O3 -march=native -mavx2 -fPIC -std=c++17
Successfully built crayon_cpu.so

building 'crayon.c_ext.crayon_cuda' extension
[1/1] nvcc -O3 -std=c++17 --expt-relaxed-constexpr -gencode=arch=compute_75,code=sm_75
Successfully built crayon_cuda.so

Successfully installed xerv-crayon-4.1.9
----------------------------------------------------------------------

[6/7] Verifying installation...
      Success! Installed version: 4.1.9
      Backends: {'cpu': True, 'cuda': True, 'rocm': False}
```

🔥 Performance Results (T4 GPU vs Tiktoken)

CRAYON (CUDA Backend - Tesla T4):

```
Active Device: CUDA
Backend: cuda_extension

Batch Throughput (XERV CRAYON):
     1,000 docs:      748,048 docs/sec |      9,724,621 tokens/sec
    10,000 docs:      639,239 docs/sec |      8,310,109 tokens/sec
    50,000 docs:      781,129 docs/sec |     10,154,678 tokens/sec
```

Tiktoken (cl100k_base - CPU):

```
Tiktoken Batch Throughput (cl100k_base encoding):
     1,000 docs:       87,307 docs/sec |        873,068 tokens/sec
    10,000 docs:       81,658 docs/sec |        816,576 tokens/sec
    50,000 docs:      107,583 docs/sec |      1,075,829 tokens/sec
```

📈 Performance Comparison Table

| Batch Size | CRAYON Docs/Sec | CRAYON Tokens/Sec | Tiktoken Docs/Sec | Tiktoken Tokens/Sec | Speedup |
| --- | --- | --- | --- | --- | --- |
| 1,000 | 748,048 | 9,724,621 | 87,307 | 873,068 | 11.1x ✨ |
| 10,000 | 639,239 | 8,310,109 | 81,658 | 816,576 | 10.2x ✨ |
| 50,000 | 781,129 | 10,154,678 | 107,583 | 1,075,829 | 9.4x ✨ |

Average Speedup: 10.2x faster than tiktoken on a Tesla T4 GPU

🎯 Key Achievements

- ✅ >10M tokens/sec on a mid-tier GPU (Tesla T4)
- ✅ Smart compilation: only builds for the detected GPU architecture
- ✅ Zero-copy memory mapping: instant profile loading (<1ms)
- ✅ Production-grade stability: handles 50K+ document batches
- ✅ Consistent performance: minimal variance across batch sizes

⚡ Quick Start: The "Omni-Backend"

Run on any hardware with a single line of code. Crayon automatically detects AVX2, CUDA, or ROCm.

1. Hardware-Aware Initialization

```python
from crayon.core.vocabulary import CrayonVocab

# 🔵 CPU (Intel/AMD) - AVX2/AVX-512 native
vocab = CrayonVocab(device="cpu")

# 🟢 NVIDIA GPUs (all Tensor Core architectures)
vocab = CrayonVocab(device="cuda")

# 🔴 AMD GPUs (Instinct/Radeon, HIP/ROCm)
vocab = CrayonVocab(device="rocm")
```

2. The "Context Manager" Hot-Swap

Instantly switch between specialized vocabularies within the same script, without reloading the model.

```python
vocab = CrayonVocab(device="cpu")
vocab.load_profile("lite")

# ... standard tokenization ...

# ⚡ TEMPORARY switch to the 'code' profile for a function block
with vocab.using_profile("code"):
    tokens = vocab.tokenize("def fast_inverse_sqrt(x):")
    # Uses the compact Code vocabulary here

# 🔥 AUTOMATICALLY reverts to 'lite' here
```
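The revert-on-exit behavior shown above maps naturally onto Python's `contextlib.contextmanager`. The sketch below uses a toy `Vocab` class as a stand-in to show the pattern; it is not Crayon's actual implementation:

```python
from contextlib import contextmanager

class Vocab:
    """Toy stand-in for CrayonVocab, tracking only the active profile name."""
    def __init__(self):
        self.profile = "lite"

    def load_profile(self, name):
        self.profile = name

    @contextmanager
    def using_profile(self, name):
        previous = self.profile          # remember current profile
        self.load_profile(name)          # temporary switch
        try:
            yield self
        finally:
            self.load_profile(previous)  # always revert, even on error

vocab = Vocab()
with vocab.using_profile("code"):
    assert vocab.profile == "code"
assert vocab.profile == "lite"  # reverted automatically
```

The `try/finally` is what guarantees the revert even if tokenization raises inside the `with` block.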

3. Basic Example

```python
import json
import mmap

from crayon.c_ext.dat_builder import DATBuilder
from crayon.c_ext import crayon_cpu  # auto-renamed from crayon_fast

# Load any trained vocabulary
with open("trained_vocab_code.json", "r") as f:
    vocab_list = json.load(f)

# Compile to DAT (one-time, a few seconds)
builder = DATBuilder()
builder.build(vocab_list)
builder.save("vocab_code.dat")

# Load into the C++ engine via memory mapping (instant, <1ms)
with open("vocab_code.dat", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    crayon_cpu.load_dat(mm)

# Ultra-fast tokenization 🚀
code = 'fn main() { println!("Hello, World!"); }'
tokens = crayon_cpu.tokenize(code)
print(f"Tokens: {tokens}")
```

📦 Installation

```shell
pip install xerv-crayon
```

Google Colab / Linux Installation

Since Crayon includes high-performance C++ extensions, it compiles natively in your environment:

```shell
# Run this in a Colab cell
!pip install xerv-crayon
```

Build the Extensions

PowerShell (Windows) or Bash (Linux/Mac):

```shell
python setup.py build_ext --inplace
```

Note: The setup script auto-detects nvcc and hipcc. If either is found, the corresponding GPU backend is built automatically.
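Toolchain detection of this kind typically amounts to probing PATH for the compiler binaries. A minimal sketch of the idea, not the actual setup.py logic:

```python
import shutil

def detect_gpu_toolchains():
    """Report which GPU compilers are visible on PATH."""
    return {
        "cuda": shutil.which("nvcc") is not None,   # NVIDIA CUDA compiler
        "rocm": shutil.which("hipcc") is not None,  # AMD ROCm/HIP compiler
    }

backends = detect_gpu_toolchains()
print(backends)  # e.g. {'cuda': True, 'rocm': False} on a CUDA-only machine
```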


🏎️ Omni-Backend Architecture (v4.0)

Crayon uses a multi-backend pipeline: a JSON vocabulary is compiled once into a binary DAT (double-array trie), which is then consumed by whichever engine matches your hardware:

```
┌─────────────┐      ┌──────────────┐      ┌─────────────┐      ┌──────────────┐
│ vocab.json  │ ──▶  │ DATCompiler  │ ──▶  │  vocab.dat  │ ──▶  │ Omni-Engine  │
│   (List)    │      │ (C++ Fast)   │      │  (Binary)   │      │ CPU/CUDA/HIP │
└─────────────┘      └──────────────┘      └─────────────┘      └──────────────┘
```

| Component | File | Accelerators |
| --- | --- | --- |
| CPU Backend | c_ext/cpu_engine.cpp | AVX-512 / AVX2 (Intel/AMD) |
| CUDA Backend | c_ext/gpu_engine_cuda.cu | Tensor Cores (NVIDIA Tesla/Ampere) |
| ROCm Backend | c_ext/rocm_engine.cpp | CDNA2 / RDNA3 (AMD Instinct/Radeon) |
| Zero-Copy Loader | mmap + buffer protocol | Instant startup (~0.5ms) |
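The engine's core operation over the DAT is greedy longest-match lookup: at each position, consume the longest vocabulary entry that matches. Setting aside the double-array encoding and SIMD/GPU batching, the idea can be sketched with a plain dict-based trie; this is an illustration, not Crayon's C++ implementation:

```python
def build_trie(vocab):
    """Map each vocabulary string into a nested-dict trie; None marks a terminal."""
    root = {}
    for token_id, token in enumerate(vocab):
        node = root
        for ch in token:
            node = node.setdefault(ch, {})
        node[None] = token_id  # terminal entry stores the token id
    return root

def tokenize(text, trie):
    """Greedy longest-match tokenization; unmatched characters map to -1."""
    ids, pos = [], 0
    while pos < len(text):
        node, best, i = trie, None, pos
        while i < len(text) and text[i] in node:
            node = node[text[i]]
            i += 1
            if None in node:
                best = (i, node[None])  # record the longest match so far
        if best:
            pos, token_id = best
            ids.append(token_id)
        else:
            ids.append(-1)  # no vocabulary entry starts here
            pos += 1
    return ids

trie = build_trie(["fn", " ", "main", "()", "print"])
print(tokenize("fn main()", trie))  # → [0, 1, 2, 3]
```

A double-array trie stores the same transitions in two flat integer arrays instead of nested dicts, which is what makes the mmap'd `vocab.dat` usable with zero deserialization.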

🧩 Available Cartridges

Six production-ready profiles are defined in src/crayon/core/profiles.py:

| Profile | Size | Optimized For | Sources |
| --- | --- | --- | --- |
| standard | 57k | General English (v5 default) | Lite + top 10k subwords |
| lite | 50k | Speed & mobile | WikiText, RainDrop |
| science | 250k | Reasoning (LaTeX, quantum, grad math) | GRAD, Physics-700 |
| code | 250k | Syntax (Python, Rust, C++, JS) | CodeParrot, The Stack |
| multilingual | 250k | Global (EU langs, Chinese, Hindi) | OSCAR, Wikipedia |
| arts_commerce | 250k | Business (legal, finance, lit) | PG19, Fin Phrasebank |

```python
vocab = CrayonVocab.load_profile("science")
vocab = CrayonVocab.load_profile("multilingual")
```

☁️ Verify on Google Colab

✅ Quick Verify Snippet

```python
from crayon import CrayonVocab

# Initialize with auto-backend selection (AVX2/CUDA/ROCm)
tokenizer = CrayonVocab(device="auto")

# 1. Test the subword-heavy 'standard' profile
tokenizer.load_profile("standard")
print(tokenizer.tokenize("that is a test for the standard profile"))

# 2. Test the specialized 'code' profile
tokenizer.load_profile("code")
print(tokenizer.tokenize("def fast_inverse_sqrt(x):"))
```

🧪 Testing & Verification

```shell
# Full verification (benchmarks + tests)
python verify_dat_engine.py

# Benchmark all backends
python benchmark_competitive.py
```

```
============================================================
XERV CRAYON V4.1.9 - HYPER-PRODUCTION DAT ENGINE VERIFICATION
============================================================
Vocabulary Size: 250,000 tokens
DAT Nodes: 370,000+
Throughput: 40,808,299 tokens/sec
STATUS: ✅ HYPER-PRODUCTION READY
```

📜 Citation

```bibtex
@techreport{xerv2026crayon,
  title={XERV Crayon: A First-Principles Analysis of Production-Grade Tokenization},
  author={Pal, Soham and Xerv Research},
  year={2026},
  institution={Xerv Research Engineering Division}
}
```

📄 License

Copyright (c) 2025-2026 Xerv Research. Released under the MIT License.


Built with 💙 by Xerv Research Engineering Division

⭐ Star this repo if Crayon helps your project!
