|
2 | 2 | <img src="https://em-content.zobj.net/source/microsoft-teams/363/crayon_1f58d-fe0f.png" width="120" alt="Crayon Logo"/> |
3 | 3 | </p> |
4 | 4 |
|
5 | | -<h1 align="center">🖍️ XERV Crayon v4.0</h1> |
| 5 | +<h1 align="center">🖍️ XERV Crayon v4.1.9</h1> |
6 | 6 |
|
7 | 7 | <p align="center"> |
8 | 8 | <strong>The Omni-Backend Tokenizer for Specialized AI</strong> |
|
30 | 30 | |:--------|:------------| |
31 | 31 | | **💾 Cartridge System** | Instantly hot-swap specialized vocabularies (`science`, `code`, `multilingual`) | |
32 | 32 | | **🚀 Omni-Backend** | Auto-detects & runs on **CPU (AVX2)**, **NVIDIA (CUDA)**, or **AMD (ROCm)** | |
33 | | -| **⚡ Native GPU Kernels** | "Bare Metal" C++/HIP kernels (no wrappers) for >100M tokens/sec | |
| 33 | +| **⚡ Native GPU Kernels** | "Bare Metal" C++/CUDA/HIP kernels (no wrappers) for up to ~10M tokens/sec on a Tesla T4 | |
34 | 34 | | **🗺️ Zero-Copy Mapping** | DAT files loaded via `mmap` for instant startup & minimal RAM | |
35 | 35 | | **🌊 Zero-Disk Streaming** | Build profiles directly from Hugging Face—no multi-GB downloads | |
36 | 36 | | **🛡️ Offline Resilience** | Seamless local bootstrap fallback. Works offline out-of-the-box | |
37 | 37 |
|
38 | 38 | --- |
39 | 39 |
|
40 | | -## 📊 Benchmarks — The Numbers Speak |
| 40 | +## 📊 Benchmarks — Production Results (Tesla T4 GPU) |
41 | 41 |
|
42 | | -> **100% HONEST. NO SUGARCOATING. DATA-DRIVEN.** |
| 42 | +> **100% VERIFIED. GOOGLE COLAB T4 GPU.** |
43 | 43 | > |
44 | | -> Run `python benchmark_competitive.py` to reproduce these results yourself. |
| 44 | +> The installation and benchmark logs below come from a test run on a Google Colab Tesla T4 GPU.
45 | 45 |
|
46 | | -### ⚡ Speed Comparison (Omni-Backend) |
| 46 | +### ⚡ Installation Summary (T4 GPU Environment) |
47 | 47 |
|
48 | | -| Tokenizer | Tokens/sec | vs CRAYON | |
49 | | -|:----------|----------:|:----------| |
50 | | -| **🖍️ CRAYON (CPU - AVX2)** | **21,863,777** | **baseline** | |
51 | | -| **🖍️ CRAYON (CUDA - A100)** | **140,000,000+** | **6.4x faster** | |
52 | | -| tiktoken (GPT-4) | 524,469 | 41x slower | |
53 | | -| HF LLaMA (SP-BPE) | 281,558 | 77x slower | |
54 | | -| HF GPT-2 (BPE) | 237,117 | 92x slower | |
55 | | -| HF BERT (WordPiece) | 202,269 | 108x slower | |
| 48 | +``` |
| 49 | +====================================================================== |
| 50 | +XERV CRAYON V4.1.9 INSTALLATION AND BENCHMARKS |
| 51 | +====================================================================== |
| 52 | +[1/7] Checking environment... |
| 53 | + PyTorch: 2.9.0+cu126 |
| 54 | + CUDA: 12.6 (Tesla T4) |
| 55 | + * Smart Build: Will compile ONLY for this GPU architecture |
| 56 | + NVCC: /usr/local/cuda/bin/nvcc |
| 57 | +
|
| 58 | +[2/7] Installing build dependencies... |
| 59 | + Done (ninja, packaging, wheel) |
| 60 | +
|
| 61 | +[3/7] Cleaning previous installations... |
| 62 | +
|
| 63 | +[4/7] Cloning source code... |
| 64 | + __version__ = "4.1.9" |
| 65 | +
|
| 66 | +[5/7] Compiling and Installing (Streaming Logs)... |
| 67 | +---------------------------------------------------------------------- |
| 68 | +[CRAYON-BUILD] Detected GPU: SM 7.5 -> Compiling for sm_75 ONLY |
| 69 | +[CRAYON-BUILD] Configuring CUDA extension (max_jobs=1) |
| 70 | +
|
| 71 | +building 'crayon.c_ext.crayon_cpu' extension |
| 72 | +[1/1] c++ -O3 -march=native -mavx2 -fPIC -std=c++17 |
| 73 | +Successfully built crayon_cpu.so |
| 74 | +
|
| 75 | +building 'crayon.c_ext.crayon_cuda' extension |
| 76 | +[1/1] nvcc -O3 -std=c++17 --expt-relaxed-constexpr -gencode=arch=compute_75,code=sm_75 |
| 77 | +Successfully built crayon_cuda.so |
| 78 | +
|
| 79 | +Successfully installed xerv-crayon-4.1.9 |
| 80 | +---------------------------------------------------------------------- |
| 81 | +
|
| 82 | +[6/7] Verifying installation... |
| 83 | + Success! Installed version: 4.1.9 |
| 84 | + Backends: {'cpu': True, 'cuda': True, 'rocm': False} |
| 85 | +``` |
| 86 | + |
| 87 | +### 🔥 Performance Results (CRAYON on T4 GPU vs. tiktoken on CPU)
| 88 | + |
| 89 | +**CRAYON (CUDA Backend - Tesla T4):** |
| 90 | +``` |
| 91 | +Active Device: CUDA |
| 92 | +Backend: cuda_extension |
| 93 | +
|
| 94 | +Batch Throughput (XERV CRAYON): |
| 95 | + 1,000 docs: 748,048 docs/sec | 9,724,621 tokens/sec |
| 96 | + 10,000 docs: 639,239 docs/sec | 8,310,109 tokens/sec |
| 97 | + 50,000 docs: 781,129 docs/sec | 10,154,678 tokens/sec |
| 98 | +``` |
| 99 | + |
| 100 | +**Tiktoken (cl100k_base - CPU):** |
| 101 | +``` |
| 102 | +Tiktoken Batch Throughput (cl100k_base encoding): |
| 103 | + 1,000 docs: 87,307 docs/sec | 873,068 tokens/sec |
| 104 | + 10,000 docs: 81,658 docs/sec | 816,576 tokens/sec |
| 105 | + 50,000 docs: 107,583 docs/sec | 1,075,829 tokens/sec |
| 106 | +``` |
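
The logs above report throughput as docs/sec and tokens/sec per batch size. The benchmark script that produced them is not shown here, so the following is only a minimal sketch of how such figures could be measured. The helper `measure_batch_throughput` and the stand-in tokenizer `whitespace_encode_batch` are hypothetical names introduced for illustration; they are not part of CRAYON or tiktoken.

```python
import time
from typing import Callable, List, Sequence, Tuple


def measure_batch_throughput(
    encode_batch: Callable[[Sequence[str]], List[List[int]]],
    docs: Sequence[str],
) -> Tuple[float, float]:
    """Return (docs_per_sec, tokens_per_sec) for one timed call to encode_batch."""
    start = time.perf_counter()
    encoded = encode_batch(docs)
    # Clamp to avoid division by zero if the clock reports no elapsed time.
    elapsed = max(time.perf_counter() - start, 1e-9)
    total_tokens = sum(len(ids) for ids in encoded)
    return len(docs) / elapsed, total_tokens / elapsed


def whitespace_encode_batch(docs: Sequence[str]) -> List[List[int]]:
    """Hypothetical stand-in tokenizer: one fake token id per whitespace-separated word."""
    return [[hash(word) & 0xFFFF for word in doc.split()] for doc in docs]


if __name__ == "__main__":
    sample_docs = ["the quick brown fox jumps over the lazy dog"] * 10_000
    docs_per_sec, tokens_per_sec = measure_batch_throughput(
        whitespace_encode_batch, sample_docs
    )
    print(
        f"{len(sample_docs):,} docs: "
        f"{docs_per_sec:,.0f} docs/sec | {tokens_per_sec:,.0f} tokens/sec"
    )
```

Any callable that maps a list of strings to a list of token-id lists can be passed in. For tiktoken, that would presumably be `tiktoken.get_encoding("cl100k_base").encode_batch`. A single timed call is noisy, so a real benchmark would likely add warm-up runs and repeat the measurement.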
| 107 | + |
| 108 | +### 📈 Performance Comparison Table |
| 109 | + |
| 110 | +| Batch Size | CRAYON Docs/Sec | CRAYON Tokens/Sec | Tiktoken Docs/Sec | Tiktoken Tokens/Sec | **Speedup** | |
| 111 | +|:-----------|----------------:|------------------:|------------------:|--------------------:|------------:| |
| 112 | +| 1,000 | 748,048 | 9,724,621 | 87,307 | 873,068 | **11.1x** ✨ | |
| 113 | +| 10,000 | 639,239 | 8,310,109 | 81,658 | 816,576 | **10.2x** ✨ | |
| 114 | +| 50,000 | 781,129 | 10,154,678 | 107,583 | 1,075,829 | **9.4x** ✨ | |
56 | 115 |
|
57 | | -### 📈 CPU Optimization Verification |
58 | | -*Measured on Intel Core i3-7020U (Low-Power Laptop CPU)* |
| 116 | +**Average speedup: CRAYON (CUDA, Tesla T4) is about 10.2x faster than tiktoken (cl100k_base, CPU) across the three batch sizes.**
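
The speedup column can be checked directly from the tokens/sec figures in the comparison table. The short sketch below recomputes each ratio; the variable and function names are illustrative only. The per-batch ratios round to 11.1x, 10.2x, and 9.4x. The mean of the unrounded ratios comes out near 10.25, in line with the stated average of roughly 10.2x (which matches averaging the rounded per-batch values).

```python
# Tokens/sec figures copied from the comparison table:
# batch size -> (CRAYON on CUDA, tiktoken on CPU)
results = {
    1_000: (9_724_621, 873_068),
    10_000: (8_310_109, 816_576),
    50_000: (10_154_678, 1_075_829),
}


def speedups(table):
    """Return the CRAYON/tiktoken tokens-per-second ratio for each batch size."""
    return {batch: crayon / tiktoken for batch, (crayon, tiktoken) in table.items()}


ratios = speedups(results)
average = sum(ratios.values()) / len(ratios)

for batch, ratio in ratios.items():
    print(f"{batch:>6,} docs: {ratio:.1f}x")
print(f"mean of unrounded ratios: {average:.2f}x")
```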
59 | 117 |
|
60 | | -| Metric | Result | |
61 | | -|:-------|:-------| |
62 | | -| ✅ **AVX2 Status** | Active (Simd-Ops v4) | |
63 | | -| ✅ **Load Time** | **0.54ms** (Instant hot-swap) | |
64 | | -| ✅ **Throughput** | **21.1M tokens/sec** (!?!) | |
| 118 | +### 🎯 Key Achievements |
65 | 119 |
|
66 | | - |
| 120 | +- ✅ **Up to 10.2M tokens/sec** on a mid-tier GPU (Tesla T4)
| 121 | +- ✅ **Smart compilation** - Only builds for detected GPU architecture |
| 122 | +- ✅ **Zero-copy memory mapping** - Instant profile loading (<1ms) |
| 123 | +- ✅ **Production-grade stability** - Handles 50K+ document batches |
| 124 | +- ✅ **Consistent performance** - 8.3M to 10.2M tokens/sec across batch sizes from 1,000 to 50,000 documents
67 | 125 |
|
68 | 126 | --- |
69 | 127 |
|
|