Tokenization converts text into discrete units (tokens) that language models can process. Traditional approaches use fixed vocabularies (BPE, WordPiece), but modern byte-level methods eliminate the need for predetermined vocabularies.
The choice of tokenization strategy significantly affects model performance, efficiency, vocabulary size, and the ability to handle multiple languages or domains. Recent approaches operate directly on bytes, eliminating the learned vocabulary entirely.
Use byte-level tokenization when you need:
- True Multilinguality: Handle any language without vocabulary constraints
- Robustness: Process any byte sequence (text, code, binary data)
- No Preprocessing: Eliminate tokenizer training and maintenance
- Long-tail Handling: Better performance on rare words/characters
- Simplicity: Single model for all languages and domains
The Byte Latent Transformer (BLT) uses entropy-based dynamic patching to group bytes into variable-length patches, with a latent transformer processing the patch-level representations.
Strengths:
- Dynamic patch sizes (adaptive to content complexity)
- Better scaling than fixed tokenization
- No tokenizer vocabulary
- Efficient handling of both simple and complex content
Weaknesses:
- Complex implementation (entropy computation, patching)
- Training requires careful tuning
- Inference overhead from dynamic patching
- Less mature than traditional tokenization
Use when: You want state-of-the-art byte-level modeling with adaptive granularity, especially for mixed-complexity content.
See: byte_latent_transformer.md
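A toy sketch of the entropy-patching idea described above. Note the simplifications: the real BLT scores next-byte entropy with a small learned byte LM, whereas this stand-in uses a unigram model over the sequence itself, and the function names and threshold are illustrative assumptions:

```python
import math
from collections import Counter

def byte_entropies(data: bytes) -> list[float]:
    """Per-byte surprisal (-log2 p) under a unigram byte model -- a crude
    stand-in for the small autoregressive byte LM used in the BLT paper."""
    counts = Counter(data)
    total = len(data)
    return [-math.log2(counts[b] / total) for b in data]

def entropy_patches(data: bytes, threshold: float) -> list[bytes]:
    """Start a new patch whenever a byte's surprisal exceeds the threshold,
    so hard-to-predict regions get short patches and easy ones get long patches."""
    ents = byte_entropies(data)
    patches, start = [], 0
    for i in range(1, len(data)):
        if ents[i] > threshold:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches

# The rare 'X' is high-surprisal, so a new patch begins at it.
patches = entropy_patches(b"aaaaaaaaXbbbbbbbb", threshold=2.0)
assert patches == [b"aaaaaaaa", b"Xbbbbbbbb"]
```

This shows why the threshold matters for structured data: repetitive content yields long patches (cheap to process at the latent level), while novel content gets fine-grained patches.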
MambaByte applies Mamba (a selective state space model) directly to raw bytes, leveraging SSM efficiency for long-range byte-level modeling.
Strengths:
- Efficient long-sequence modeling (SSM benefits)
- Simpler architecture than BLT (no dynamic patching)
- Truly language-agnostic
- Better scaling than byte-level transformers
Weaknesses:
- SSM complexity (harder to implement)
- May underperform on short sequences
- Limited to sequential processing
- Newer architecture (fewer resources)
Use when: You need efficient byte-level modeling for long sequences, or want to leverage SSM benefits for tokenizer-free models.
See: mambabyte.md
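The SSM recurrence at the heart of this approach can be sketched with a per-channel diagonal state space scanned over raw bytes. This is illustrative only: real Mamba uses input-dependent (selective) parameters and a hardware-aware parallel scan, and every shape and constant below is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_state = 16, 4           # embedding width, SSM state size per channel

embed = rng.normal(size=(256, d_model))           # one embedding per byte value
A = -np.exp(rng.normal(size=(d_model, d_state)))  # negative rates -> stable decay
B = rng.normal(size=(d_model, d_state)) * 0.1     # input projection
C = rng.normal(size=(d_model, d_state)) * 0.1     # readout projection
dt = 0.1                                          # discretization step

def ssm_forward(byte_seq: bytes) -> np.ndarray:
    """Run a diagonal SSM over raw bytes, one recurrence step per byte."""
    x = embed[np.frombuffer(byte_seq, dtype=np.uint8)]   # (T, d_model)
    h = np.zeros((d_model, d_state))
    decay = np.exp(dt * A)                               # discretized A (diagonal)
    ys = []
    for x_t in x:                                        # sequential scan
        h = decay * h + dt * B * x_t[:, None]            # state update
        ys.append((C * h).sum(-1))                       # per-channel readout
    return np.stack(ys)                                  # (T, d_model)

out = ssm_forward("hello".encode("utf-8"))
assert out.shape == (5, d_model)
```

The recurrent state is fixed-size regardless of sequence length, which is the source of the long-context efficiency claimed above; Mamba's selectivity (making A, B, C depend on the input) is what this sketch omits.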
| Feature | BPE/WordPiece | BLT | MambaByte |
|---|---|---|---|
| Vocabulary | Fixed (30K-100K) | None (256 byte values) | None (256 byte values) |
| Multi-lingual | Limited | Excellent | Excellent |
| Robustness | Poor (OOV) | Excellent | Excellent |
| Sequence Length | Shorter | Medium (patches) | Longer (efficient) |
| Inference Speed | Fast | Medium | Fast-Medium |
| Training Complexity | Low | High | Medium-High |
| Implementation | Simple | Complex | Medium |
| Feature | Byte Latent Transformer | MambaByte |
|---|---|---|
| Architecture | Transformer + patching | Mamba SSM |
| Patch Strategy | Dynamic (entropy) | Fixed byte-level |
| Efficiency | Medium | High (SSM) |
| Long Context | Medium | High |
| Implementation | Complex | Medium |
| Maturity | Cutting-edge (2024) | Recent (2024) |
| Best Use Case | Mixed complexity | Long sequences |
- Pretraining Data: Byte-level models benefit from diverse, multilingual data
- Sequence Length: Start with shorter sequences during training, gradually increase
- Batch Size: Use a larger batch budget (in total elements) than token-based models, since each byte carries less information than a subword token
- Learning Rate: Lower learning rates often work better for byte-level models
- Evaluation: Evaluate on bytes-per-character and bits-per-byte metrics
- Entropy Threshold Tuning: Adjust threshold based on content type (lower for structured data)
- Patch Size Limits: Set max/min patch sizes appropriate for your domain
- Local Encoder Depth: Deeper local encoders for complex within-patch patterns
- Latent Dimension: Balance patch-level and latent-level expressiveness
- State Size: Larger state sizes for more complex dependencies
- SSM Initialization: Use structured SSM initialization for stability
- Convolution Kernel: Adjust kernel size based on local pattern complexity
- Layer Depth: More layers compensate for SSM's different inductive bias
- Byte-level models need raw bytes (UTF-8 encoding)
- No special tokenization or normalization
- Handle byte sequences up to max length
- Byte sequences are roughly 4x longer than the equivalent BPE token sequences
- Use gradient checkpointing for long sequences
- Consider sequence packing for efficiency
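The packing point above can be sketched simply: concatenate variable-length byte documents with a separator and slice the stream into fixed-size blocks so no capacity is spent on padding. The separator value and function name are assumptions:

```python
SEP = 0  # assumed separator byte; any reserved value works

def pack_sequences(docs: list[bytes], block_size: int) -> list[list[int]]:
    """Concatenate byte documents (separator between them) and cut the stream
    into full blocks of block_size; the ragged tail is dropped."""
    stream: list[int] = []
    for doc in docs:
        stream.extend(doc)       # iterating bytes yields ints in [0, 255]
        stream.append(SEP)
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

# 5 + 1 + 6 + 1 = 13 bytes -> three full blocks of 4, one trailing byte dropped
blocks = pack_sequences([b"hello", b"world!"], block_size=4)
assert len(blocks) == 3 and all(len(b) == 4 for b in blocks)
```

In practice you would also build an attention or reset mask at the separators so documents do not attend across boundaries; that detail is omitted here.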
- Warmup learning rate for stability
- Gradient clipping (bytes have different dynamics)
- Mixed precision training (FP16/BF16)
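The warmup recommendation above can be made concrete with a standard schedule: linear warmup to the base rate, then cosine decay. The specific shape and step counts are common-practice assumptions, not values from either paper:

```python
import math

def lr_at(step: int, base_lr: float, warmup: int, total: int) -> float:
    """Linear warmup to base_lr over `warmup` steps, then cosine decay to 0."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

assert lr_at(0, 3e-4, warmup=1000, total=100_000) < 3e-4        # still warming up
assert abs(lr_at(999, 3e-4, 1000, 100_000) - 3e-4) < 1e-9      # warmup complete
assert abs(lr_at(100_000, 3e-4, 1000, 100_000)) < 1e-9          # fully decayed
```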
- Sequence Length Mismatch: Forgetting that byte sequences are much longer
- Inefficient Batching: Not packing sequences efficiently
- Wrong Metrics: Using token-based metrics instead of byte-based
- Character Encoding: Mixing encodings (always use UTF-8)
- Evaluation Bias: Comparing to subword models without accounting for granularity
- Bits per Byte (BPB): Lower is better
- Bytes per Character: Efficiency for different languages
- Inference Latency: Time to generate N bytes
- Memory Usage: Peak memory during training/inference
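The two core metrics above are simple to compute. Bits-per-byte converts the mean cross-entropy loss (in nats per byte) to base 2; bytes-per-character measures how many UTF-8 bytes a given language needs per character. A minimal sketch:

```python
import math

def bits_per_byte(mean_nll_nats: float) -> float:
    """Mean per-byte cross-entropy in nats -> bits per byte (lower is better)."""
    return mean_nll_nats / math.log(2)

def bytes_per_char(text: str) -> float:
    """UTF-8 bytes per character: ~1.0 for English, up to 3-4 for CJK scripts."""
    return len(text.encode("utf-8")) / len(text)

assert abs(bits_per_byte(math.log(2)) - 1.0) < 1e-12   # ln 2 nats == 1 bit
assert bytes_per_char("abc") == 1.0
assert bytes_per_char("日本語") == 3.0
```

The bytes-per-character number is why cross-lingual comparisons need care: a byte-level model does 3x more sequential work per character on Japanese text than on English.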
- Serving Latency: Byte-level models may be slower for short text
- Input Preprocessing: Minimal (just byte encoding)
- Output Decoding: UTF-8 decoding with error handling
- Caching: Cache byte-level hidden states for faster generation
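The decoding point above matters because generation proceeds byte by byte: a multi-byte UTF-8 character can arrive split across steps. The standard-library incremental decoder buffers partial sequences instead of raising or emitting garbage:

```python
import codecs

# Streaming-safe output decoding: buffer incomplete multi-byte sequences
# across generation steps, and substitute U+FFFD for truly invalid bytes.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

chunks = [b"h\xc3", b"\xa9llo"]   # 'é' (0xC3 0xA9) split across two steps
out = "".join(decoder.decode(c) for c in chunks)
out += decoder.decode(b"", final=True)   # flush any buffered partial bytes
assert out == "héllo"
```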
- BLT Paper: https://arxiv.org/abs/2412.09871
- MambaByte Paper: https://arxiv.org/abs/2401.13660
- Mamba: https://github.com/state-spaces/mamba
- ByT5 (earlier byte-level work): https://arxiv.org/abs/2105.13626
- Multilingual Modeling: Single model for 100+ languages
- Code Generation: Handle any programming language without tokenizer updates
- Mixed Content: Process text, code, and structured data together
- Rare Language Support: Model low-resource languages without vocabulary design
- Binary Data: Process any byte sequence (not just text)