Tokenization converts text into discrete units (tokens) that language models can process. Traditional approaches use fixed vocabularies (BPE, WordPiece), but modern byte-level methods eliminate the need for predetermined vocabularies.
The choice of tokenization strategy significantly affects model performance, efficiency, vocabulary size, and the ability to handle multiple languages or domains. Recent approaches operate directly on bytes, eliminating the learned vocabulary entirely.
Use byte-level tokenization when you need:
- True Multilinguality: Handle any language without vocabulary constraints
- Robustness: Process any byte sequence (text, code, binary data)
- No Preprocessing: Eliminate tokenizer training and maintenance
- Long-tail Handling: Better performance on rare words/characters
- Simplicity: Single model for all languages and domains
The Byte Latent Transformer (BLT) uses entropy-based dynamic patching to group bytes into variable-length patches, with a latent transformer processing the patch-level representations.
Strengths:
- Dynamic patch sizes (adaptive to content complexity)
- Better scaling than fixed tokenization
- No tokenizer vocabulary
- Efficient handling of both simple and complex content
Weaknesses:
- Complex implementation (entropy computation, patching)
- Training requires careful tuning
- Inference overhead from dynamic patching
- Less mature than traditional tokenization
Use when: You want state-of-the-art byte-level modeling with adaptive granularity, especially for mixed-complexity content.
See: byte_latent_transformer.md
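A toy sketch of the entropy-patching idea described above. Note the simplifications: the real BLT scores next-byte entropy with a small learned byte LM, whereas this stand-in uses a unigram model over the sequence itself, and the function names and threshold are illustrative assumptions:

```python
import math
from collections import Counter

def byte_entropies(data: bytes) -> list[float]:
    """Per-byte surprisal (-log2 p) under a unigram byte model -- a crude
    stand-in for the small autoregressive byte LM used in the BLT paper."""
    counts = Counter(data)
    total = len(data)
    return [-math.log2(counts[b] / total) for b in data]

def entropy_patches(data: bytes, threshold: float) -> list[bytes]:
    """Start a new patch whenever a byte's surprisal exceeds the threshold,
    so hard-to-predict regions get short patches and easy ones get long patches."""
    ents = byte_entropies(data)
    patches, start = [], 0
    for i in range(1, len(data)):
        if ents[i] > threshold:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches

# The rare 'X' is high-surprisal, so a new patch begins at it.
patches = entropy_patches(b"aaaaaaaaXbbbbbbbb", threshold=2.0)
assert patches == [b"aaaaaaaa", b"Xbbbbbbbb"]
```

This shows why the threshold matters for structured data: repetitive content yields long patches (cheap to process at the latent level), while novel content gets fine-grained patches.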
MambaByte applies Mamba (a selective state space model) directly to raw bytes, leveraging SSM efficiency for long-range byte-level modeling.
Strengths:
- Efficient long-sequence modeling (SSM benefits)
- Simpler architecture than BLT (no dynamic patching)
- Truly language-agnostic
- Better scaling than byte-level transformers
Weaknesses:
- SSM complexity (harder to implement)
- May underperform on short sequences
- Limited to sequential processing
- Newer architecture (fewer resources)
Use when: You need efficient byte-level modeling for long sequences, or want to leverage SSM benefits for tokenizer-free models.
See: mambabyte.md
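The SSM recurrence at the heart of this approach can be sketched with a per-channel diagonal state space scanned over raw bytes. This is illustrative only: real Mamba uses input-dependent (selective) parameters and a hardware-aware parallel scan, and every shape and constant below is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_state = 16, 4           # embedding width, SSM state size per channel

embed = rng.normal(size=(256, d_model))           # one embedding per byte value
A = -np.exp(rng.normal(size=(d_model, d_state)))  # negative rates -> stable decay
B = rng.normal(size=(d_model, d_state)) * 0.1     # input projection
C = rng.normal(size=(d_model, d_state)) * 0.1     # readout projection
dt = 0.1                                          # discretization step

def ssm_forward(byte_seq: bytes) -> np.ndarray:
    """Run a diagonal SSM over raw bytes, one recurrence step per byte."""
    x = embed[np.frombuffer(byte_seq, dtype=np.uint8)]   # (T, d_model)
    h = np.zeros((d_model, d_state))
    decay = np.exp(dt * A)                               # discretized A (diagonal)
    ys = []
    for x_t in x:                                        # sequential scan
        h = decay * h + dt * B * x_t[:, None]            # state update
        ys.append((C * h).sum(-1))                       # per-channel readout
    return np.stack(ys)                                  # (T, d_model)

out = ssm_forward("hello".encode("utf-8"))
assert out.shape == (5, d_model)
```

The recurrent state is fixed-size regardless of sequence length, which is the source of the long-context efficiency claimed above; Mamba's selectivity (making A, B, C depend on the input) is what this sketch omits.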
| Feature | BPE/WordPiece | BLT | MambaByte |
|---|---|---|---|
| Vocabulary | Fixed (30K-100K) | None (256 byte values) | None (256 byte values) |
| Multi-lingual | Limited | Excellent | Excellent |
| Robustness | Poor (OOV) | Excellent | Excellent |
| Sequence Length | Shorter | Medium (patches) | Longer (efficient) |
| Inference Speed | Fast | Medium | Fast-Medium |
| Training Complexity | Low | High | Medium-High |
| Implementation | Simple | Complex | Medium |
| Feature | Byte Latent Transformer | MambaByte |
|---|---|---|
| Architecture | Transformer + patching | Mamba SSM |
| Patch Strategy | Dynamic (entropy) | Fixed byte-level |
| Efficiency | Medium | High (SSM) |
| Long Context | Medium | High |
| Implementation | Complex | Medium |
| Maturity | Cutting-edge (2024) | Recent (2024) |
| Best Use Case | Mixed complexity | Long sequences |
- Pretraining Data: Byte-level models benefit from diverse, multilingual data
- Sequence Length: Start with shorter sequences during training, gradually increase
- Batch Size: Use a larger batch budget (in total elements) than token-based models, since each byte carries less information than a subword token
- Learning Rate: Lower learning rates often work better for byte-level models
- Evaluation: Evaluate on bytes-per-character and bits-per-byte metrics
- Entropy Threshold Tuning: Adjust threshold based on content type (lower for structured data)
- Patch Size Limits: Set max/min patch sizes appropriate for your domain
- Local Encoder Depth: Deeper local encoders for complex within-patch patterns
- Latent Dimension: Balance patch-level and latent-level expressiveness
- State Size: Larger state sizes for more complex dependencies
- SSM Initialization: Use structured SSM initialization for stability
- Convolution Kernel: Adjust kernel size based on local pattern complexity
- Layer Depth: More layers compensate for SSM's different inductive bias
- Byte-level models need raw bytes (UTF-8 encoding)
- No special tokenization or normalization
- Handle byte sequences up to max length
- Byte sequences are roughly 4x longer than the equivalent BPE token sequences
- Use gradient checkpointing for long sequences
- Consider sequence packing for efficiency
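The packing point above can be sketched simply: concatenate variable-length byte documents with a separator and slice the stream into fixed-size blocks so no capacity is spent on padding. The separator value and function name are assumptions:

```python
SEP = 0  # assumed separator byte; any reserved value works

def pack_sequences(docs: list[bytes], block_size: int) -> list[list[int]]:
    """Concatenate byte documents (separator between them) and cut the stream
    into full blocks of block_size; the ragged tail is dropped."""
    stream: list[int] = []
    for doc in docs:
        stream.extend(doc)       # iterating bytes yields ints in [0, 255]
        stream.append(SEP)
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

# 5 + 1 + 6 + 1 = 13 bytes -> three full blocks of 4, one trailing byte dropped
blocks = pack_sequences([b"hello", b"world!"], block_size=4)
assert len(blocks) == 3 and all(len(b) == 4 for b in blocks)
```

In practice you would also build an attention or reset mask at the separators so documents do not attend across boundaries; that detail is omitted here.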
- Warmup learning rate for stability
- Gradient clipping (bytes have different dynamics)
- Mixed precision training (FP16/BF16)
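The warmup recommendation above can be made concrete with a standard schedule: linear warmup to the base rate, then cosine decay. The specific shape and step counts are common-practice assumptions, not values from either paper:

```python
import math

def lr_at(step: int, base_lr: float, warmup: int, total: int) -> float:
    """Linear warmup to base_lr over `warmup` steps, then cosine decay to 0."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

assert lr_at(0, 3e-4, warmup=1000, total=100_000) < 3e-4        # still warming up
assert abs(lr_at(999, 3e-4, 1000, 100_000) - 3e-4) < 1e-9      # warmup complete
assert abs(lr_at(100_000, 3e-4, 1000, 100_000)) < 1e-9          # fully decayed
```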
- Sequence Length Mismatch: Forgetting that byte sequences are much longer
- Inefficient Batching: Not packing sequences efficiently
- Wrong Metrics: Using token-based metrics instead of byte-based
- Character Encoding: Mixing encodings (always use UTF-8)
- Evaluation Bias: Comparing to subword models without accounting for granularity
- Bits per Byte (BPB): Lower is better
- Bytes per Character: Efficiency for different languages
- Inference Latency: Time to generate N bytes
- Memory Usage: Peak memory during training/inference
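The two core metrics above are simple to compute. Bits-per-byte converts the mean cross-entropy loss (in nats per byte) to base 2; bytes-per-character measures how many UTF-8 bytes a given language needs per character. A minimal sketch:

```python
import math

def bits_per_byte(mean_nll_nats: float) -> float:
    """Mean per-byte cross-entropy in nats -> bits per byte (lower is better)."""
    return mean_nll_nats / math.log(2)

def bytes_per_char(text: str) -> float:
    """UTF-8 bytes per character: ~1.0 for English, up to 3-4 for CJK scripts."""
    return len(text.encode("utf-8")) / len(text)

assert abs(bits_per_byte(math.log(2)) - 1.0) < 1e-12   # ln 2 nats == 1 bit
assert bytes_per_char("abc") == 1.0
assert bytes_per_char("日本語") == 3.0
```

The bytes-per-character number is why cross-lingual comparisons need care: a byte-level model does 3x more sequential work per character on Japanese text than on English.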
- Serving Latency: Byte-level models may be slower for short text
- Input Preprocessing: Minimal (just byte encoding)
- Output Decoding: UTF-8 decoding with error handling
- Caching: Cache byte-level hidden states for faster generation
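The decoding point above matters because generation proceeds byte by byte: a multi-byte UTF-8 character can arrive split across steps. The standard-library incremental decoder buffers partial sequences instead of raising or emitting garbage:

```python
import codecs

# Streaming-safe output decoding: buffer incomplete multi-byte sequences
# across generation steps, and substitute U+FFFD for truly invalid bytes.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

chunks = [b"h\xc3", b"\xa9llo"]   # 'é' (0xC3 0xA9) split across two steps
out = "".join(decoder.decode(c) for c in chunks)
out += decoder.decode(b"", final=True)   # flush any buffered partial bytes
assert out == "héllo"
```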
- BLT Paper: https://arxiv.org/abs/2412.09871
- MambaByte Paper: https://arxiv.org/abs/2401.13660
- Mamba: https://github.com/state-spaces/mamba
- ByT5 (earlier byte-level work): https://arxiv.org/abs/2105.13626
- Multilingual Modeling: Single model for 100+ languages
- Code Generation: Handle any programming language without tokenizer updates
- Mixed Content: Process text, code, and structured data together
- Rare Language Support: Model low-resource languages without vocabulary design
- Binary Data: Process any byte sequence (not just text)