This repository provides a clean, educational, and visual implementation of the Byte Pair Encoding (BPE) algorithm, the subword tokenization technique used by NLP models such as GPT and RoBERTa.
It merges frequent character pairs step-by-step to create an efficient subword vocabulary.
Byte Pair Encoding (BPE) is a tokenization technique that:
- Helps handle unknown words in NLP.
- Splits rare words into frequent subword units.
- Builds a vocabulary of commonly seen character combinations.
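For example, a word the model has never seen during training can still be tokenized by greedily matching it against learned subwords, falling back to single characters when nothing longer matches. A minimal sketch (the `segment` helper and the tiny vocabulary below are hypothetical, for illustration only, not part of this repository):

```python
def segment(word, vocab):
    """Greedy longest-match segmentation into known subwords.

    `vocab` is a hypothetical set of learned subwords; single
    characters always remain as a fallback, so no word is ever
    out-of-vocabulary.
    """
    tokens, i = [], 0
    while i < len(word):
        # Try the longest substring starting at i that is in the vocab
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

vocab = {"re", "new", "set"}         # hypothetical learned subwords
print(segment("renewable", vocab))   # -> ['re', 'new', 'a', 'b', 'l', 'e']
```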
BPE and closely related subword schemes are used in transformer-based models such as:
- GPT (OpenAI): byte-level BPE
- RoBERTa: byte-level BPE
- BERT (Google): WordPiece, a close relative of BPE
- T5: SentencePiece
Experience BPE in action with our interactive web interface:
- Real-time Visualization - Watch the BPE algorithm work step-by-step
- Interactive Input - Test with your own text
- Beautiful Animations - Smooth transitions and dynamic effects
- Detailed Breakdown - See merge operations, vocabulary growth, and final tokenization
- Responsive Design - Works perfectly on desktop and mobile devices
The web demo provides an intuitive way to understand how BPE constructs subword vocabularies through iterative merging.
- Simple character-level BPE logic
- Step-by-step merging visualization
- Dynamic vocabulary growth tracking
- Auto-stops when no more frequent pairs exist
- Pure Python (no external libraries)
| File | Description |
|---|---|
| `bpe_tokenizer.py` | Detailed BPE implementation with verbose logging |
| `bpe_tokenizer_2.py` | Concise, functional BPE version |
| `image.png` | Visual explanation of the BPE process |
| `index.html` | Interactive web interface for BPE visualization |
- Split each word into characters.
- Count all adjacent character pairs.
- Merge the most frequent pair into a new token.
- Repeat until no frequent pairs remain or a stop condition is reached.
- Return final vocabulary and encoded corpus.
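The steps above can be sketched in plain Python. This is not the repository's implementation, just a minimal illustration; it also omits the `_` word-boundary marker used by the scripts, so its counts may differ slightly from the logged output:

```python
from collections import Counter

def bpe_train(corpus, num_merges=10):
    """Minimal BPE training sketch over a space-separated corpus."""
    # Step 1: split each word into a list of characters
    words = [list(w) for w in corpus.split()]
    vocab = set(ch for w in words for ch in w)
    merges = []

    for _ in range(num_merges):
        # Step 2: count all adjacent symbol pairs
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:  # Step 4: stop when no pair repeats
            break
        # Step 3: merge the most frequent pair everywhere it occurs
        merged = best[0] + best[1]
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
        vocab.add(merged)
        merges.append((best, count))

    # Step 5: return vocabulary, encoded corpus, and merge history
    return sorted(vocab), words, merges

vocab, encoded, merges = bpe_train("set new new renew reset renew", num_merges=3)
print(merges[0])  # the first learned merge, e.g. (('n', 'e'), 4)
```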
Example input:

```
"set new new renew reset renew"
```

Sample output (abridged):

```
Merging: n e → ne (count 3)
Updated Corpus: _ s e t _ ne w _ re ne w ...
Updated Vocabulary: ['_', 'e', 'n', 'ne', 'r', 's', 't', 'w'] ...

Final Tokenized Corpus:
_ s e t
_ ne w
_ re ne w
...

Learned Merge Operations:
n e → ne (count 3)
...
```
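The first merge reported above can be reproduced by counting adjacent pairs with `collections.Counter`. A sketch, assuming each word is prefixed with the `_` boundary marker seen in the output (how the repository's scripts handle boundaries is an assumption here, so the raw counts may not match the logged count exactly):

```python
from collections import Counter

corpus = "set new new renew reset renew"
pairs = Counter()
for word in corpus.split():
    symbols = ["_"] + list(word)  # `_` marks the start of each word
    for a, b in zip(symbols, symbols[1:]):
        pairs[(a, b)] += 1

# The most frequent pair is the one BPE merges first
print(pairs.most_common(1)[0][0])  # -> ('n', 'e')
```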
```python
from bpe_tokenizer import byte_pair_encoding

byte_pair_encoding("set new new renew reset renew")
```

Or using the compact version:

```python
from bpe_tokenizer_2 import byte_pair_encoding

vocab, encoded = byte_pair_encoding("set new new renew reset renew", num_merges=5)
print(vocab)
print(encoded)
```

Simply open `index.html` in your browser or visit the live demo to:
- Enter your custom text
- Click "Analyze Text Now"
- Watch the algorithm visualize each merge step
- Explore the final vocabulary and tokenization results
The algorithm shows how subwords such as `ne`, `reset`, and `new` are learned and combined through frequency-based merging.
This makes the BPE tokenizer ideal for:
- Compressing large vocabularies
- Handling rare or unseen tokens
- Improving model generalization in NLP tasks
- Reducing out-of-vocabulary (OOV) issues
