BoundlessBPE

This repository contains the code accompanying the BoundlessBPE paper. It includes Python implementations of:

  • Training and inference code for BoundlessBPE
  • Regular BPE
  • The PickyBPE variant, which includes token deletions

Please report any issues or problems to Craig Schmidt (email address in the paper).

Project Structure

boundlessbpe/
├── boundlessbpe/                    # Core Python package
│   ├── __init__.py                  # Package exports (FasterRegexInference, PickyBPE, etc.)
│   ├── uniformbase.py               # Base tokenizer class and core BPE functions
│   ├── boundlessbpetrain.py         # BoundlessBPE training with supermerges & deletions
│   ├── boundlessbpeinference.py     # BoundlessBPE inference implementation
│   ├── bpetrain.py                  # Standard BPE training implementation
│   ├── pickybpe.py                  # PickyBPE with token deletion capabilities
│   ├── regexconstants.py            # Regex patterns (GPT2, GPT4O, ULTIMATE, etc.)
│   └── util.py                      # Utility functions for bytes/string conversion
├── models/                          # Pre-trained model files (.model format) used in the paper
│                                    # note that the models with a vocab size in the name do not currently have any special tokens
├── data/                            # Test datasets (minipile.jsonl - download separately)
├── baselines/                       # Baseline comparison scripts
│   ├── train_tokenizers_gpt4o.py    # Train HuggingFace tokenizers for comparison
│   └── process_hf_tokenizers_rerun.py # Corpus token counts for HuggingFace tokenizers
├── runboundlessbpetrain.py          # Example of running BoundlessBPE training
├── runboundlessinference.py         # Example of running BoundlessBPE inference
├── runbpe.py                        # Example of running regular BPE training
├── runpickybpe.py                   # Example of running PickyBPE training
├── boundless_token_counter.py       # BoundlessBPE token counting code, which is another example of inference
├── run_boundless_token_counter.sh   # Shell script for launching BoundlessBPE token counting
└── requirements.txt                 # Python dependencies

File Overview

Core Package (boundlessbpe/)

  • __init__.py: Package entry point, exports main tokenizer classes
  • uniformbase.py: Base tokenizer class with core BPE merge logic, derived from minBPE
  • boundlessbpetrain.py: BoundlessBPE training with supermerges and deletions (FasterHalfDirectRegexTokenizer)
  • boundlessbpeinference.py: BoundlessBPE inference implementation (FasterRegexInference)
  • bpetrain.py: Standard BPE training implementation (OnePassRegexTokenizer)
  • pickybpe.py: "Picky" BPE with deletion capabilities (PickyBPE)
  • regexconstants.py: Regex patterns for different tokenization schemes
  • util.py: Helper functions for byte/string conversion and file I/O

Training & Testing Scripts

  • runboundlessbpetrain.py: Train FasterHalfDirectRegexTokenizer models
  • runboundlessinference.py: Test inference with trained models
  • runbpe.py: Train OnePassRegexTokenizer models
  • runpickybpe.py: Train PickyBPE models (no regex pattern)

Analysis & Utilities

  • boundless_token_counter.py: Count the total tokens for a BoundlessBPE tokenizer on the evaluation set
  • run_boundless_token_counter.sh: Shell script with the runs of boundless_token_counter.py used in the paper

Baselines (baselines/)

  • train_tokenizers_gpt4o.py: Train HuggingFace BPE/Unigram/WordPiece tokenizers for comparison (a generic sketch follows this list)
  • process_hf_tokenizers_rerun.py: Compute total token counts using HuggingFace tokenizers for benchmarking
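
The exact configuration in train_tokenizers_gpt4o.py is not reproduced here, but a generic HuggingFace tokenizers BPE training run looks roughly like the sketch below. The vocab_size, byte-level pre-tokenizer, and output path are illustrative assumptions; the repo's script may also extract the text field from each JSONL record rather than training on raw lines.

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Generic sketch, not the repo's exact configuration
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
trainer = trainers.BpeTrainer(vocab_size=40960)
tokenizer.train(["data/minipile.jsonl"], trainer)  # trains on raw lines, JSON syntax included
tokenizer.save("models/hf_bpe.json")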

Algorithm Variants

This repository implements several BPE variants described in the BoundlessBPE paper:

  1. FasterHalfDirectRegexTokenizer: Advanced BoundlessBPE training with supermerges and deletions
  2. FasterRegexInference: Standalone inference implementation for pre-trained models
  3. OnePassRegexTokenizer: Standard BPE training algorithm, with some extra merge selection options
  4. PickyBPE: BPE with token deletion capabilities (tau < 1.0 enables deletions); with tau > 1.0 it behaves as regular BPE, as sketched below.
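
A minimal sketch of the tau switch, assuming PickyBPE's constructor takes tau as its first argument (mirroring the FasterHalfDirectRegexTokenizer example in the Training section below; the exact signature is not documented here):

from boundlessbpe import PickyBPE

# Constructor signature assumed, not confirmed by the repo docs
picky = PickyBPE(0.9)  # tau < 1.0: token deletions enabled
plain = PickyBPE(1.1)  # tau > 1.0: deletions disabled, behaves like regular BPE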

Quick Start

Installation

# Clone the repository
git clone https://github.com/kensho-technologies/boundlessbpe.git
cd boundlessbpe

# Create and activate virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Basic Usage

Inference

from boundlessbpe import FasterRegexInference

# Create fast inference tokenizer
tokenizer = FasterRegexInference()

# Load your trained model
tokenizer.load("models/your_model.model")

# Tokenize text, ignoring any special tokens
# (use encode instead if the text contains special tokens)
text = "Hello, world! This is a test."
token_ids = tokenizer.encode_ordinary(text)
print(f"Tokens: {token_ids}")

# Decode back to text
decoded = tokenizer.decode(token_ids)
assert decoded == text

# Handle special tokens (call after training/loading)
tokenizer.register_special_tokens({"<|endoftext|>": len(tokenizer.vocab)})
tokens_with_special = tokenizer.encode("<|endoftext|>Hello world", allowed_special="all")
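
To count tokens over a whole corpus, which is what boundless_token_counter.py does, a minimal sketch along the following lines should work. It assumes the evaluation file is JSONL with one {"text": ...} record per line, which is not documented here:

import json

from boundlessbpe import FasterRegexInference

tokenizer = FasterRegexInference()
tokenizer.load("models/your_model.model")

# Sum the token counts over every record in the corpus
total = 0
with open("data/minipile.jsonl") as f:
    for line in f:
        record = json.loads(line)
        total += len(tokenizer.encode_ordinary(record["text"]))  # "text" field assumed
print(f"Total tokens: {total}")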

Training

from boundlessbpe import FasterHalfDirectRegexTokenizer
from boundlessbpe.regexconstants import GPT4O_SPLIT_PATTERN

# tau is the PickyBPE deletion threshold on the IoS score;
# set it to a value > 1.0 to disable deletions
tau = 0.9

# Create BoundlessBPE trainer
tokenizer = FasterHalfDirectRegexTokenizer(tau, pattern=GPT4O_SPLIT_PATTERN)

# Train on dataset
tokenizer.train(
    filepath="data/minipile.jsonl",
    outprefix="models/my_model",
    num_lines=100000,
    vocab_size=40960,
    recalc=1000  # How often to verify dynamic calculations
)

# Register special tokens after training
tokenizer.register_special_tokens({"<|endoftext|>": len(tokenizer.vocab)})

# Save the trained model
tokenizer.save("models/my_model.model")
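
After saving, the model can be loaded back with the standalone inference class and spot-checked. This sketch uses only the calls shown in the Inference example above:

from boundlessbpe import FasterRegexInference

inference = FasterRegexInference()
inference.load("models/my_model.model")

# Round-trip check: encoding then decoding should reproduce the input
sample = "Hello, world! This is a test."
assert inference.decode(inference.encode_ordinary(sample)) == sample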

Training Scripts

Here are some examples of how to invoke the training scripts. Keeping a logfile makes it easier to track progress during these long runs.

python -u runboundlessbpetrain.py logfile_halfdirectrerun_1000000_1000000000_131072_0.9_1000_ultimate2_1.txt 2>&1 | tee logfile_halfdirectrerun_1000000_1000000000_131072_0.9_1000_ultimate2_1.txt
python -u runboundlessbpetrain.py logfile_halfdirectrerun_1000000_1000000000_131072_0.9_1000_gpt4o_1.txt 2>&1 | tee logfile_halfdirectrerun_1000000_1000000000_131072_0.9_1000_gpt4o_1.txt
python -u runbpe.py 2>&1 | tee logfile_onepass_count_1M_40960_1.txt
python -u runpickybpe.py 2>&1 | tee logfile_pickbpe_0.9_1GB.txt

Getting the data file

  • Model files are included in the models/ directory

  • Before running any of these scripts, you need to download the minipile dataset. The easiest way is to fetch it from the Hugging Face dataset page:

    # Create data directory
    mkdir -p data
    
    # Option 1: Use huggingface-hub (recommended)
    pip install huggingface-hub
    python -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='JeanKaddour/minipile', filename='data/train.jsonl', local_dir='data', local_dir_use_symlinks=False)"
    mv data/data/train.jsonl data/minipile.jsonl
    
    # Option 2: Manual download
    # Visit https://huggingface.co/datasets/JeanKaddour/minipile/tree/main/data
    # Download train.jsonl and save as data/minipile.jsonl
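    
    # Sanity check (optional): print the first record's keys
    # (assumes one JSON object per line; the exact field names are not documented here)
    python -c "import json; print(json.loads(open('data/minipile.jsonl').readline()).keys())"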

Training and inference speed

The training code is extremely slow: it takes about 5 days to train on 1 GB of data. The inference code is also quite slow. We are actively developing a much faster version of both, so you may want to wait for that release before training on more data.

Requirements

  • Python: 3.8+
  • Dependencies:
    • Core: regex
    • Baselines (optional): tokenizers, transformers

Acknowledgments

This project builds upon and extends minBPE by Andrej Karpathy, which is MIT licensed. Several base components, particularly in uniformbase.py, are derived from the original minBPE implementation, though substantially evolved and extended for the BoundlessBPE algorithm.

License

Apache 2.0 License - see LICENSE file for details.

Citation

If you use BoundlessBPE in your research, please cite:

@misc{schmidt2025boundlessbytepairencoding,
      title={Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier}, 
      author={Craig W. Schmidt and Varshini Reddy and Chris Tanner and Yuval Pinter},
      year={2025},
      eprint={2504.00178},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.00178}, 
}

This paper was presented at COLM 2025.
