zip2zip enables inference-time adaptive token vocabularies for large language models (LLMs): the vocabulary is dynamically augmented during decoding, which reduces the number of decoding steps and speeds up inference.
- Dynamic vocabulary adaptation during inference
- LZW-based token compression (see the sketch after this list)
- Support for various encoder configurations
- Integration with Hugging Face's transformers library
- Compatible with PEFT (Parameter-Efficient Fine-Tuning) models
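
For intuition, here is a minimal sketch of LZW compression applied to a sequence of token IDs. It illustrates the general algorithm only, not zip2zip's actual implementation; the function name and signature below are ours.

```python
# Minimal sketch of LZW compression over token IDs -- an illustration of the
# general algorithm, NOT zip2zip's actual implementation (names are ours).
def lzw_compress(token_ids, base_vocab_size):
    codebook = {}                # multi-token phrase -> hypertoken ID
    next_id = base_vocab_size    # hypertoken IDs start above the base vocabulary
    out = []
    phrase = (token_ids[0],)
    for tok in token_ids[1:]:
        candidate = phrase + (tok,)
        if candidate in codebook:
            phrase = candidate   # keep extending the longest known phrase
        else:
            # emit the (hyper)token for the current phrase, register the new phrase
            out.append(phrase[0] if len(phrase) == 1 else codebook[phrase])
            codebook[candidate] = next_id
            next_id += 1
            phrase = (tok,)
    out.append(phrase[0] if len(phrase) == 1 else codebook[phrase])
    return out, codebook

# Repeated patterns collapse into single hypertokens:
# lzw_compress([5, 6, 5, 6, 5, 6], base_vocab_size=100) -> ([5, 6, 100, 100], ...)
```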
You can install zip2zip using pip:
```bash
pip install zip2zip
```
| zip2zip | Corresponding HF class |
|---|---|
| `Zip2ZipModel` | `AutoModelForCausalLM` |
| `Zip2ZipTokenizer` | `AutoTokenizer` |
| `Zip2ZipConfig` | `AutoConfig` |
| `Zip2ZipModel.from_pretrained` | `AutoModelForCausalLM.from_pretrained` |
| `Zip2ZipTokenizer.from_pretrained` | `AutoTokenizer.from_pretrained` |
| `Zip2ZipConfig.from_pretrained` | `AutoConfig.from_pretrained` |
| Size | Model | HF Hub |
|---|---|---|
| 3.8B | Phi-3.5-mini-instruct-v0.1 | epfl-dlab/zip2zip-Phi-3.5-mini-instruct-v0.1 |
| 14B | Phi-3-medium-instruct-v0.1 | epfl-dlab/zip2zip-Phi-3-medium-instruct-v0.1 |
| ... | ... | epfl-dlab/zip2zip-models |
```python
import torch
from zip2zip import Zip2ZipModel, Zip2ZipTokenizer

pretrained_model_url = "epfl-dlab/zip2zip-Phi-3.5-mini-instruct-v0.1"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Initialize tokenizer
tokenizer = Zip2ZipTokenizer.from_pretrained(pretrained_model_url)

# Initialize model
model = Zip2ZipModel.from_pretrained(pretrained_model_url, device_map=device)

# Generate text
inputs = tokenizer("Write a MultiHeadAttention layer in PyTorch", return_tensors="pt").to(device)
outputs = model.generate(**inputs)

# Decode with coloring and print the result
generated_text = tokenizer.color_decode(outputs)
print(generated_text)
```
You can apply quantization to the model to reduce memory usage, just as you would with HF models.
```python
model = Zip2ZipModel.from_pretrained(pretrained_model_url, device_map="auto", load_in_8bit=True)
```
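
Note that recent transformers releases deprecate the bare `load_in_8bit` kwarg in favor of `quantization_config`. Assuming `Zip2ZipModel.from_pretrained` forwards quantization kwargs to the underlying HF loader (unverified here, though the mirrored API suggests it), the equivalent call would look like this sketch:

```python
import torch
from transformers import BitsAndBytesConfig
from zip2zip import Zip2ZipModel

# Sketch: assumes Zip2ZipModel.from_pretrained forwards quantization kwargs
# to the underlying transformers loader, as the mirrored API suggests.
quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = Zip2ZipModel.from_pretrained(
    "epfl-dlab/zip2zip-Phi-3.5-mini-instruct-v0.1",
    device_map="auto",
    quantization_config=quant_config,
)
```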
We provide some examples in the `examples` folder.
We provide a script to evaluate the performance of the model, compatible with lm-evaluation-harness.
To run the evaluation, you need to install the zip2zip fork of lm-evaluation-harness (the original one is not compatible with zip2zip).
```bash
pip install git+https://github.com/epfl-dlab/zip2zip_lm_eval.git
```
Then, you can run the evaluation:
```bash
python bench/run_lm_eval.py
```
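
The flags accepted by `bench/run_lm_eval.py` are not documented here; assuming the script mirrors the upstream lm-evaluation-harness CLI, a run might look like the following (the task name and flags are illustrative, not taken from the zip2zip repository):

```bash
# Hypothetical flags -- assumes the script mirrors the upstream
# lm-evaluation-harness CLI; check the script's --help for the real interface.
python bench/run_lm_eval.py \
    --model_args pretrained=epfl-dlab/zip2zip-Phi-3.5-mini-instruct-v0.1 \
    --tasks hellaswag \
    --batch_size 8
```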
```bibtex
@misc{geng2025zip2zipinferencetimeadaptivevocabularies,
  title={zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression},
  author={Saibo Geng and Nathan Ranchin and Yunzhen Yao and Maxime Peyrard and Chris Wendler and Michael Gastpar and Robert West},
  year={2025},
  eprint={2506.01084},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.01084},
}
```