Skip to content

epfl-dlab/zip2zip

Repository files navigation

zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression

Python 3.10+ arXiv Hugging Face License

zip2zip enables inference-time adaptive token vocabularies for large language models (LLMs). It allows vocabularies to be dynamically augmented at inference time, leading to reduced decoding steps and faster inference.

zip2zip decoding

Features

  • Dynamic vocabulary adaptation during inference
  • LZW-based token compression
  • Support for various encoder configurations
  • Integration with Hugging Face's transformers library
  • Compatible with PEFT (Parameter-Efficient Fine-Tuning) models

Installation

You can install zip2zip using pip:

pip install zip2zip

Usage

Same API as Hugging Face

zip2zip Corresponding HF class
Zip2ZipModel AutoModelForCausalLM
Zip2ZipTokenizer AutoTokenizer
Zip2ZipConfig AutoConfig
Zip2ZipModel.from_pretrained AutoModelForCausalLM.from_pretrained
Zip2ZipTokenizer.from_pretrained AutoTokenizer.from_pretrained
Zip2ZipConfig.from_pretrained AutoConfig.from_pretrained

Pretrained model weights

Size Model HF Hub
3.8B Phi-3.5-mini-instruct-v0.1 epfl-dlab/zip2zip-Phi-3.5-mini-instruct-v0.1
14B Llama-3.1-8B-Instruct-v0.1 epfl-dlab/zip2zip-Phi-3-medium-instruct-v0.1
... ... epfl-dlab/zip2zip-models

Run a pretrained model

import torch
from zip2zip import Zip2ZipModel, Zip2ZipTokenizer

pretrained_model_url = "epfl-dlab/zip2zip-Phi-3.5-mini-instruct-v0.1"

device = "cuda" if torch.cuda.is_available() else "cpu"

# Initialize tokenizer
tokenizer = Zip2ZipTokenizer.from_pretrained(pretrained_model_url)

# Initialize model
model = Zip2ZipModel.from_pretrained(pretrained_model_url, device_map=device)

# Generate text
inputs = tokenizer("Write a MultiHeadAttention layer in PyTorch", return_tensors="pt").to(device)
outputs = model.generate(**inputs)

# Print the coloried
generated_text = tokenizer.color_decode(outputs)

You can apply quantization to the model to reduce the memory usage just as you would do with HF models.

model = Zip2ZipModel.from_pretrained(pretrained_model_url, device_map="auto", load_in_8bit=True)

Examples

We provide some examples in the examples folder.

Evaluation

We provide a script to evaluate the performance of the model, compatible with lm-evaluation-harness.

To run the evaluation, you need to install the zip2zip fork of lm-evaluation-harness (the original one is not compatible with zip2zip).

pip install git+https://github.com/epfl-dlab/zip2zip_lm_eval.git

Then, you can run the evaluation:

python bench/run_lm_eval.py

Citation

@misc{geng2025zip2zipinferencetimeadaptivevocabularies,
      title={zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression},
      author={Saibo Geng and Nathan Ranchin and Yunzhen yao and Maxime Peyrard and Chris Wendler and Michael Gastpar and Robert West},
      year={2025},
      eprint={2506.01084},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.01084},
}

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 2

  •  
  •  

Languages