This repository is adapted from the Bagel repository.
```bash
git clone https://github.com/multimodal-reasoning-lab/Bagel-Zebra-CoT.git
cd Bagel-Zebra-CoT

conda create -n bagel python=3.10 -y
conda activate bagel

pip install -r requirements.txt
pip install flash_attn --no-build-isolation
```

Set `HF_HOME` in `download_model.py` to the path where you want the checkpoint downloaded, then run:

```bash
python download_model.py
```

You can also download the checkpoint directly from Python if your `HF_HOME` has already been set:
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="multimodal-reasoning-lab/Bagel-Zebra-CoT",
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)
```

The inference script (`infz_bf16.py`) natively supports interleaved text and visual reasoning. To customize it for your specific use case:
Update the checkpoint path to point to your model:

```python
checkpoint_dir = "/path/to/your/HF_HOME/models/Bagel-Zebra-CoT"
```

For example, under `HF_HOME` the downloaded snapshot lives at:

```python
checkpoint_dir = f"{HF_HOME}/models--multimodal-reasoning-lab--Bagel-Zebra-CoT/snapshots/c1ff3c56dd5909841523e3a6b554c77d919c2b28"
```

You can also use the local directory:

```python
checkpoint_dir = f"{HF_HOME}/models/Bagel-Zebra-CoT"
```
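The snapshot path can also be resolved programmatically instead of hard-coding the commit hash. A minimal sketch, assuming the cache layout shown above; `resolve_checkpoint_dir` is a hypothetical helper, not part of the repo:

```python
import os

def resolve_checkpoint_dir(hf_home: str, repo_id: str) -> str:
    """Return the most recently downloaded snapshot dir for a cached repo.

    Hypothetical helper. Assumes the layout shown above:
    {hf_home}/models--{org}--{name}/snapshots/{commit_hash}/
    """
    cache_name = "models--" + repo_id.replace("/", "--")
    snapshots = os.path.join(hf_home, cache_name, "snapshots")
    # One subdirectory per downloaded revision, named by its commit hash;
    # pick the most recently modified one.
    revisions = sorted(
        os.listdir(snapshots),
        key=lambda d: os.path.getmtime(os.path.join(snapshots, d)),
    )
    return os.path.join(snapshots, revisions[-1])
```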
Edit the `prompt` and `image` variables in `infz_bf16.py` (around lines 203-211).

For single-image problems:

```python
prompt = "Your question here"
image = Image.open('path/to/your/image.png')
```

For multiple-image problems:

```python
prompt = "Your question about multiple images"
image_1 = Image.open('path/to/image1.jpg')
image_2 = Image.open('path/to/image2.jpg')
image_3 = Image.open('path/to/image3.jpg')
image = [image_1, image_2, image_3]  # List of images
```

For text-only problems:

```python
prompt = "Your text-only question"
image = None
```

You can adjust the generation parameters in the `inference_hyper` dictionary:
```python
inference_hyper = dict(
    do_sample=True,
    text_temperature=0.3,
    cfg_text_scale=4.0,
    cfg_img_scale=2.0,
    cfg_interval=[0.0, 1.0],
    timestep_shift=3.0,
    num_timesteps=50,
    cfg_renorm_min=0.0,
    cfg_renorm_type="text_channel",
)
```

For details, refer to the original Jupyter notebook here.
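The three `image` forms accepted above (a single image, a list of images, or `None`) can be collapsed into one code path with a small helper. This is a sketch of the idea only; `infz_bf16.py` may handle the cases differently internally:

```python
def normalize_images(image):
    """Normalize the three accepted forms of `image` into a list.

    Works for PIL images or any other object; returns an empty list
    for text-only problems.
    """
    if image is None:                      # text-only problem
        return []
    if isinstance(image, (list, tuple)):   # multi-image problem
        return list(image)
    return [image]                         # single-image problem
```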
An example problem:

```python
prompt = "Subtract all cylinders. Add 1 red sphere. How many objects are left?"
image = Image.open('test_images/image.png')
```

For training, run:

```bash
bash scripts/train.sh
```

For details, please refer to the original repo README.
The interleaved reasoning data customized for Zebra-CoT can be found in `think_trace_dataset.py`.
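To illustrate what "interleaved" means here, a purely schematic sample record is shown below. This layout is invented for illustration; the actual schema is defined by `think_trace_dataset.py` and may differ:

```python
# Purely illustrative layout of one interleaved reasoning sample;
# see think_trace_dataset.py for the real format.
sample = {
    "question": "Subtract all cylinders. Add 1 red sphere. How many objects are left?",
    "images": ["test_images/image.png"],  # input image paths
    "trace": [                            # reasoning alternates text and image steps
        {"type": "text",  "content": "First, locate every cylinder in the scene."},
        {"type": "image", "content": "step_1.png"},
        {"type": "text",  "content": "Remove them, then add one red sphere and count."},
    ],
}
```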
```bibtex
@inproceedings{
  li2026zebracot,
  title={Zebra-CoT: A Dataset for Interleaved Vision-Language Reasoning},
  author={Ang Li and Charles Wang and Deqing Fu and Kaiyu Yue and Zikui Cai and Wang Bill Zhu and Ollie Liu and Peng Guo and Willie Neiswanger and Furong Huang and Tom Goldstein and Micah Goldblum},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=c6XIVI3TiQ}
}
```

