This repository is adapted from the Bagel repository.
```bash
git clone https://github.com/multimodal-reasoning-lab/Bagel-Zebra-CoT.git
cd Bagel-Zebra-CoT

conda create -n bagel python=3.10 -y
conda activate bagel

pip install -r requirements.txt
pip install flash_attn --no-build-isolation
```

Set `HF_HOME` in `download_model.py` to the path where you want the checkpoint downloaded, then run:

```bash
python download_model.py
```

You can also download the checkpoint directly from Python if your `HF_HOME` has already been set:
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="multimodal-reasoning-lab/Bagel-Zebra-CoT",
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)
```

The inference script (`infz_bf16.py`) natively supports interleaved text and visual reasoning. To customize it for your specific use case:
Update the checkpoint path to point to your model:

```python
checkpoint_dir = "/path/to/your/HF_HOME/models/Bagel-Zebra-CoT"
```

For example, under `HF_HOME` the downloaded snapshot lives at:

```python
checkpoint_dir = f"{HF_HOME}/models--multimodal-reasoning-lab--Bagel-Zebra-CoT/snapshots/c1ff3c56dd5909841523e3a6b554c77d919c2b28"
```

You can also use the local directory:

```python
checkpoint_dir = f"{HF_HOME}/models/Bagel-Zebra-CoT"
```
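The snapshot path can also be resolved programmatically instead of hard-coding the commit hash. A minimal sketch, assuming the cache layout shown above; `resolve_checkpoint_dir` is a hypothetical helper, not part of the repo:

```python
import os

def resolve_checkpoint_dir(hf_home: str, repo_id: str) -> str:
    """Return the most recently downloaded snapshot dir for a cached repo.

    Hypothetical helper. Assumes the layout shown above:
    {hf_home}/models--{org}--{name}/snapshots/{commit_hash}/
    """
    cache_name = "models--" + repo_id.replace("/", "--")
    snapshots = os.path.join(hf_home, cache_name, "snapshots")
    # One subdirectory per downloaded revision, named by its commit hash;
    # pick the most recently modified one.
    revisions = sorted(
        os.listdir(snapshots),
        key=lambda d: os.path.getmtime(os.path.join(snapshots, d)),
    )
    return os.path.join(snapshots, revisions[-1])
```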
Edit the `prompt` and `image` variables in `infz_bf16.py` (around lines 203-211).

For single-image problems:

```python
prompt = "Your question here"
image = Image.open('path/to/your/image.png')
```

For multiple-image problems:

```python
prompt = "Your question about multiple images"
image_1 = Image.open('path/to/image1.jpg')
image_2 = Image.open('path/to/image2.jpg')
image_3 = Image.open('path/to/image3.jpg')
image = [image_1, image_2, image_3]  # List of images
```

For text-only problems:

```python
prompt = "Your text-only question"
image = None
```

You can adjust the generation parameters in the `inference_hyper` dictionary:
```python
inference_hyper = dict(
    do_sample=True,
    text_temperature=0.3,
    cfg_text_scale=4.0,
    cfg_img_scale=2.0,
    cfg_interval=[0.0, 1.0],
    timestep_shift=3.0,
    num_timesteps=50,
    cfg_renorm_min=0.0,
    cfg_renorm_type="text_channel",
)
```

For details, refer to the original Jupyter notebook here.
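The three `image` forms accepted above (a single image, a list of images, or `None`) can be collapsed into one code path with a small helper. This is a sketch of the idea only; `infz_bf16.py` may handle the cases differently internally:

```python
def normalize_images(image):
    """Normalize the three accepted forms of `image` into a list.

    Works for PIL images or any other object; returns an empty list
    for text-only problems.
    """
    if image is None:                      # text-only problem
        return []
    if isinstance(image, (list, tuple)):   # multi-image problem
        return list(image)
    return [image]                         # single-image problem
```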
An example problem:

```python
prompt = "Subtract all cylinders. Add 1 red sphere. How many objects are left?"
image = Image.open('test_images/image.png')
```

For training, run:

```bash
bash scripts/train.sh
```

For details, please refer to the original repo README.
The interleaved reasoning data customized for Zebra-CoT can be found in `think_trace_dataset.py`.
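To illustrate what "interleaved" means here, a purely schematic sample record is shown below. This layout is invented for illustration; the actual schema is defined by `think_trace_dataset.py` and may differ:

```python
# Purely illustrative layout of one interleaved reasoning sample;
# see think_trace_dataset.py for the real format.
sample = {
    "question": "Subtract all cylinders. Add 1 red sphere. How many objects are left?",
    "images": ["test_images/image.png"],  # input image paths
    "trace": [                            # reasoning alternates text and image steps
        {"type": "text",  "content": "First, locate every cylinder in the scene."},
        {"type": "image", "content": "step_1.png"},
        {"type": "text",  "content": "Remove them, then add one red sphere and count."},
    ],
}
```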
```bibtex
@inproceedings{
  li2026zebracot,
  title={Zebra-CoT: A Dataset for Interleaved Vision-Language Reasoning},
  author={Ang Li and Charles Wang and Deqing Fu and Kaiyu Yue and Zikui Cai and Wang Bill Zhu and Ollie Liu and Peng Guo and Willie Neiswanger and Furong Huang and Tom Goldstein and Micah Goldblum},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=c6XIVI3TiQ}
}
```

