ByteDance Bagel - Image Understanding and Generation #242

francoishernandez · 2025-06-04T15:40:54Z

⚠️ branch based on #238

This is a follow-up on #240.

This is not ready to merge, but it should be a good starting point for adapting the structure to support image generation.
The Bagel model is not particularly clean and has quite a few model-specific modules, which makes it difficult to rationalize, but it is IMO a good candidate for exploring new modalities (image generation, thinking, etc.).

This also makes it possible to test simple BNB (bitsandbytes) quantization, which fits the whole model on a 24GB GPU (unlike the official code, which offloads parts to CPU). For reference, without any optimization, image generation runs at approx. 3 seconds per timestep on a 3090 (+ 5950X CPU); 30-50 timesteps seems to be the sweet spot.
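To put these numbers in perspective, here is a rough back-of-the-envelope sketch. The 3 seconds per timestep and the 30-50 timestep range come from the measurements above; the parameter count of roughly 14B is an assumption (it is not stated in this PR), and the helper names are purely illustrative. The memory figure covers weights only, ignoring activations, KV cache, and any modules left unquantized.

```python
def generation_seconds(num_timesteps: int, seconds_per_step: float = 3.0) -> float:
    """Estimated denoising time for one image, ignoring prefill and VAE decode."""
    return num_timesteps * seconds_per_step


def weight_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# ASSUMPTION: ~14B total parameters; adjust to the checkpoint actually used.
ASSUMED_PARAMS = 14e9

for steps in (30, 50):
    print(f"{steps} timesteps: ~{generation_seconds(steps):.0f} s per image")

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_gib(ASSUMED_PARAMS, bits):.1f} GiB")
```

Under these assumptions, one image takes roughly 90 to 150 seconds, and 16-bit weights alone would exceed 24GB, which is consistent with the official code needing CPU offload.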

What works

simple vision understanding query (e.g. GDP image + prompt) -> test_bagel_understanding.py
simple image generation -> test_bagel_generation.py

What needs to be fixed/rationalized

position handling currently breaks other vision models; we need to find a proper condition (maybe split out classes, as previously discussed)
image autoencoder settings are hardcoded/copy-pasted from the official code
some settings are not yet read from the config
image transform logic could probably be factored together with the existing logic
the image generation codepath triggers an early exit in inference.decode_and_generate; we should probably find a cleaner way to support this (+ support in serving mode)
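One possible direction for the last point, as a minimal sketch only: route each request to a per-task handler instead of returning early from the text path. Everything here is hypothetical (the `Request` class, the `task` field, and the handler names do not exist in the codebase), and the handlers return placeholder strings rather than running a model.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Request:
    prompt: str
    task: str = "text"  # hypothetical field: "text" or "image"


def generate_text(req: Request) -> str:
    # Placeholder for the regular autoregressive decoding path.
    return f"text:{req.prompt}"


def generate_image(req: Request) -> str:
    # Placeholder for the denoising loop + autoencoder decode.
    return f"image:{req.prompt}"


# Registry of handlers; a new modality only needs a new entry here.
HANDLERS: Dict[str, Callable[[Request], str]] = {
    "text": generate_text,
    "image": generate_image,
}


def dispatch_generation(req: Request) -> str:
    """Pick the handler for the requested task, failing loudly on unknown tasks."""
    try:
        handler = HANDLERS[req.task]
    except KeyError:
        raise ValueError(f"unsupported task: {req.task!r}") from None
    return handler(req)
```

A serving endpoint could call the same dispatcher, so image generation would not need a separate entry point.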

What needs to be implemented/tested

image editing use case (+ image CFG and co)
"thinking" step support (might be useful for other models as well)
multiple image handling (understanding path, probably not supported in generation path yet)
batch mode
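For context on the CFG items (text CFG is already enabled, image CFG is still to do), the standard classifier-free guidance combination can be sketched as below. This is the generic textbook formula on plain Python lists, not the Bagel implementation; any renormalization or scheduling the official code may apply is not shown, and `apply_cfg` is an illustrative name.

```python
from typing import List, Sequence


def apply_cfg(cond: Sequence[float], uncond: Sequence[float], scale: float) -> List[float]:
    """Classifier-free guidance: uncond + scale * (cond - uncond), element-wise.

    scale == 1.0 returns the conditional prediction unchanged;
    scale > 1.0 pushes the result further toward the condition.
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy predictions for a single timestep.
guided = apply_cfg(cond=[1.0, 2.0], uncond=[0.0, 1.0], scale=2.0)
print(guided)
```

Supporting both text and image CFG generally means running additional forward passes per timestep (one per dropped condition), which is worth keeping in mind given the per-timestep cost reported above.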

francoishernandez added 3 commits May 28, 2025 11:50

WIP bagel vision understanding patches

465db9b

WIP super dirty but image generation WORKS (mostly)

0d99a39

enable text cfg, some config definitions

6983164

francoishernandez added enhancement New feature or request recipes labels Jun 4, 2025

francoishernandez force-pushed the bagel branch 3 times, most recently from 988602c to 9b87f54 Compare June 4, 2025 16:42

clean up and structure

d156b05

francoishernandez force-pushed the bagel branch from 9b87f54 to d156b05 Compare June 4, 2025 16:50
