ByteDance Bagel - Image Understanding and Generation #242
This is a follow-up on #240.
This is not ready to merge, but it should be a good starting point for adapting the structure to support image generation.
The Bagel model is not especially clean and has quite a few model-specific modules that make it difficult to rationalize, but it is IMO a good candidate for exploring new modalities (image generation, thinking, etc.).
This also makes it possible to test simple BNB (bitsandbytes) quantization, which lets the whole model fit on a 24GB GPU (unlike the official code, which offloads parts to the CPU). For reference, without any optimization, image generation runs at approx. 3 seconds per timestep on a 3090 (+ 5950X CPU); 30-50 timesteps seems to be the sweet spot.
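To make these numbers concrete, here is a rough back-of-the-envelope sketch. The function names are hypothetical helpers and not part of this PR. The 3 s/timestep figure is the measurement quoted above; the parameter count used in the usage lines is purely an assumption for illustration and should be replaced with the real count of the checkpoint being loaded. The memory estimate covers weights only and ignores activations, caches, and quantization overhead.

```python
def estimate_generation_seconds(num_timesteps: int,
                                seconds_per_timestep: float = 3.0) -> float:
    """Rough wall-clock time for one image at a fixed per-timestep cost.

    The default of 3.0 s/timestep is the unoptimized 3090 figure quoted above.
    """
    return num_timesteps * seconds_per_timestep


def estimate_weight_gib(num_params: float, bits_per_param: int) -> float:
    """Weights-only memory footprint in GiB for a given storage precision.

    Ignores activations, KV cache, and any per-block quantization metadata.
    """
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    # 30-50 timesteps at ~3 s each.
    for steps in (30, 50):
        print(f"{steps} timesteps: ~{estimate_generation_seconds(steps):.0f} s")

    # ASSUMPTION: 14e9 parameters is an illustrative value, not a figure
    # taken from this PR.
    assumed_params = 14e9
    for bits in (16, 8, 4):
        gib = estimate_weight_gib(assumed_params, bits)
        print(f"{bits}-bit weights: ~{gib:.1f} GiB")
```

Under that assumed parameter count, 16-bit weights alone would exceed 24 GiB while 8-bit and 4-bit weights would not, which is consistent with quantization being what allows everything to stay on the GPU.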
What works
- test_bagel_understanding.py
- test_bagel_generation.py

What needs to be fixed/rationalized
- inference.decode_and_generate: we should probably find a cleaner way to support this (and add support in serving mode)

What needs to be implemented/tested