Align Qwen-Image with diffusers reference#362

Merged
filipstrand merged 6 commits into filipstrand:main from ciaranbor:fix-qwen
Feb 20, 2026

Conversation

@ciaranbor
Contributor

Hey, I've been using mflux to add image model support over at exo. Thanks for the great work!

I've been getting some strange outputs from Qwen-Image at times. I believe I've seen this mentioned in issues here too, but I can't find any right now.

This PR fixes several configuration and numerical differences between the mflux Qwen-Image implementation and the diffusers reference, improving output quality for both txt2img and edit pipelines.

Diffusers also uses CONDITION_IMAGE_SIZE=1024*1024, but this made generation much slower, so I left it as is.

Example before and after outputs for:

PROMPT = "ronaldo"
NEGATIVE_PROMPT = " "
SEED = 2
HEIGHT, WIDTH = 1024, 1024
NUM_STEPS = 25
GUIDANCE = 4.0
(before: main_output, after: pr_output)

Changes

Sigma shift schedule — Qwen-Image was not applying any sigma shift (requires_sigma_shift=None), using a plain linspace noise schedule. The diffusers reference applies a dynamic exponential shift with max_shift=0.9
and max_image_seq_len=8192, plus a stretch-to-terminal that pins the final sigma at 0.02. The shift parameters are now configurable per model on ModelConfig (defaults match FLUX's existing hardcoded values, so
other models are unaffected).
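
The schedule described above can be sketched in a few lines (a minimal pure-Python sketch assuming the diffusers-style linear mu interpolation and stretch-to-terminal; the base_shift=0.5 and base_seq_len=256 defaults and the function name are illustrative assumptions, not values taken from mflux):

```python
import math

def shifted_sigmas(num_steps, image_seq_len,
                   base_shift=0.5, max_shift=0.9,
                   base_seq_len=256, max_seq_len=8192,
                   terminal=0.02):
    # Plain linspace schedule from 1 down to 1/num_steps
    sigmas = [1 - i / num_steps for i in range(num_steps)]
    # mu interpolates linearly with the image sequence length
    slope = (max_shift - base_shift) / (max_seq_len - base_seq_len)
    mu = slope * image_seq_len + (base_shift - slope * base_seq_len)
    # Dynamic exponential shift: s -> e^mu / (e^mu + (1/s - 1))
    shifted = [math.exp(mu) / (math.exp(mu) + (1 / s - 1)) for s in sigmas]
    # Stretch-to-terminal: affine rescale so the final sigma lands exactly at `terminal`
    scale = (1 - shifted[-1]) / (1 - terminal)
    return [1 - (1 - s) / scale for s in shifted]
```

With image_seq_len=8192 and num_steps=25 this leaves the first sigma at 1.0 and pins the last at 0.02.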

SDPA in text encoder — Replaced manual matmul → mask → softmax → matmul with mx.fast.scaled_dot_product_attention, avoiding materialization of the full attention weight matrix.
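
For reference, the pattern being replaced looks like the following NumPy sketch (generic manual attention, not the exact mflux code); mx.fast.scaled_dot_product_attention computes the same result in a fused kernel without ever materializing the full (L, L) weight matrix:

```python
import numpy as np

def manual_attention(q, k, v, mask=None):
    # The replaced path: matmul -> mask -> softmax -> matmul,
    # which materializes the full (L, L) attention weight matrix.
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = (q @ k.swapaxes(-1, -2)) * scale
    if mask is not None:
        scores = scores + mask  # additive mask, -inf where attention is blocked
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```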

SDPA + float32 ROPE in vision encoder — Same SDPA replacement for both chunked (windowed) and full attention paths. ROPE is now computed in float32 to reduce precision loss from bfloat16 trig operations.
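
The float32 point can be illustrated with a small sketch (hypothetical helper; pure Python floats are float64 here, the idea is simply to build the cos/sin tables in full precision and only cast down afterwards):

```python
import math

def rope_tables(positions, dim, theta=10000.0):
    # Standard rotary-embedding angle tables. bfloat16 keeps only 8 mantissa
    # bits, so computing position * inv_freq and the trig in bfloat16 loses
    # accuracy for large positions; doing it in float32 (or wider) and casting
    # the resulting cos/sin tables afterwards avoids that.
    inv_freq = [theta ** (-2 * i / dim) for i in range(dim // 2)]
    cos = [[math.cos(p * f) for f in inv_freq] for p in positions]
    sin = [[math.sin(p * f) for f in inv_freq] for p in positions]
    return cos, sin
```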

Edit prompt template — Changed use_picture_prefix from True to False to match the diffusers reference prompt format. With True, images were inserted as Picture 1: <vision_tokens> dynamically; with False, the vision
tokens are part of the template and the prompt is appended after them.
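
A hypothetical sketch of the two modes (the helper name and exact template layout are illustrative, not copied from either codebase; <|vision_start|>/<|image_pad|>/<|vision_end|> are Qwen2-VL's vision tokens):

```python
def build_edit_prompt(prompt, num_images, use_picture_prefix):
    vision = "<|vision_start|><|image_pad|><|vision_end|>"
    if use_picture_prefix:
        # Old behaviour: each image dynamically inserted as "Picture N: <vision tokens>"
        pics = "".join(f"Picture {i + 1}: {vision}" for i in range(num_images))
        return pics + prompt
    # New behaviour: vision tokens are a fixed part of the template,
    # and the user prompt is appended after them
    return vision * num_images + prompt
```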

Tokenizer config — max_length increased from 1024 to 1058 and padding set to "longest" to match the diffusers reference configuration.
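
The "longest" strategy pads each batch only up to its longest member (after truncating to max_length=1058) rather than always padding out to the maximum; a minimal sketch of the semantics (hypothetical helper, not the HF tokenizer API):

```python
def pad_longest(batch_ids, pad_id=0, max_length=1058):
    # Truncate each sequence to max_length, then pad only up to the
    # longest sequence actually present in the batch
    batch_ids = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in batch_ids)
    return [ids + [pad_id] * (longest - len(ids)) for ids in batch_ids]
```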

@filipstrand
Owner

Thanks for the contribution, this looks great! I'll run a few tests on my end and merge this afterwards if everything looks good. I've also seen some strange-looking output for some images but haven't had the time to prioritise this.

Once Qwen 2.0 is openly released, I'll probably retire the older models if the newer one is uniformly better, but it's still very nice to have this fix!

@filipstrand
Owner

filipstrand commented Feb 20, 2026

I ran the image test suite and had to update some reference images (which is expected; all looked "equally as good" as before), but one of our test cases that has multiple input images looks a bit off:

this branch (one input image, as reference):
output_qwen_edit

current main (two input images):
reference_qwen_edit_multiple_images

this branch (two input images):
output_qwen_edit_multiple_images

But it might also be that something else (pre-existing) is actually wrong with this logic, not necessarily related to this PR. This reminds me a bit of #330, where this particular test case also looked a bit off...

Maybe this is just an unlucky choice of test image/prompt/seed (I'll try to iterate a bit and see if this is the case). Regardless, I think your Ronaldo example clearly highlights that something is wrong with the current implementation, but I mostly want to see if we should update our test suite too for Qwen-Image.

Edit/update: Actually, the new reference might not be worse now that I look at it a bit more closely; it mostly feels like the scaling is "off" or stretched out, but it's hard to say this is worse in a strictly objective sense (I have not compared this with diffusers). Maybe the only "objective" thing to note is that with this branch there is now a bigger shift in perspective between test_image_generation_qwen_edit (white t-shirt) and test_image_generation_qwen_edit_multiple_images (striped shirt)...

@filipstrand
Owner

Some more tests

uv run mflux-generate-qwen-edit \
  --prompt "Make the hand fistbump the camera instead of showing a flat palm, and the man should wear this shirt. Maintain the original pose, body position, and overall stance." \
  --image-paths tests/resources/reference_upscaled.png tests/resources/shirt.jpg \
  --width 640 \
  --height 384 \
  --steps 20 \
  --guidance 2.5 \
  --seed 1111 2222 3333 \
  -q 8
Screenshot 2026-02-20 at 17 42 33

I can also reproduce your example above, which is objectively better, so I'm happy to merge this PR. I wrote about this somewhere, but specifically for Qwen-Image, I have some doubts about my current implementation in general. Out of all the models in MFLUX, this one has been the trickiest to debug and work with (also because of its size). It has also responded quite badly to stronger quantisation compared to Flux/Z-image. It was also built back in the summer, when coding tools were substantially worse than they are now.

When the 2.0 model is released, I look forward to doing a fresh implementation with that one and hope it performs better on all fronts so we can deprecate the old one.

@ciaranbor
Contributor Author

The issues are hard to reproduce, only happening for certain seeds and prompts. The changes seem to help more for txt2img than editing.

Out of interest I ran the test using diffusers and this is what I got:

diffusers_output

which is quite different again.

@filipstrand
Owner

Tried running this with the Flux2 test image just to see what it produced

uv run mflux-generate-qwen-edit \
  --prompt "Make the woman wear the eyeglasses (regular glasses, not sunglasses)" \
  --image-paths tests/resources/unsplash_person.jpg tests/resources/glasses.jpg \
  --width 1344 \
  --height 896 \
  --steps 20 \
  --guidance 1.0 \
  --seed 45 46 47 \
  -q 8
Screenshot 2026-02-20 at 19 11 57

Both are honestly pretty bad; there is perhaps more variation in the types of glasses (compared to the reference input glasses) in this PR, but I honestly don't know how much that says.

This is Flux2 Klein in comparison: much more consistent and much faster (same settings but 4 steps instead of 20):

uv run mflux-generate-flux2-edit \
  --model flux2-klein-9b \
  --prompt "Make the woman wear the eyeglasses (regular glasses, not sunglasses)" \
  --image-paths tests/resources/unsplash_person.jpg tests/resources/glasses.jpg \
  --width 1344 \
  --height 896 \
  --steps 4 \
  --guidance 1.0 \
  --seed 45 46 47 \
  -q 8
Screenshot 2026-02-20 at 19 19 45

@filipstrand
Owner

Out of interest I ran the test using diffusers and this is what I got

Oh yes, I remember that "zoomed in" effect, which also seemed very weird coming from diffusers (I remember seeing that same effect on other images too)... That made the whole port even harder since we couldn't necessarily rely on the reference.

@filipstrand
Owner

filipstrand commented Feb 20, 2026

@ciaranbor Out of curiosity, in the regular txt2img case, have you noticed any regressions, i.e. cases where this PR produces "worse" results compared to main (like an opposite of your Ronaldo case)?

@ciaranbor
Contributor Author

No, it's always been comparable to or better than current main, and there haven't been any catastrophic failures like the Ronaldo case on main. It's difficult to be confident about it, though.

Mostly what I did was serialize tensors throughout the diffusion process, compare between diffusers and mflux, and keep iterating until I was confident the remaining divergence was explained by floating-point differences. Since it seems quite possible that there are issues with the diffusers implementation too (or the model is just bad), it's hard to say if it's truly better.

@filipstrand
Owner

Good to hear. Then I feel pretty confident merging this. I'm running the full test suite right now with some updated test images, and if everything looks fine then it should be good to merge.

@filipstrand
Owner

Test suite looks good with the updated references! Thank you @ciaranbor for taking the time to fix this! I will include this in the 16.6 release along with the already merged SeedVR2 7B model.

@filipstrand filipstrand merged commit 7290eb6 into filipstrand:main Feb 20, 2026
1 check passed