Align Qwen-Image with diffusers reference#362

Merged
filipstrand merged 6 commits into filipstrand:main from ciaranbor:fix-qwen
Feb 20, 2026

Conversation

@ciaranbor
Contributor

Hey, I've been using mflux to add image model support over at exo. Thanks for the great work!

I've been getting some strange outputs from Qwen-Image at times. I believe I've seen this mentioned in issues here too, but I can't find any right now.

This PR fixes several configuration and numerical differences between the mflux Qwen-Image implementation and the diffusers reference, improving output quality for both txt2img and edit pipelines.

Diffusers also uses CONDITION_IMAGE_SIZE=1024*1024, but this made generation much slower, so I left it as is.

Example before and after outputs for:

PROMPT = "ronaldo"
NEGATIVE_PROMPT = " "
SEED = 2
HEIGHT, WIDTH = 1024, 1024
NUM_STEPS = 25
GUIDANCE = 4.0
(before: main_output, after: pr_output)

Changes

Sigma shift schedule — Qwen-Image was not applying any sigma shift (requires_sigma_shift=None), using a plain linspace noise schedule. The diffusers reference applies a dynamic exponential shift with max_shift=0.9
and max_image_seq_len=8192, plus a stretch-to-terminal that pins the final sigma at 0.02. The shift parameters are now configurable per model on ModelConfig (defaults match FLUX's existing hardcoded values, so
other models are unaffected).
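
The schedule described above can be sketched in a few lines (a minimal pure-Python sketch assuming the diffusers-style linear mu interpolation and stretch-to-terminal; the base_shift=0.5 and base_seq_len=256 defaults and the function name are illustrative assumptions, not values taken from mflux):

```python
import math

def shifted_sigmas(num_steps, image_seq_len,
                   base_shift=0.5, max_shift=0.9,
                   base_seq_len=256, max_seq_len=8192,
                   terminal=0.02):
    # Plain linspace schedule from 1 down to 1/num_steps
    sigmas = [1 - i / num_steps for i in range(num_steps)]
    # mu interpolates linearly with the image sequence length
    slope = (max_shift - base_shift) / (max_seq_len - base_seq_len)
    mu = slope * image_seq_len + (base_shift - slope * base_seq_len)
    # Dynamic exponential shift: s -> e^mu / (e^mu + (1/s - 1))
    shifted = [math.exp(mu) / (math.exp(mu) + (1 / s - 1)) for s in sigmas]
    # Stretch-to-terminal: affine rescale so the final sigma lands exactly at `terminal`
    scale = (1 - shifted[-1]) / (1 - terminal)
    return [1 - (1 - s) / scale for s in shifted]
```

With image_seq_len=8192 and num_steps=25 this leaves the first sigma at 1.0 and pins the last at 0.02.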

SDPA in text encoder — Replaced manual matmul → mask → softmax → matmul with mx.fast.scaled_dot_product_attention, avoiding materialization of the full attention weight matrix.
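
For reference, the pattern being replaced looks like the following NumPy sketch (generic manual attention, not the exact mflux code); mx.fast.scaled_dot_product_attention computes the same result in a fused kernel without ever materializing the full (L, L) weight matrix:

```python
import numpy as np

def manual_attention(q, k, v, mask=None):
    # The replaced path: matmul -> mask -> softmax -> matmul,
    # which materializes the full (L, L) attention weight matrix.
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = (q @ k.swapaxes(-1, -2)) * scale
    if mask is not None:
        scores = scores + mask  # additive mask, -inf where attention is blocked
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```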

SDPA + float32 ROPE in vision encoder — Same SDPA replacement for both chunked (windowed) and full attention paths. ROPE is now computed in float32 to reduce precision loss from bfloat16 trig operations.
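
The float32 point can be illustrated with a small sketch (hypothetical helper; pure Python floats are float64 here, the idea is simply to build the cos/sin tables in full precision and only cast down afterwards):

```python
import math

def rope_tables(positions, dim, theta=10000.0):
    # Standard rotary-embedding angle tables. bfloat16 keeps only 8 mantissa
    # bits, so computing position * inv_freq and the trig in bfloat16 loses
    # accuracy for large positions; doing it in float32 (or wider) and casting
    # the resulting cos/sin tables afterwards avoids that.
    inv_freq = [theta ** (-2 * i / dim) for i in range(dim // 2)]
    cos = [[math.cos(p * f) for f in inv_freq] for p in positions]
    sin = [[math.sin(p * f) for f in inv_freq] for p in positions]
    return cos, sin
```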

Edit prompt template — Changed use_picture_prefix from True to False to match the diffusers reference prompt format. With True, images were inserted as Picture 1: <vision_tokens> dynamically; with False, the vision
tokens are part of the template and the prompt is appended after them.
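
A hypothetical sketch of the two modes (the helper name and exact template layout are illustrative, not copied from either codebase; <|vision_start|>/<|image_pad|>/<|vision_end|> are Qwen2-VL's vision tokens):

```python
def build_edit_prompt(prompt, num_images, use_picture_prefix):
    vision = "<|vision_start|><|image_pad|><|vision_end|>"
    if use_picture_prefix:
        # Old behaviour: each image dynamically inserted as "Picture N: <vision tokens>"
        pics = "".join(f"Picture {i + 1}: {vision}" for i in range(num_images))
        return pics + prompt
    # New behaviour: vision tokens are a fixed part of the template,
    # and the user prompt is appended after them
    return vision * num_images + prompt
```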

Tokenizer config — max_length increased from 1024 to 1058 and padding set to "longest" to match the diffusers reference configuration.
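
The "longest" strategy pads each batch only up to its longest member (after truncating to max_length=1058) rather than always padding out to the maximum; a minimal sketch of the semantics (hypothetical helper, not the HF tokenizer API):

```python
def pad_longest(batch_ids, pad_id=0, max_length=1058):
    # Truncate each sequence to max_length, then pad only up to the
    # longest sequence actually present in the batch
    batch_ids = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in batch_ids)
    return [ids + [pad_id] * (longest - len(ids)) for ids in batch_ids]
```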

@filipstrand
Owner

Thanks for the contribution, this looks great! I'll run a few tests on my end and merge this afterwards if everything looks good. I've also seen some strange-looking output for some images but haven't had the time to prioritise this.

Once Qwen 2.0 is openly released, I'll probably retire the older models if the newer one is uniformly better, but it's still very nice to have this fix!

@filipstrand
Owner

filipstrand commented Feb 20, 2026

I ran the image test suite and had to update some reference images (which is expected; all looked "equally as good" as before), but one of our test cases that has multiple input images looks a bit off:

this branch (one input image, as reference):
output_qwen_edit

current main (two input images):
reference_qwen_edit_multiple_images

this branch (two input images):
output_qwen_edit_multiple_images

But it might also be that something else (pre-existing) is actually wrong with this logic, not necessarily related to this PR. This reminds me a bit of #330, where this particular test case also looked a bit off...

Maybe this is just an unlucky choice of test image/prompt/seed (I'll try to iterate a bit and see if this is the case). Regardless, I think your Ronaldo example clearly highlights that something is wrong with the current implementation, but I mostly want to see if we should update our test suite too for Qwen-Image.

Edit/update: Actually, the new reference might not be worse now that I look at it a bit more closely; it mostly feels like the scaling is "off" or stretched out, but it's hard to say this is worse in a strictly objective sense (I have not compared this with diffusers). Maybe the only "objective" thing to note is that with this branch there is now a bigger shift in perspective between test_image_generation_qwen_edit (white t-shirt) and test_image_generation_qwen_edit_multiple_images (striped shirt)...

@filipstrand
Owner

Some more tests

uv run mflux-generate-qwen-edit \
  --prompt "Make the hand fistbump the camera instead of showing a flat palm, and the man should wear this shirt. Maintain the original pose, body position, and overall stance." \
  --image-paths tests/resources/reference_upscaled.png tests/resources/shirt.jpg \
  --width 640 \
  --height 384 \
  --steps 20 \
  --guidance 2.5 \
  --seed 1111 2222 3333 \
  -q 8
Screenshot 2026-02-20 at 17 42 33

I can also reproduce your example above, which is objectively better, so I'm happy to merge this PR. I wrote about this somewhere, but specifically for Qwen-Image, I have some doubts about my current implementation in general. Out of all the models in MFLUX, this one has been the trickiest to debug and work with (also because of its size). It has also responded quite badly to stronger quantisation compared to Flux/Z-image. It was also built back in the summer, when coding tools were substantially worse than they are now.

When the 2.0 model is released, I look forward to doing a fresh implementation with that one and hope it performs better on all fronts so we can deprecate the old one.

@ciaranbor
Contributor Author

The issues are hard to reproduce, only happening for certain seeds and prompts. The changes seem to help more for txt2img than editing.

Out of interest I ran the test using diffusers and this is what I got:

diffusers_output

which is quite different again.

@filipstrand
Owner

Tried running this with the Flux2 test image just to see what it produced

uv run mflux-generate-qwen-edit \
  --prompt "Make the woman wear the eyeglasses (regular glasses, not sunglasses)" \
  --image-paths tests/resources/unsplash_person.jpg tests/resources/glasses.jpg \
  --width 1344 \
  --height 896 \
  --steps 20 \
  --guidance 1.0 \
  --seed 45 46 47 \
  -q 8
Screenshot 2026-02-20 at 19 11 57

Both are honestly pretty bad; there is perhaps more variation in the types of glasses (compared to the reference input glasses) in this PR, but I honestly don't know how much that says.

This is Flux2 Klein in comparison: much more consistent and much faster (same settings but 4 steps instead of 20):

uv run mflux-generate-flux2-edit \
  --model flux2-klein-9b \
  --prompt "Make the woman wear the eyeglasses (regular glasses, not sunglasses)" \
  --image-paths tests/resources/unsplash_person.jpg tests/resources/glasses.jpg \
  --width 1344 \
  --height 896 \
  --steps 4 \
  --guidance 1.0 \
  --seed 45 46 47 \
  -q 8
Screenshot 2026-02-20 at 19 19 45

@filipstrand
Owner

Out of interest I ran the test using diffusers and this is what I got

Oh yes, I remember that "zoomed in" effect, which also seemed very weird coming from diffusers (I remember seeing that same effect on other images too)... That made the whole port even harder since we couldn't necessarily rely on the reference.

@filipstrand
Owner

filipstrand commented Feb 20, 2026

@ciaranbor Out of curiosity, in the regular txt2img case, have you noticed any regressions, i.e. cases where this PR produces "worse" results compared to main (like an opposite of your Ronaldo case)?

@ciaranbor
Contributor Author

No, it's always been comparable to or better than current main, and there haven't been any catastrophic failures like the Ronaldo case on main. It's difficult to be confident about it, though.

Mostly what I did was serialize tensors throughout the diffusion process, compare between diffusers and mflux, and keep iterating until I was confident the remaining divergence was explained by floating-point differences. Since it seems quite possible that there are issues with the diffusers implementation too (or the model is just bad), it's hard to say if it's truly better.

@filipstrand
Owner

Good to hear. Then I feel pretty confident merging this. I'm running the full test suite right now with some updated test images, and if everything looks fine then it should be good to merge.

@filipstrand
Owner

Test suite looks good with the updated references! Thank you @ciaranbor for taking the time to fix this! I will include this in the 16.6 release along with the already merged SeedVR2 7B model.

@filipstrand filipstrand merged commit 7290eb6 into filipstrand:main Feb 20, 2026
1 check passed