
Z-Image Image-to-LoRA (i2L) — Native MLX Implementation #361

Open
azrahello wants to merge 16 commits into filipstrand:main from azrahello:feature/z-image-i2l

Conversation

@azrahello
Contributor

@azrahello commented Feb 17, 2026

This is a native MLX port of Z-Image's Image-to-LoRA (i2L) pipeline for Apple Silicon. It runs three models entirely on-device — SigLIP2 (1.16B params) for style features, DINOv3 (6.72B) for visual features, and an i2L decoder (1.61B) that converts those embeddings into LoRA weights. The whole thing processes 4 reference images in about 2 seconds on an M2 Ultra, producing a style LoRA you can immediately use with Z-Image for generation. Weights are downloaded from HuggingFace and cached locally. I also added support for loading LoRA files generated by DiffSynth-Studio's official Space.
I wanted to try this really interesting little tool (the weights total about 19 GB). I've been testing it and it seems to work very well. It's somewhat reminiscent of Redux for Flux, but much more versatile and powerful.

Z-Image Image-to-LoRA (i2L)

What is i2L?

Image-to-LoRA lets you turn a set of reference images into a style LoRA — no training, no GPU, no config files. You feed it a few photos that share a visual style (illustration, film grain, watercolor, anime, a specific photographer's look…) and it produces a .safetensors LoRA you can immediately use for generation. The whole process takes about 2 seconds on an M2 Ultra.

Think of it as a style extraction tool: instead of training a LoRA for hours, i2L encodes the visual DNA of your reference images into LoRA weights in one forward pass. It's similar in spirit to Redux for Flux, but built specifically for Z-Image and considerably more versatile.

How it works

Two vision encoders analyze your reference images from different angles — SigLIP2 captures high-level style and aesthetics, DINOv3 captures structural and semantic features. Their outputs are concatenated and fed to a decoder that directly outputs LoRA weight matrices, ready to apply to the Z-Image transformer.

reference images → SigLIP2 (style) + DINOv3 (structure) → i2L decoder → LoRA weights

More images = richer style representation. Each image contributes rank 4, so 4 images produce a rank-16 LoRA (~76 MB), 7 images produce rank 28 (~133 MB), and so on.

Usage

```bash
# Extract style from a folder of reference images
mflux-z-image-i2l --image-path ./my_style_images --output style.safetensors

# Generate with that style
mflux-generate-z-image-turbo --prompt "a cat in a garden" --lora-paths style.safetensors

# You can mix directories and individual files
mflux-z-image-i2l --image-path ./photos ./extra/sketch.png
```

What's included

  • Three models ported to MLX: SigLIP2-G384 (1.16B), DINOv3-7B (6.72B), i2L decoder (1.61B)
  • CLI command: mflux-z-image-i2l — accepts directories, files, or a mix
  • DiffSynth-Studio LoRA compatibility: added .default. naming patterns to ZImageLoRAMapping, so LoRA files from the official HF Space also load directly
  • Bug fix: cached text encodings to survive --low-ram text encoder deletion across multiple seeds
  • Performance (M2 Ultra, 64GB):

    | Stage | Time |
    | --- | --- |
    | Model loading (cached) | ~1.6s |
    | Encoding 4 images | ~1.9s |
    | LoRA decoding | ~0.1s |
    | Total | ~3.6s |

    Models are downloaded from HuggingFace on first run (~19 GB total) and cached locally.

    Notes

    • The published checkpoint produces base rank 4 per image. A higher-rank architecture exists in DiffSynth-Studio's codebase but its weights haven't been released.
    • The official examples recommend Z-Image (50 steps, cfg_scale=4) for best i2L results. The LoRA also works with Z-Image Turbo (8 steps, cfg_scale=1), though results may differ.

    Summary

    Port of DiffSynth-Studio's Z-Image i2L pipeline to MLX for Apple Silicon. Encodes style reference images into LoRA weights that can be applied to Z-Image for style transfer during generation — entirely on-device, no GPU server required.

    Usage

    # Generate a style LoRA from reference images
    mflux-z-image-i2l --image-path ./my_style_images --output style.safetensors
    
    # Use it for generation
    mflux-generate-z-image-turbo --prompt "a cat" --lora-paths style.safetensors

    Accepts directories, individual files, or a mix:

    mflux-z-image-i2l --image-path ./dir_a ./dir_b photo.jpg

    Architecture

    Three models ported to MLX:

    | Model | Parameters | Role | Output |
    | --- | --- | --- | --- |
    | SigLIP2-G384 | 1.16B | Style feature extraction | (B, 1536) |
    | DINOv3-7B | 6.72B | Visual feature extraction | (B, 4096) |
    | i2L Decoder | 1.61B | Embedding → LoRA weights | 476 weight tensors |

    Pipeline flow:

    images → SigLIP2 (384px) ──→ style emb (1536d) ──┐
                                                     ├→ concat (5632d) → i2L decoder → LoRA → .safetensors
    images → DINOv3 (224px) ──→ visual emb (4096d) ──┘
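The concatenation step in this flow can be sketched as follows; numpy stands in for MLX here, and all variable names are illustrative, assuming the embedding shapes from the table above.

```python
import numpy as np

# Sketch of the embedding concatenation feeding the i2L decoder.
# Shapes follow the table above; numpy stands in for MLX.
B = 4                             # batch of 4 reference images
style_emb = np.zeros((B, 1536))   # SigLIP2 pooled output
visual_emb = np.zeros((B, 4096))  # DINOv3 pooled output

combined = np.concatenate([style_emb, visual_emb], axis=-1)
assert combined.shape == (B, 5632)  # input to the i2L decoder
```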
    

    Multi-image merge: Following DiffSynth-Studio's strategy, multiple images are merged by concatenation (not averaging), increasing the effective LoRA rank. Each image contributes rank 4, so N images produce rank 4·N.

    | Images | Rank | File size |
    | --- | --- | --- |
    | 1 | 4 | ~19 MB |
    | 4 | 16 | ~76 MB |
    | 7 | 28 | ~133 MB |
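The concatenation merge described above can be sketched like this; numpy stands in for MLX, and the function and shapes are illustrative rather than the actual mflux implementation. Each per-image pair contributes rank 4, with lora_A stacked along the rank axis (axis 0) and scaled by 1/N, and lora_B stacked along axis 1.

```python
import numpy as np

# Sketch of the concatenation merge (not averaging), following the
# merge_lora() behavior described above; names are illustrative.
def merge_loras(pairs):
    n = len(pairs)
    # lora_A: (rank, in_dim), stacked along axis 0 and scaled by 1/N
    merged_a = np.concatenate([a for a, _ in pairs], axis=0) / n
    # lora_B: (out_dim, rank), stacked along axis 1
    merged_b = np.concatenate([b for _, b in pairs], axis=1)
    return merged_a, merged_b

# Four rank-4 pairs merge into one rank-16 pair.
pairs = [(np.ones((4, 64)), np.ones((64, 4))) for _ in range(4)]
a, b = merge_loras(pairs)
assert a.shape == (16, 64) and b.shape == (64, 16)
```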

    Performance (M2 Ultra, 64GB)

    | Stage | Time |
    | --- | --- |
    | Model loading (cached) | ~1.6s |
    | Encoding (4 images) | ~1.9s |
    | LoRA decoding | ~0.1s |
    | Total | ~3.6s |

    Models Implemented

    SigLIP2-G384 (8 files)

    • Vision transformer: 40 layers, hidden=1536, 16 heads, patch_size=16
    • Multi-head attention pooling head with learnable probe
    • Weights from DiffSynth-Studio/General-Image-Encoders

    DINOv3-7B (8 files)

    • Vision transformer: 40 layers, hidden=4096, 32 heads
    • 2D RoPE, Gated MLP (SiLU), 4 register tokens, LayerScale
    • Weights from DiffSynth-Studio/General-Image-Encoders

    i2L Decoder (1 file)

    • MLP network: embedding (5632d) → compressed MLPs → LoRA A/B weight pairs
    • 34 transformer blocks × 7 LoRA targets = 238 weight pairs (476 tensors)
    • Weights from DiffSynth-Studio/Z-Image-i2L
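The tensor count quoted above follows directly from the block and target counts:

```python
# Sanity check of the figures above: 34 transformer blocks x 7 LoRA
# targets, with two tensors (A and B) per weight pair.
blocks, targets = 34, 7
pairs = blocks * targets   # 238 LoRA A/B weight pairs
tensors = pairs * 2        # 476 tensors
assert (pairs, tensors) == (238, 476)
```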

    Weight Loading

    Custom standalone loader handles three HuggingFace repos:

    • SigLIP2: Splits fused in_proj_weight into Q/K/V projections, transposes Conv2d
    • DINOv3: Renames lambda1 → gamma for LayerScale, transposes Conv2d
    • i2L Decoder: Direct 1:1 name mapping

    All weights downloaded via hf_hub_download and cached in ~/.cache/huggingface/hub.
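The SigLIP2 QKV split can be illustrated as follows, assuming the usual PyTorch layout where the fused in_proj_weight stacks Q, K, and V along axis 0; numpy stands in for MLX and the shapes are illustrative.

```python
import numpy as np

# Sketch of splitting a fused attention projection into Q/K/V, as the
# SigLIP2 loader is described to do above. PyTorch's in_proj_weight
# stacks the three projections along axis 0; hidden size from the specs.
hidden = 1536
in_proj_weight = np.zeros((3 * hidden, hidden))  # fused QKV weight

q_w, k_w, v_w = np.split(in_proj_weight, 3, axis=0)
assert q_w.shape == k_w.shape == v_w.shape == (hidden, hidden)
```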

    LoRA Compatibility

    Generated LoRA files use the standard mflux naming convention:

    layers.{N}.attention.to_q.lora_A.weight
    layers.{N}.attention.to_q.lora_B.weight
    layers.{N}.feed_forward.w1.lora_A.weight
    ...
    

    Also added .default. pattern variants to ZImageLoRAMapping so LoRA files from DiffSynth-Studio's Space (which use lora_A.default.weight) can be loaded directly by mflux without renaming.
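The key-name difference can be handled by a simple string rewrite; the real mapping lives in ZImageLoRAMapping, so this helper is only an illustration of the normalization, not the mflux API.

```python
# Sketch of normalizing DiffSynth-Studio key names to the mflux
# convention; the actual logic lives in ZImageLoRAMapping.
def normalize_key(key: str) -> str:
    return (key
            .replace(".lora_A.default.weight", ".lora_A.weight")
            .replace(".lora_B.default.weight", ".lora_B.weight"))

assert normalize_key("layers.0.attention.to_q.lora_A.default.weight") == \
       "layers.0.attention.to_q.lora_A.weight"
```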

    Files Changed

    27 files changed, 1423 insertions(+)
    

    New files

    | Path | Description |
    | --- | --- |
    | src/mflux/models/z_image/cli/z_image_i2l.py | CLI entrypoint |
    | src/.../z_image_i2l/siglip2/ (7 files) | SigLIP2-G384 encoder |
    | src/.../z_image_i2l/dinov3/ (7 files) | DINOv3-7B encoder |
    | src/.../z_image_i2l/i2l_decoder/i2l_decoder.py | i2L decoder model |
    | src/.../z_image_i2l/i2l_pipeline.py | Pipeline orchestration |
    | src/.../z_image_i2l/i2l_weight_loader.py | Weight downloading and transformation |
    | src/.../z_image_i2l/i2l_image_preprocessor.py | Image preprocessing |

    Modified files

    | Path | Description |
    | --- | --- |
    | pyproject.toml | Added mflux-z-image-i2l entry point |
    | src/.../z_image_lora_mapping.py | Added .default. pattern variants |

    Test files

    | Path | Description |
    | --- | --- |
    | test_i2l_weight_loading.py | Unit test: weight loading + forward pass for all 3 models |
    | test_i2l_e2e.py | End-to-end test: sample images → LoRA file |

    Notes

    • The published DiffSynth-Studio/Z-Image-i2L checkpoint produces base rank 4 LoRA. A higher-rank model (ZImageImage2LoRAModelCompressed, rank 32) exists in the codebase but has not been published as a checkpoint.
    • The official DiffSynth-Studio examples use Z-Image (50 steps, cfg_scale=4) rather than Z-Image Turbo for i2L generation, with positive_only_lora (not yet implemented in mflux). However, the LoRA also works with Z-Image Turbo.

The i2L decoder now generates keys like:
  layers.0.attention.to_q.lora_A.weight
instead of:
  layers.0.attention.to_q.lora_A.default.weight

This matches the patterns in ZImageLoRAMapping and ensures the
generated LoRA files are directly loadable by mflux.
Accepts files, directories, or a mix:
  mflux-z-image-i2l --image-path ./style_images/
  mflux-z-image-i2l --image-path img1.jpg img2.jpg
  mflux-z-image-i2l --image-path ./dir/ extra.png

Scans directories for jpg/png/webp/bmp/tiff files, sorted by name.
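The path collection described above can be sketched as follows; the helper name is illustrative, not the actual mflux function.

```python
from pathlib import Path

# Sketch of the CLI's path collection: directories are scanned for
# common image extensions and sorted by name; plain files pass through.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp", ".bmp", ".tiff"}

def collect_images(paths):
    files = []
    for p in map(Path, paths):
        if p.is_dir():
            files.extend(sorted(f for f in p.iterdir()
                                if f.suffix.lower() in IMAGE_EXTS))
        else:
            files.append(p)
    return files
```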
The i2L model produces rank-4 LoRA (19MB). The --lora-scale flag
multiplies all weights before saving, allowing style intensity
control without changing the model architecture.

Note: the published DiffSynth-Studio/Z-Image-i2L checkpoint
produces rank-4 LoRA. Rank-32 requires ZImageImage2LoRAModelCompressed
which has not been published as a checkpoint.
Instead of averaging LoRA weights, concatenate them:
- lora_A: concat along axis 0 (rank grows: 4*N for N images)
- lora_B: concat along axis 1
- lora_A scaled by alpha=1/N

4 images now produce rank-16 LoRA (~76MB) instead of rank-4 (~19MB).
This matches the official DiffSynth-Studio merge_lora() behavior.
… mapping

Adds .lora_A.default.weight / .lora_B.default.weight pattern variants
so LoRA files generated by DiffSynth-Studio Space or i2L can be loaded
directly by mflux without renaming.
These are unnecessary:
- --lora-scale: already available via --lora-scales at generation time
- --rank: determined naturally by number of images (rank = 4 * N)

CLI is now minimal: --image-path and --output only.
terribilissimo pushed a commit to terribilissimo/mflux that referenced this pull request Feb 18, 2026
…rand#361

Adds mflux-z-image-i2l CLI command that generates LoRA adapters from
reference images using SigLIP2 + DINOv3 + i2L decoder, entirely on-device.
Also adds .default. LoRA naming support to ZImageLoRAMapping.
@filipstrand
Owner

Interesting! I remember reading a short tweet about this thinking I would eventually look into it (still haven't), but I honestly haven't seen much buzz about it since. Very encouraging to hear it works well for your use cases, I will try it out

@filipstrand
Owner

@azrahello Tried this out briefly on a couple of paintings with a distinct style yesterday but did not get any good results at all (I expected the style to transfer at least somewhat but I basically got more or less the same as with no LoRA). I was impressed by how fast it produces the LoRA itself - basically instant - but not with the quality. I feel like maybe I'm using it wrong.

I only ran these 2 commands:

# Extract style from a folder of reference images
mflux-z-image-i2l --image-path ./my_style_images --output style.safetensors

# Generate with that style
mflux-generate-z-image-turbo --prompt "a cat in a garden" --lora-paths style.safetensors

I'll give it some more tries over the weekend, but is there anything else to keep in mind? For example, is there a LoRA trigger word one could add, perhaps? As is typical, it might be very dependent on the dataset...

@azrahello
Contributor Author

azrahello commented Feb 20, 2026

The procedure is correct and so is the command. It gave me "inconsistent" results too: in some cases it works well, in others it's as if it weren't there at all. I also had the feeling the styles were somehow "diluted". I'm used to using very low LoRA values, which lets me stack many of them while maintaining quality until I eventually reach a consistent result.

Before testing it in a more personal way, I ran a test with the images from their guide, and the style was indeed transferred. I'm attaching the file: test_i2l_styled

The things that seem to matter:

- Resolution: it prefers 1024x1024. With other resolutions it generates lower-quality LoRAs, because a centering and relative crop happens automatically.
- The dataset needs to be extremely coherent, otherwise I don't think it understands anything.
- No, there are no trigger words or tokens. But I was just thinking: is it possible to extract them from one of the steps? I don't know how SigLIP and DINOv3 work internally, or whether it's possible to extract what they tokenize from their "internals". It would be interesting to intercept them.

I also wanted to do a comparison with Flux Redux.

Edit: the guide suggests up to 4-5 images. I pushed it up to 11 (the LoRA grows in size accordingly). I was then doing other tests on DT, and there too the LoRAs work in some way, but the moment I add several things I start getting strange results. Also, what I don't like is the size: 19 gigabytes. If it doesn't work properly, it's just dead weight. Maybe it could be the starting point for a sort of mflux-tools :P
