
Z-Image Image-to-LoRA (i2L) — Native MLX Implementation #361

Open
azrahello wants to merge 16 commits into filipstrand:main from azrahello:feature/z-image-i2l

Conversation

@azrahello
Contributor

@azrahello commented Feb 17, 2026

This is a native MLX port of Z-Image's Image-to-LoRA (i2L) pipeline for Apple Silicon. It runs three models entirely on-device — SigLIP2 (1.16B params) for style features, DINOv3 (6.72B) for visual features, and an i2L decoder (1.61B) that converts those embeddings into LoRA weights. The whole thing processes 4 reference images in about 2 seconds on an M2 Ultra, producing a style LoRA you can immediately use with Z-Image for generation. Weights are downloaded from HuggingFace and cached locally. I also added support for loading LoRA files generated by DiffSynth-Studio's official Space.
I wanted to try this really interesting little tool (the weights total about 19 GB). I've been testing it and it seems to work very well. It's somewhat reminiscent of Redux for Flux, but much more versatile and powerful.

Z-Image Image-to-LoRA (i2L)

What is i2L?

Image-to-LoRA lets you turn a set of reference images into a style LoRA — no training, no GPU, no config files. You feed it a few photos that share a visual style (illustration, film grain, watercolor, anime, a specific photographer's look…) and it produces a .safetensors LoRA you can immediately use for generation. The whole process takes about 2 seconds on an M2 Ultra.

Think of it as a style extraction tool: instead of training a LoRA for hours, i2L encodes the visual DNA of your reference images into LoRA weights in one forward pass. It's similar in spirit to Redux for Flux, but built specifically for Z-Image and considerably more versatile.

How it works

Two vision encoders analyze your reference images from different angles — SigLIP2 captures high-level style and aesthetics, DINOv3 captures structural and semantic features. Their outputs are concatenated and fed to a decoder that directly outputs LoRA weight matrices, ready to apply to the Z-Image transformer.

reference images → SigLIP2 (style) + DINOv3 (structure) → i2L decoder → LoRA weights

More images = richer style representation. Each image contributes rank 4, so 4 images produce a rank-16 LoRA (~76 MB), 7 images produce rank 28 (~133 MB), and so on.

Usage

```bash
# Extract style from a folder of reference images
mflux-z-image-i2l --image-path ./my_style_images --output style.safetensors

# Generate with that style
mflux-generate-z-image-turbo --prompt "a cat in a garden" --lora-paths style.safetensors

# You can mix directories and individual files
mflux-z-image-i2l --image-path ./photos ./extra/sketch.png
```

What's included

  • Three models ported to MLX: SigLIP2-G384 (1.16B), DINOv3-7B (6.72B), i2L decoder (1.61B)
  • CLI command: mflux-z-image-i2l — accepts directories, files, or a mix
  • DiffSynth-Studio LoRA compatibility: added .default. naming patterns to ZImageLoRAMapping, so LoRA files from the official HF Space also load directly
  • Bug fix: cached text encodings to survive --low-ram text encoder deletion across multiple seeds
  • Performance (M2 Ultra, 64GB):

    | Stage | Time |
    | --- | --- |
    | Model loading (cached) | ~1.6s |
    | Encoding 4 images | ~1.9s |
    | LoRA decoding | ~0.1s |
    | Total | ~3.6s |

    Models are downloaded from HuggingFace on first run (~19 GB total) and cached locally.

    Notes

    • The published checkpoint produces base rank 4 per image. A higher-rank architecture exists in DiffSynth-Studio's codebase but its weights haven't been released.
    • The official examples recommend Z-Image (50 steps, cfg_scale=4) for best i2L results. The LoRA also works with Z-Image Turbo (8 steps, cfg_scale=1), though results may differ.

    Summary

    Port of DiffSynth-Studio's Z-Image i2L pipeline to MLX for Apple Silicon. Encodes style reference images into LoRA weights that can be applied to Z-Image for style transfer during generation — entirely on-device, no GPU server required.

    Usage

    # Generate a style LoRA from reference images
    mflux-z-image-i2l --image-path ./my_style_images --output style.safetensors
    
    # Use it for generation
    mflux-generate-z-image-turbo --prompt "a cat" --lora-paths style.safetensors

    Accepts directories, individual files, or a mix:

    mflux-z-image-i2l --image-path ./dir_a ./dir_b photo.jpg

    Architecture

    Three models ported to MLX:

    | Model | Parameters | Role | Output |
    | --- | --- | --- | --- |
    | SigLIP2-G384 | 1.16B | Style feature extraction | (B, 1536) |
    | DINOv3-7B | 6.72B | Visual feature extraction | (B, 4096) |
    | i2L Decoder | 1.61B | Embedding → LoRA weights | 476 weight tensors |

    Pipeline flow:

    images → SigLIP2 (384px) ──→ style emb (1536d) ──┐
                                                     ├→ concat (5632d) → i2L decoder → LoRA → .safetensors
    images → DINOv3 (224px) ──→ visual emb (4096d) ──┘
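The concatenation step in this flow can be sketched as follows; numpy stands in for MLX here, and all variable names are illustrative, assuming the embedding shapes from the table above.

```python
import numpy as np

# Sketch of the embedding concatenation feeding the i2L decoder.
# Shapes follow the table above; numpy stands in for MLX.
B = 4                             # batch of 4 reference images
style_emb = np.zeros((B, 1536))   # SigLIP2 pooled output
visual_emb = np.zeros((B, 4096))  # DINOv3 pooled output

combined = np.concatenate([style_emb, visual_emb], axis=-1)
assert combined.shape == (B, 5632)  # input to the i2L decoder
```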
    

    Multi-image merge: Following DiffSynth-Studio's strategy, multiple images are merged by concatenation (not averaging), increasing the effective LoRA rank. Each image contributes rank 4, so N images produce rank 4·N.

    | Images | Rank | File size |
    | --- | --- | --- |
    | 1 | 4 | ~19 MB |
    | 4 | 16 | ~76 MB |
    | 7 | 28 | ~133 MB |
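The concatenation merge described above can be sketched like this; numpy stands in for MLX, and the function and shapes are illustrative rather than the actual mflux implementation. Each per-image pair contributes rank 4, with lora_A stacked along the rank axis (axis 0) and scaled by 1/N, and lora_B stacked along axis 1.

```python
import numpy as np

# Sketch of the concatenation merge (not averaging), following the
# merge_lora() behavior described above; names are illustrative.
def merge_loras(pairs):
    n = len(pairs)
    # lora_A: (rank, in_dim), stacked along axis 0 and scaled by 1/N
    merged_a = np.concatenate([a for a, _ in pairs], axis=0) / n
    # lora_B: (out_dim, rank), stacked along axis 1
    merged_b = np.concatenate([b for _, b in pairs], axis=1)
    return merged_a, merged_b

# Four rank-4 pairs merge into one rank-16 pair.
pairs = [(np.ones((4, 64)), np.ones((64, 4))) for _ in range(4)]
a, b = merge_loras(pairs)
assert a.shape == (16, 64) and b.shape == (64, 16)
```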

    Performance (M2 Ultra, 64GB)

    | Stage | Time |
    | --- | --- |
    | Model loading (cached) | ~1.6s |
    | Encoding (4 images) | ~1.9s |
    | LoRA decoding | ~0.1s |
    | Total | ~3.6s |

    Models Implemented

    SigLIP2-G384 (8 files)

    • Vision transformer: 40 layers, hidden=1536, 16 heads, patch_size=16
    • Multi-head attention pooling head with learnable probe
    • Weights from DiffSynth-Studio/General-Image-Encoders

    DINOv3-7B (8 files)

    • Vision transformer: 40 layers, hidden=4096, 32 heads
    • 2D RoPE, Gated MLP (SiLU), 4 register tokens, LayerScale
    • Weights from DiffSynth-Studio/General-Image-Encoders

    i2L Decoder (1 file)

    • MLP network: embedding (5632d) → compressed MLPs → LoRA A/B weight pairs
    • 34 transformer blocks × 7 LoRA targets = 238 weight pairs (476 tensors)
    • Weights from DiffSynth-Studio/Z-Image-i2L
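The tensor count quoted above follows directly from the block and target counts:

```python
# Sanity check of the figures above: 34 transformer blocks x 7 LoRA
# targets, with two tensors (A and B) per weight pair.
blocks, targets = 34, 7
pairs = blocks * targets   # 238 LoRA A/B weight pairs
tensors = pairs * 2        # 476 tensors
assert (pairs, tensors) == (238, 476)
```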

    Weight Loading

    Custom standalone loader handles three HuggingFace repos:

    • SigLIP2: Splits fused in_proj_weight into Q/K/V projections, transposes Conv2d
    • DINOv3: Renames lambda1 → gamma for LayerScale, transposes Conv2d
    • i2L Decoder: Direct 1:1 name mapping

    All weights downloaded via hf_hub_download and cached in ~/.cache/huggingface/hub.
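The SigLIP2 QKV split can be illustrated as follows, assuming the usual PyTorch layout where the fused in_proj_weight stacks Q, K, and V along axis 0; numpy stands in for MLX and the shapes are illustrative.

```python
import numpy as np

# Sketch of splitting a fused attention projection into Q/K/V, as the
# SigLIP2 loader is described to do above. PyTorch's in_proj_weight
# stacks the three projections along axis 0; hidden size from the specs.
hidden = 1536
in_proj_weight = np.zeros((3 * hidden, hidden))  # fused QKV weight

q_w, k_w, v_w = np.split(in_proj_weight, 3, axis=0)
assert q_w.shape == k_w.shape == v_w.shape == (hidden, hidden)
```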

    LoRA Compatibility

    Generated LoRA files use the standard mflux naming convention:

    layers.{N}.attention.to_q.lora_A.weight
    layers.{N}.attention.to_q.lora_B.weight
    layers.{N}.feed_forward.w1.lora_A.weight
    ...
    

    Also added .default. pattern variants to ZImageLoRAMapping so LoRA files from DiffSynth-Studio's Space (which use lora_A.default.weight) can be loaded directly by mflux without renaming.
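The key-name difference can be handled by a simple string rewrite; the real mapping lives in ZImageLoRAMapping, so this helper is only an illustration of the normalization, not the mflux API.

```python
# Sketch of normalizing DiffSynth-Studio key names to the mflux
# convention; the actual logic lives in ZImageLoRAMapping.
def normalize_key(key: str) -> str:
    return (key
            .replace(".lora_A.default.weight", ".lora_A.weight")
            .replace(".lora_B.default.weight", ".lora_B.weight"))

assert normalize_key("layers.0.attention.to_q.lora_A.default.weight") == \
       "layers.0.attention.to_q.lora_A.weight"
```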

    Files Changed

    27 files changed, 1423 insertions(+)
    

    New files

    | Path | Description |
    | --- | --- |
    | src/mflux/models/z_image/cli/z_image_i2l.py | CLI entrypoint |
    | src/.../z_image_i2l/siglip2/ (7 files) | SigLIP2-G384 encoder |
    | src/.../z_image_i2l/dinov3/ (7 files) | DINOv3-7B encoder |
    | src/.../z_image_i2l/i2l_decoder/i2l_decoder.py | i2L decoder model |
    | src/.../z_image_i2l/i2l_pipeline.py | Pipeline orchestration |
    | src/.../z_image_i2l/i2l_weight_loader.py | Weight downloading and transformation |
    | src/.../z_image_i2l/i2l_image_preprocessor.py | Image preprocessing |

    Modified files

    | Path | Description |
    | --- | --- |
    | pyproject.toml | Added mflux-z-image-i2l entry point |
    | src/.../z_image_lora_mapping.py | Added .default. pattern variants |

    Test files

    | Path | Description |
    | --- | --- |
    | test_i2l_weight_loading.py | Unit test: weight loading + forward pass for all 3 models |
    | test_i2l_e2e.py | End-to-end test: sample images → LoRA file |

    Notes

    • The published DiffSynth-Studio/Z-Image-i2L checkpoint produces base rank 4 LoRA. A higher-rank model (ZImageImage2LoRAModelCompressed, rank 32) exists in the codebase but has not been published as a checkpoint.
    • The official DiffSynth-Studio examples use Z-Image (50 steps, cfg_scale=4) rather than Z-Image Turbo for i2L generation, with positive_only_lora (not yet implemented in mflux). However, the LoRA also works with Z-Image Turbo.

The i2L decoder now generates keys like:
  layers.0.attention.to_q.lora_A.weight
instead of:
  layers.0.attention.to_q.lora_A.default.weight

This matches the patterns in ZImageLoRAMapping and ensures the
generated LoRA files are directly loadable by mflux.
Accepts files, directories, or a mix:
  mflux-z-image-i2l --image-path ./style_images/
  mflux-z-image-i2l --image-path img1.jpg img2.jpg
  mflux-z-image-i2l --image-path ./dir/ extra.png

Scans directories for jpg/png/webp/bmp/tiff files, sorted by name.
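The path collection described above can be sketched as follows; the helper name is illustrative, not the actual mflux function.

```python
from pathlib import Path

# Sketch of the CLI's path collection: directories are scanned for
# common image extensions and sorted by name; plain files pass through.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp", ".bmp", ".tiff"}

def collect_images(paths):
    files = []
    for p in map(Path, paths):
        if p.is_dir():
            files.extend(sorted(f for f in p.iterdir()
                                if f.suffix.lower() in IMAGE_EXTS))
        else:
            files.append(p)
    return files
```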
The i2L model produces rank-4 LoRA (19MB). The --lora-scale flag
multiplies all weights before saving, allowing style intensity
control without changing the model architecture.

Note: the published DiffSynth-Studio/Z-Image-i2L checkpoint
produces rank-4 LoRA. Rank-32 requires ZImageImage2LoRAModelCompressed
which has not been published as a checkpoint.
Instead of averaging LoRA weights, concatenate them:
- lora_A: concat along axis 0 (rank grows: 4*N for N images)
- lora_B: concat along axis 1
- lora_A scaled by alpha=1/N

4 images now produce rank-16 LoRA (~76MB) instead of rank-4 (~19MB).
This matches the official DiffSynth-Studio merge_lora() behavior.
… mapping

Adds .lora_A.default.weight / .lora_B.default.weight pattern variants
so LoRA files generated by DiffSynth-Studio Space or i2L can be loaded
directly by mflux without renaming.
These are unnecessary:
- --lora-scale: already available via --lora-scales at generation time
- --rank: determined naturally by number of images (rank = 4 * N)

CLI is now minimal: --image-path and --output only.
terribilissimo pushed a commit to terribilissimo/mflux that referenced this pull request Feb 18, 2026
…rand#361

Adds mflux-z-image-i2l CLI command that generates LoRA adapters from
reference images using SigLIP2 + DINOv3 + i2L decoder, entirely on-device.
Also adds .default. LoRA naming support to ZImageLoRAMapping.
@filipstrand
Owner

Interesting! I remember reading a short tweet about this thinking I would eventually look into it (still haven't), but I honestly haven't seen much buzz about it since. Very encouraging to hear it works well for your use cases, I will try it out

@filipstrand
Owner

@azrahello Tried this out briefly on a couple of paintings with a distinct style yesterday but did not get any good results at all (I expected the style to transfer at least somewhat but I basically got more or less the same as with no LoRA). I was impressed by how fast it produces the LoRA itself - basically instant - but not with the quality. I feel like maybe I'm using it wrong.

I only ran these 2 commands:

# Extract style from a folder of reference images
mflux-z-image-i2l --image-path ./my_style_images --output style.safetensors

# Generate with that style
mflux-generate-z-image-turbo --prompt "a cat in a garden" --lora-paths style.safetensors

I'll give it some more tries over the weekend, but is there anything else to keep in mind? For example, is there a LoRA trigger word one could add, perhaps? As is typical, it might be very dependent on the dataset...

@azrahello
Contributor Author

azrahello commented Feb 20, 2026

The procedure is correct and so is the command. It gave me "inconsistent" results too: in some cases it works well, in others it's as if it weren't there at all. I also had the feeling the styles were somehow "diluted". I'm used to using very low LoRA values, which lets me stack many of them while maintaining quality until I eventually reach a consistent result.

Before testing it in a more personal way, I ran a test with the images from their guide, and the style was indeed transferred. I'm attaching the file: test_i2l_styled

The things that seem to matter:

- Resolution: it prefers 1024x1024. With other resolutions it generates lower-quality LoRAs, because a centering and relative crop happens automatically.
- The dataset needs to be extremely coherent, otherwise I don't think it understands anything.
- No, there are no trigger words or tokens. But I was just thinking: is it possible to extract them from one of the steps? I don't know how SigLIP and DINOv3 work internally, or whether it's possible to extract what they tokenize from their "internals". It would be interesting to intercept them.

I also wanted to do a comparison with Flux Redux.

Edit: the guide suggests up to 4-5 images. I pushed it up to 11 (the LoRA grows in size accordingly). I was then doing other tests on DT, and there too the LoRAs work in some way, but the moment I add several things I start getting strange results. Also, what I don't like is the size: 19 gigabytes. If it doesn't work properly, it's just dead weight. Maybe it could be the starting point for a sort of mflux-tools :P
