Z-Image Image-to-LoRA (i2L) — Native MLX Implementation #361

azrahello wants to merge 16 commits into filipstrand:main from Feature/z image i2l
Conversation
…in] format as PyTorch
The i2L decoder now generates keys like layers.0.attention.to_q.lora_A.weight instead of layers.0.attention.to_q.lora_A.default.weight. This matches the patterns in ZImageLoRAMapping and ensures the generated LoRA files are directly loadable by mflux.
Accepts files, directories, or a mix:

mflux-z-image-i2l --image-path ./style_images/
mflux-z-image-i2l --image-path img1.jpg img2.jpg
mflux-z-image-i2l --image-path ./dir/ extra.png

Scans directories for jpg/png/webp/bmp/tiff files, sorted by name.
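A minimal sketch of that input-collection behavior (the helper name and code are illustrative, not the actual CLI implementation):

```python
from pathlib import Path

# Extensions scanned when a directory is given, per the CLI description above
IMAGE_EXTENSIONS = {".jpg", ".png", ".webp", ".bmp", ".tiff"}

def collect_image_paths(inputs: list[str]) -> list[Path]:
    """Expand a mix of files and directories into a flat list of image paths.

    Directories are scanned for supported image types and sorted by name;
    explicit file arguments are passed through unchanged.
    """
    paths: list[Path] = []
    for item in inputs:
        p = Path(item)
        if p.is_dir():
            paths.extend(sorted(f for f in p.iterdir() if f.suffix.lower() in IMAGE_EXTENSIONS))
        else:
            paths.append(p)
    return paths
```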
The i2L model produces a rank-4 LoRA (~19MB). The --lora-scale flag multiplies all weights before saving, allowing style-intensity control without changing the model architecture. Note: the published DiffSynth-Studio/Z-Image-i2L checkpoint produces rank-4 LoRA; rank 32 requires ZImageImage2LoRAModelCompressed, which has not been published as a checkpoint.
Instead of averaging LoRA weights, concatenate them:
- lora_A: concat along axis 0 (rank grows: 4*N for N images)
- lora_B: concat along axis 1
- lora_A scaled by alpha = 1/N

4 images now produce a rank-16 LoRA (~76MB) instead of rank-4 (~19MB). This matches the official DiffSynth-Studio merge_lora() behavior.
Adds .lora_A.default.weight / .lora_B.default.weight pattern variants so LoRA files generated by the DiffSynth-Studio Space or i2L can be loaded directly by mflux without renaming.
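Equivalently, the renaming that these pattern variants make unnecessary can be expressed as a small sketch (the helper name is hypothetical; mflux matches both patterns in ZImageLoRAMapping rather than rewriting keys):

```python
def normalize_lora_key(key: str) -> str:
    """Map DiffSynth-style keys (…lora_A.default.weight) onto the mflux
    convention (…lora_A.weight) by dropping the '.default' segment."""
    return (
        key.replace(".lora_A.default.weight", ".lora_A.weight")
           .replace(".lora_B.default.weight", ".lora_B.weight")
    )
```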
These are unnecessary:
- --lora-scale: already available via --lora-scales at generation time
- --rank: determined naturally by the number of images (rank = 4 * N)

The CLI is now minimal: --image-path and --output only.
Adds a mflux-z-image-i2l CLI command that generates LoRA adapters from reference images using SigLIP2 + DINOv3 + the i2L decoder, entirely on-device. Also adds .default. LoRA naming support to ZImageLoRAMapping.
Interesting! I remember reading a short tweet about this, thinking I would eventually look into it (still haven't), but I honestly haven't seen much buzz about it since. Very encouraging to hear it works well for your use cases, I will try it out.
@azrahello Tried this out briefly on a couple of paintings with a distinct style yesterday but did not get any good results at all (I expected the style to transfer at least somewhat, but I basically got more or less the same as with no LoRA). I was impressed by how fast it produces the LoRA itself - basically instant - but not with the quality. I feel like maybe I'm using it wrong. I only ran these 2 commands: I'll give it some more tries over the weekend, but is there anything else to keep in mind? For example, is there a LoRA trigger word one can add perhaps? As is typical, it might be very dependent on the dataset...

This is a native MLX port of Z-Image's Image-to-LoRA (i2L) pipeline for Apple Silicon. It runs three models entirely on-device — SigLIP2 (1.16B params) for style features, DINOv3 (6.72B) for visual features, and an i2L decoder (1.61B) that converts those embeddings into LoRA weights. The whole thing processes 4 reference images in about 2 seconds on an M2 Ultra, producing a style LoRA you can immediately use with Z-Image for generation. Weights are downloaded from HuggingFace and cached locally. I also added support for loading LoRA files generated by DiffSynth-Studio's official Space.
I wanted to try this really interesting little tool (~19 GB of weights). I've been testing it and it seems to work very well. It's somewhat reminiscent of Redux for Flux, but much more versatile and powerful.
Z-Image Image-to-LoRA (i2L)
What is i2L?
Image-to-LoRA lets you turn a set of reference images into a style LoRA — no training, no GPU, no config files. You feed it a few photos that share a visual style (illustration, film grain, watercolor, anime, a specific photographer's look…) and it produces a .safetensors LoRA you can immediately use for generation. The whole process takes about 2 seconds on an M2 Ultra.

Think of it as a style extraction tool: instead of training a LoRA for hours, i2L encodes the visual DNA of your reference images into LoRA weights in one forward pass. It's similar in spirit to Redux for Flux, but built specifically for Z-Image and considerably more versatile.
How it works
Two vision encoders analyze your reference images from different angles — SigLIP2 captures high-level style and aesthetics, DINOv3 captures structural and semantic features. Their outputs are concatenated and fed to a decoder that directly outputs LoRA weight matrices, ready to apply to the Z-Image transformer.
More images = richer style representation. Each image contributes rank 4, so 4 images produce a rank-16 LoRA (~76 MB), 7 images produce rank 28 (~133 MB), and so on.
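The scaling above is simple arithmetic; a tiny illustrative helper (the function name is hypothetical, and the ~19 MB-per-image figure is extrapolated from the file sizes quoted above):

```python
def lora_rank_and_size(num_images: int) -> tuple[int, float]:
    """Each reference image contributes rank 4 (~19 MB of weights),
    so N images yield a rank-4*N LoRA of roughly 19*N MB."""
    return 4 * num_images, 19.0 * num_images

# 4 images -> rank 16, ~76 MB; 7 images -> rank 28, ~133 MB
```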
Pipeline flow:

```
reference images → SigLIP2 (style) + DINOv3 (structure) → i2L decoder → LoRA weights
```

Usage

```bash
# Extract style from a folder of reference images
mflux-z-image-i2l --image-path ./my_style_images --output style.safetensors

# Generate with that style
mflux-generate-z-image-turbo --prompt "a cat in a garden" --lora-paths style.safetensors

# You can mix directories and individual files
mflux-z-image-i2l --image-path ./photos ./extra/sketch.png
```
What's included

- Three models ported to MLX: SigLIP2-G384 (1.16B), DINOv3-7B (6.72B), i2L decoder (1.61B)
- CLI command: mflux-z-image-i2l — accepts directories, files, or a mix
- DiffSynth-Studio LoRA compatibility: added .default. naming patterns to ZImageLoRAMapping, so LoRA files from the official HF Space also load directly
- Bug fix: text encodings are cached so they survive --low-ram text encoder deletion across multiple seeds
Performance (M2 Ultra, 64GB)
| Stage | Time |
| --- | --- |
| Model loading (cached) | 1.6s |
| Encoding 4 images | 1.9s |
| LoRA decoding | 0.1s |
| Total | 3.6s |

Models are downloaded from HuggingFace on first run (~19 GB total) and cached locally.
Notes
- The published checkpoint produces base rank 4 per image. A higher-rank architecture exists in DiffSynth-Studio's codebase, but its weights haven't been released.
- The official examples recommend Z-Image (50 steps, cfg_scale=4) for best i2L results. The LoRA also works with Z-Image Turbo (8 steps, cfg_scale=1), though results may differ.
Summary
Port of DiffSynth-Studio's Z-Image i2L pipeline to MLX for Apple Silicon. Encodes style reference images into LoRA weights that can be applied to Z-Image for style transfer during generation — entirely on-device, no GPU server required.
Usage
Accepts directories, individual files, or a mix:

mflux-z-image-i2l --image-path ./my_style_images --output style.safetensors
mflux-z-image-i2l --image-path ./photos ./extra/sketch.png
Architecture
Three models ported to MLX: SigLIP2-G384 (1.16B), DINOv3-7B (6.72B), and the i2L decoder (1.61B).

Pipeline flow:

reference images → SigLIP2 (style) + DINOv3 (structure) → i2L decoder → LoRA weights
Multi-image merge: Following DiffSynth-Studio's strategy, multiple images are merged by concatenation (not averaging), increasing the effective LoRA rank. Each image contributes rank 4, so N images produce rank 4·N.
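Following that strategy, the concatenation merge can be sketched in NumPy (an illustrative stand-in for the MLX arrays used in the actual pipeline; key names and shapes are assumptions):

```python
import numpy as np

def merge_lora_weights(per_image: list[dict[str, np.ndarray]]) -> dict[str, np.ndarray]:
    """Merge per-image rank-4 LoRAs by concatenation, mirroring the
    DiffSynth-Studio merge_lora() behavior described above:
    lora_A grows along axis 0, lora_B along axis 1, and lora_A is
    scaled by alpha = 1/N for N images."""
    n = len(per_image)
    merged: dict[str, np.ndarray] = {}
    for key in per_image[0]:
        mats = [weights[key] for weights in per_image]
        if "lora_A" in key:
            merged[key] = np.concatenate(mats, axis=0) / n  # shape (4*N, in_dim)
        else:
            merged[key] = np.concatenate(mats, axis=1)      # shape (out_dim, 4*N)
    return merged
```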
Performance (M2 Ultra, 64GB)

| Stage | Time |
| --- | --- |
| Model loading (cached) | 1.6s |
| Encoding 4 images | 1.9s |
| LoRA decoding | 0.1s |
| Total | 3.6s |
Models Implemented
- SigLIP2-G384 (8 files) — DiffSynth-Studio/General-Image-Encoders
- DINOv3-7B (8 files) — DiffSynth-Studio/General-Image-Encoders
- i2L Decoder (1 file) — DiffSynth-Studio/Z-Image-i2L

Weight Loading
Custom standalone loader handles three HuggingFace repos:
- splits in_proj_weight into Q/K/V projections, transposes Conv2d weights
- renames lambda1 → gamma for LayerScale, transposes Conv2d weights

All weights are downloaded via hf_hub_download and cached in ~/.cache/huggingface/hub.

LoRA Compatibility
Generated LoRA files use the standard mflux naming convention, e.g. layers.0.attention.to_q.lora_A.weight.
Also added .default. pattern variants to ZImageLoRAMapping so LoRA files from DiffSynth-Studio's Space (which use lora_A.default.weight) can be loaded directly by mflux without renaming.

Files Changed
New files
- src/mflux/models/z_image/cli/z_image_i2l.py
- src/.../z_image_i2l/siglip2/ (7 files)
- src/.../z_image_i2l/dinov3/ (7 files)
- src/.../z_image_i2l/i2l_decoder/i2l_decoder.py
- src/.../z_image_i2l/i2l_pipeline.py
- src/.../z_image_i2l/i2l_weight_loader.py
- src/.../z_image_i2l/i2l_image_preprocessor.py

Modified files

- pyproject.toml — mflux-z-image-i2l entry point
- src/.../z_image_lora_mapping.py — .default. pattern variants

Test files

- test_i2l_weight_loading.py
- test_i2l_e2e.py

Notes
- The DiffSynth-Studio/Z-Image-i2L checkpoint produces base rank 4 LoRA. A higher-rank model (ZImageImage2LoRAModelCompressed, rank 32) exists in the codebase but has not been published as a checkpoint.
- …positive_only_lora (not yet implemented in mflux). However, the LoRA also works with Z-Image Turbo.