Add wan 2.1 model #1409

Open

belkakari wants to merge 8 commits into ml-explore:main from belkakari:wan-2.1

Conversation

@belkakari

@belkakari belkakari commented Mar 11, 2026

This PR adds WAN 2.1 text-to-video and image-to-video model support, with optimizations such as TeaCache and support for step-distilled models. Based on the original WAN 2.1 implementation and LightX2V.

Basic commands:

1.3B text-to-video

python txt2video.py 'A cat playing piano' --output out.mp4

14B text-to-video

python txt2video.py 'A cat playing piano' --model t2v-14B --quantize --output out_14B.mp4

14B image-to-video

python img2video.py 'Astronaut riding a horse' \
   --image ./inputs/astronaut-on-a-horse.png --quantize --output out_i2v.mp4

Step distilled models:
T2V

wget https://huggingface.co/lightx2v/Wan2.1-Distill-Models/resolve/main/wan2.1_t2v_14b_lightx2v_4step.safetensors
python txt2video.py 'A cat playing piano' \
    --model t2v-14B --checkpoint ./wan2.1_t2v_14b_lightx2v_4step.safetensors \
    --sampler euler --steps 4 --guidance 1.0 \
    --quantize --output out_t2v_distilled.mp4

I2V

wget https://huggingface.co/lightx2v/Wan2.1-Distill-Models/resolve/main/wan2.1_i2v_480p_lightx2v_4step.safetensors
python img2video.py 'Astronaut riding a horse' \
    --image ./inputs/astronaut-on-a-horse.png --checkpoint ./wan2.1_i2v_480p_lightx2v_4step.safetensors \
    --sampler euler --steps 4 --guidance 1.0 --shift 5.0 \
    --quantize --output out_i2v_distilled.mp4

Member

@angeloskath left a comment


Great work @belkakari !

I left quite a few comments, a lot of them stylistic and some of them performance related.

One more general comment: the weights are in float32. That seems unnecessary and will impact performance as well as memory. It also hides possible upcasting if someone wants the computation to happen in bf16 or another dtype, which will be much faster on M5s for instance.

def _project_time_fn(e, w, b):
    x = nn.silu(e)
    x = mx.matmul(x, w.T) + b
    return x
Member


Almost certainly no need to compile the above. Using nn.Linear layers would be more understandable and the same speed. It would also use mx.addmm, which fuses the x @ w.T + b into one op. For the sinusoidal embedding you can use nn.SinusoidalPositionalEncoding. You can wrap them in lists or nn.Sequential like their PyTorch counterparts. It would also help with quantization, as they would be automatically quantized, which won't quite happen now (unless you implement a custom to_quantized function).

self.patch_embedding_weight = mx.random.normal((dim, *patch_size, in_dim)) * (
    1.0 / (in_dim * math.prod(patch_size)) ** 0.5
)
self.patch_embedding_bias = mx.zeros((dim,))
Member


Same goes here: why not nn.Conv3d?

    eps,
    cross_attn_type=model_type,
)
setattr(self, f"block_{i}", block)
Member


Why not a list? This just makes your life hard when you want to iterate over them, which you always will want to.

value = mx.transpose(value, (0, 2, 3, 4, 1))

# blocks.N -> block_N
new_key = re.sub(r"blocks\.(\d+)\.", r"block_\1.", new_key)
Member


Most of these are not needed when the blocks are put in a list and the ffn in an nn.Sequential (or a list), etc.


# Merge separate Q/K/V into QKV for self-attention,
# and K/V into KV for cross-attention
remapped = WanModel._merge_qkv_weights(remapped)
Member


That is good, especially for distributed inference later, but generally it doesn't provide much of a speedup. The q, k, v projections will happen in parallel on the GPU anyway.

Just to be clear, a good optimization but probably not the lowest hanging fruit.

Member


The lowest hanging fruit is probably updating the modulation parameter in the layernorm to contain the 1 +, so that the layernorm can just use self.modulation + e directly; see comments in layers.py.

Author


Got it, added the modulation as you suggested.

)
parser.add_argument(
    "--n-prompt",
    default="镜头晃动,色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走",
Member


Same as in txt2img.py.

# Register all layers via setattr
for stage_idx, stage in enumerate(self.upsamples):
    for layer_idx, (layer_type, layer) in enumerate(stage):
        setattr(self, f"upsample_s{stage_idx}_l{layer_idx}_{layer_type}", layer)
Member


Same comment as for model.py: no need for these to be set as attributes, they can be nested lists.

Comment on lines +423 to +431
if compile and i == 1 and self._compiled_decode is None:
    self._compiled_decode = mx.compile(self.decoder._forward_functional)

if self._compiled_decode is not None:
    out_frame, feat_cache = self._compiled_decode(frame, feat_cache)
else:
    out_frame, feat_cache = self.decoder._forward_functional(
        frame, feat_cache
    )
Member


Why not always compile?

Comment on lines +420 to +421
for i in range(num_frames):
    frame = x[:, i : i + 1, :, :, :]
Member


Why not batch it? For memory saving?

Author


Yes, if the video is long enough it won't fit into memory. We could make the batch size a configurable additional parameter with default=1, wdyt?

Comment on lines +318 to +324
scale = 1.0 / dim**0.5
self.to_qkv_weight = mx.random.uniform(
    low=-scale, high=scale, shape=(dim * 3, 1, 1, dim)
)
self.to_qkv_bias = mx.zeros((dim * 3,))
self.proj_weight = mx.zeros((dim, 1, 1, dim))
self.proj_bias = mx.zeros((dim,))
Member


Same as before, should just be linear layers.

@belkakari belkakari force-pushed the wan-2.1 branch 3 times, most recently from 2a3ffb5 to e4cd847 Compare March 24, 2026 13:30
@belkakari belkakari requested a review from angeloskath March 24, 2026 13:32