Fused Qwen3 MoE layer with LoRA #2622
-
Nice, thanks for the info. Out of curiosity, what part of the MoE layer requires special handling?
-
The Qwen3 MoE model (and all other MoE models) in HF Transformers is notoriously slow, because the forward pass runs a Python for loop over the experts, launching one small matmul per expert. The critical part of my repo is to implement the fused MoE linear layer, which computes all the experts together.
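For reference, the HF-style pattern looks roughly like this (a schematic paraphrase of the loop, not the actual Transformers code):

```python
import torch

def moe_block_loop(hidden, router_logits, experts, top_k):
    """Schematic HF-style MoE forward: one small matmul per expert inside a Python loop."""
    # hidden: (num_tokens, hidden_dim); router_logits: (num_tokens, num_experts)
    weights, indices = torch.topk(torch.softmax(router_logits, dim=-1), top_k, dim=-1)
    out = torch.zeros_like(hidden)
    for e, expert in enumerate(experts):  # this loop is the bottleneck
        token_idx, k_idx = torch.where(indices == e)
        if token_idx.numel() == 0:
            continue
        out[token_idx] += weights[token_idx, k_idx].unsqueeze(-1) * expert(hidden[token_idx])
    return out
```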
Currently I've only written a PyTorch implementation, which allocates an intermediate array. It should be possible to write a Triton kernel and not allocate any intermediate memory. An AI can quickly write the kernel, but I still need to optimize a few things like the layout arrangements. (Update: this is done!) The rest is some boilerplate code to make it compatible with the Qwen3 MoE model in HF Transformers, and I've written a custom LoRA layer for it via the dynamic dispatch API.
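To make the fused idea concrete, here is a rough PyTorch sketch (my own illustration of the technique, not the repo's actual code; the stacked weight layouts `w_gate_up: (E, H, 2*I)` and `w_down: (E, I, H)` are assumptions). Tokens are sorted by expert so each expert's matmul runs over one contiguous slice, at the cost of the gathered intermediate buffer; a Triton grouped-GEMM kernel can replace the remaining loop and the buffer entirely:

```python
import torch
import torch.nn.functional as F

def moe_block_fused(hidden, router_logits, w_gate_up, w_down, top_k):
    """Schematic fused MoE forward: tokens sorted by expert, contiguous per-expert GEMMs."""
    # hidden: (T, H); w_gate_up: (E, H, 2*I); w_down: (E, I, H) -- assumed stacked layouts
    T, H = hidden.shape
    E = w_gate_up.shape[0]
    weights, indices = torch.topk(torch.softmax(router_logits, dim=-1), top_k, dim=-1)
    flat_expert = indices.reshape(-1)                      # (T*top_k,)
    order = torch.argsort(flat_expert)                     # group token copies by expert
    token_idx = torch.arange(T, device=hidden.device).repeat_interleave(top_k)[order]
    x = hidden[token_idx]                                  # the gathered intermediate buffer
    w = weights.reshape(-1)[order].unsqueeze(-1)           # routing weights, same order
    counts = torch.bincount(flat_expert, minlength=E).tolist()
    out = torch.zeros_like(hidden)
    start = 0
    for e, n in enumerate(counts):                         # contiguous slices, one GEMM each
        if n == 0:
            continue
        seg = slice(start, start + n)
        gate, up = (x[seg] @ w_gate_up[e]).chunk(2, dim=-1)
        y = (F.silu(gate) * up) @ w_down[e]
        out.index_add_(0, token_idx[seg], w[seg] * y)
        start += n
    return out
```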
-
Oh hi hi! Again, nice work on this! Also thanks for utilizing Unsloth kernels - we haven't released or announced them yet, so it's always cool to see community members utilizing them! Just a small note: the MoE kernels are licensed as AGPLv3 - we decided to make Unsloth dual licensed, so all code under the kernels folder is AGPLv3. The main reason is that many other packages and companies copy and paste kernels from our repos without any credit (i.e. no acknowledgements or license copyright mentions), and we tried doing linking via LGPL with no success, since people would sneakily fork the LGPL package and link to their fork. More details here: unslothai/unsloth#2890 (reply in thread)
-
https://github.com/woct0rdho/transformers-qwen3-moe-fused
I'm working on implementing a fused Qwen3 MoE layer, which focuses on fine-tuning on a single GPU, while being compatible with the HF Transformers ecosystem. Just want to let you know that it's a use case of the LoRA dynamic dispatch API.
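For context, a minimal sketch of how such a fused module can be hooked into PEFT's LoRA dynamic dispatch. Everything here is hypothetical stand-in code, not the repo's actual implementation: `FusedMoeLinear`, `LoraFusedMoeLinear`, and the `"experts"` module name are made up, and `_register_custom_module` is PEFT's experimental registration hook for custom LoRA layers.

```python
import torch
from peft import LoraConfig
from peft.tuners.lora.layer import LoraLayer

class FusedMoeLinear(torch.nn.Module):
    """Hypothetical fused expert layer: all expert weights stacked in one (E, out, in) tensor."""
    def __init__(self, num_experts, in_features, out_features):
        super().__init__()
        self.in_features, self.out_features = in_features, out_features
        self.weight = torch.nn.Parameter(torch.empty(num_experts, out_features, in_features))

class LoraFusedMoeLinear(torch.nn.Module, LoraLayer):
    """Hypothetical LoRA wrapper: per-expert low-rank factors stacked the same way."""
    def __init__(self, base_layer, adapter_name, r=8, lora_alpha=16, **kwargs):
        super().__init__()
        LoraLayer.__init__(self, base_layer)
        num_experts, out_f, in_f = base_layer.weight.shape
        # Stacked LoRA factors: A is (E, r, in), B is (E, out, r); forward/merge omitted here.
        self.lora_A_stacked = torch.nn.ParameterDict(
            {adapter_name: torch.nn.Parameter(torch.zeros(num_experts, r, in_f))})
        self.lora_B_stacked = torch.nn.ParameterDict(
            {adapter_name: torch.nn.Parameter(torch.zeros(num_experts, out_f, r))})
        self.scaling[adapter_name] = lora_alpha / r

# Tell PEFT to dispatch FusedMoeLinear targets to the custom wrapper.
config = LoraConfig(target_modules=["experts"], r=8, lora_alpha=16)  # module name assumed
config._register_custom_module({FusedMoeLinear: LoraFusedMoeLinear})
```

With the mapping registered, `get_peft_model(model, config)` should wrap the matching fused modules with the custom LoRA layer instead of falling back to the built-in ones.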