torchtitan is a very exciting project, and its support for native PyTorch models makes it more researcher-friendly. I am currently working on a Mamba + MoE + sparse-attention architecture, and I want to train a model with 15B total parameters and about 1.5B active parameters. However, I am not sure whether expert parallelism is well supported, and whether architectures like Mamba are fully compatible. I would greatly appreciate any suggestions.
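For context on the 15B-total / ~1.5B-active split, here is a quick back-of-envelope sketch of how an MoE FFN stack alone can land in that range. All hidden sizes, expert counts, and depth below are my own assumptions for illustration, not the actual architecture; attention/Mamba and embedding parameters are ignored.

```python
# Illustrative parameter count for an MoE FFN stack with top-k routing.
# Every number here is an assumption, not the real model config.

def moe_ffn_params(d_model: int, d_ff: int, n_experts: int) -> int:
    """Total parameters in one MoE FFN layer (SwiGLU-style: 3 weight matrices per expert)."""
    return n_experts * 3 * d_model * d_ff

def active_ffn_params(d_model: int, d_ff: int, top_k: int) -> int:
    """Parameters actually used per token when the router selects top_k experts."""
    return top_k * 3 * d_model * d_ff

if __name__ == "__main__":
    d_model, d_ff = 2048, 1408      # assumed hidden / expert-FFN width
    n_experts, top_k = 64, 6        # assumed routing configuration
    n_layers = 28                   # assumed depth

    total_ffn = n_layers * moe_ffn_params(d_model, d_ff, n_experts)
    active_ffn = n_layers * active_ffn_params(d_model, d_ff, top_k)
    print(f"MoE FFN params, total:  {total_ffn / 1e9:.2f} B")   # ~15.5 B
    print(f"MoE FFN params, active: {active_ffn / 1e9:.2f} B")  # ~1.45 B
```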