Hybrid Linear Attention
- hybrid_linear_attn attention refactor: Remove hybrid_linear_attn attention backend and refactor attention registry (#10816)
GDN Kernels
Not arch-specific
- Optimize Triton GDN decode (1): Optimize GDN decode for Qwen3 Next (#17094)
- Optimize Triton GDN decode (2): [Qwen3Next] Optimize fused_sigmoid_gating_delta_rule_update_kernel (#18271)
SM90
- Prefill and decode in CUTLASS (from FlashInfer): feat(gdn): add FlashInfer K-last SSM layout support for GDN prefill and decode for Hopper (#18361)
SM100
- Prefill in Gluon: [Qwen3-Next] Optimize Prefill Kernel, add GDN Gluon kernel and optimize cumsum kernel (#17983)
- FlashInfer CuteDSL decode kernel: [jit-kernel] Add CuTe DSL GDN Decode Kernel (#15631)
- Another CuteDSL decode kernel: [Qwen3-Next] Add cutedsl decode/mtp kernel with transposed ssm_state and prefill gluon kernel for Blackwell (#17981)
- New, faster CuteDSL decode kernel upstream in FlashInfer: Ameyn/gdn decode cutedsl kernel (flashinfer-ai/flashinfer#2498)
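
For orientation, here is a minimal, unfused PyTorch sketch of the gated delta rule recurrence these GDN decode kernels implement, one token at a time. The shapes, gating conventions, and names (`S`, `alpha`, `beta`) are illustrative assumptions, not any kernel's actual signature; the kernels above fuse the gating, rank-1 update, and readout, and batch across heads and sequences.

```python
import torch

def gdn_decode_step(S, q, k, v, alpha, beta):
    """One token of the gated delta rule (reference, unfused).

    S:     (d_v, d_k) recurrent state for one head
    q, k:  (d_k,) query / L2-normalized key
    v:     (d_v,) value
    alpha: scalar decay gate in (0, 1), e.g. from a sigmoid
    beta:  scalar write strength in (0, 1), e.g. from a sigmoid
    """
    # S <- alpha * S @ (I - beta * k k^T) + beta * v k^T
    S = alpha * S
    S = S - beta * torch.outer(S @ k, k)  # erase the old value stored under k
    S = S + beta * torch.outer(v, k)      # write the new value under k
    return S, S @ q                       # readout for this token's query
```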
MoE
- BF16 CUTLASS fused MoE for Hopper: Add support for bf16 x bf16 cutlass fused MoE (#10275)
- BF16 TRTLLM MoE for Blackwell: [NVIDIA] Enable TRTLLM BF16 MoE on Blackwell GPUs (#13798)
- NVFP4 TRTLLM-GEN MoE for Blackwell: [Feat][NVFP4] Enable NVFP4 MoE for Qwen series models (e.g. Qwen3-Next) (#13761)
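
As a reminder of what these fused kernels compute, here is a naive top-k routed MoE reference in PyTorch. The SwiGLU activation and router renormalization follow Qwen-style MoE, but the weight names and `top_k` default are assumptions for illustration; the fused kernels replace the per-token Python loop with grouped/batched GEMMs in the listed precisions.

```python
import torch
import torch.nn.functional as F

def moe_reference(x, router_w, w_gate, w_up, w_down, top_k=8):
    """Naive top-k MoE forward pass (what a fused MoE kernel computes).

    x:            (n_tokens, d_model)
    router_w:     (d_model, n_experts)
    w_gate, w_up: (n_experts, d_model, d_ff)  SwiGLU expert weights
    w_down:       (n_experts, d_ff, d_model)
    """
    probs = F.softmax(x @ router_w, dim=-1)
    topk_p, topk_e = torch.topk(probs, top_k, dim=-1)
    topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)  # renormalize over chosen experts
    out = torch.zeros_like(x)
    for t in range(x.size(0)):                          # fused kernels batch this loop
        for p, e in zip(topk_p[t], topk_e[t]):
            h = F.silu(x[t] @ w_gate[e]) * (x[t] @ w_up[e])
            out[t] += p * (h @ w_down[e])
    return out
```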
Full Attention
- Enable XQA for SM90 and SM120 (#17115)
- TRTLLM_MHA backend for Blackwell: Allow use of TRTLLM_MHA backend for hybrid attention on Blackwell (#11138)
- [Qwen3.5] Set full attn_backend to trtllm_mha on SM100 by default when possible (#19030)
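
For context, full attention runs on only a minority of layers in these hybrid models, which is where the backends above (XQA, TRTLLM_MHA) apply. A small sketch of the interleaving, assuming the 3:1 linear-to-full ratio Qwen3-Next uses; the period is illustrative, not a general rule:

```python
# Most layers use linear attention (GDN); every FULL_ATTN_PERIOD-th layer
# uses full attention. 4 matches Qwen3-Next's 3:1 ratio; treat the value
# as an assumption for other hybrid models.
FULL_ATTN_PERIOD = 4

def layer_type(layer_idx: int) -> str:
    if (layer_idx + 1) % FULL_ATTN_PERIOD == 0:
        return "full_attention"
    return "linear_attention"
```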
Communications
- Fix regression caused by enabling symmetric memory: Register tensors with symmetric memory for qwen (#18643)
NVFP4 kv cache
- NVFP4 KV Cache for SM120 (#18314)
- FP4 KV cache for SM100: [WIP] FP4 KV Cache on B200 (#17733)
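
Roughly what an NVFP4 KV cache stores per 16-element block: 4-bit e2m1 codes plus an FP8 (e4m3) block scale. A hedged sketch of the quantization step follows; bit-packing (two codes per byte) and NVFP4's second-level per-tensor scale are omitted, and the helper names are ours, not the kernels'.

```python
import torch

E2M1 = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # fp4 magnitudes

def quantize_nvfp4_block(x):
    """Quantize one 16-element block to fp4 codes + an fp8 block scale."""
    scale = x.abs().max() / E2M1[-1]                          # map max |x| to 6.0
    scale = scale.to(torch.float8_e4m3fn).to(torch.float32)   # scale is stored as e4m3
    y = x / scale.clamp(min=1e-12)
    codes = (y.abs().unsqueeze(-1) - E2M1).abs().argmin(dim=-1)  # nearest fp4 magnitude
    return codes, y.signbit(), scale

def dequantize_nvfp4_block(codes, signs, scale):
    mags = E2M1[codes] * scale
    return torch.where(signs, -mags, mags)
```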
Runtime
- Piecewise CUDA graph for TRTLLM-GEN backends on Blackwell: Add piecewise cuda graph for Qwen3-Next FP8 flashinfer_trtllm moe backend (#18184)
- MTP v2: Add Spec V2 for Qwen 3 next (#15591); [Draft][Spec V2] Support specV2 for mamba hybrid attention (#18808)
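
For background on the piecewise-CUDA-graph item: "piecewise" means capturing only the graph-safe segments of the forward pass and running the rest eagerly between replays. The standard PyTorch capture/replay idiom it builds on is below; the segment function and shapes are stand-ins.

```python
import torch

static_input = torch.zeros(8, 4096, device="cuda")

def decode_segment(x):
    # Stand-in for one graph-safe segment of the decode forward pass.
    return x * 2.0

# Warm up on a side stream before capture, per the CUDA graphs contract.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    decode_segment(static_input)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):                 # capture the segment once
    static_output = decode_segment(static_input)

static_input.copy_(torch.randn_like(static_input))
g.replay()                                # re-launch captured kernels on the new data
```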
Qwen3-Next
- [Qwen3-Next] Enable fused_qkvzba_split_reshape_cat also for prefill (#18917)
- Hugging Face NVFP4 model support: [feat] Support nvfp4 quantized model of Qwen3-Next (#17627)
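
A hypothetical sketch of the operation fused_qkvzba_split_reshape_cat fuses: the GDN block projects hidden states once, then splits the result into q, k, v, the output gate z, and the per-head write-strength (b) and decay (a) inputs. The split sizes and ordering below are assumptions for illustration, not the kernel's actual layout.

```python
import torch

def split_qkvzba(fused, d_qk, d_v, n_heads):
    # Hypothetical unfused equivalent: one fused projection output split
    # into the six GDN operands. Sizes and order are illustrative.
    q, k, v, z, b, a = torch.split(
        fused, [d_qk, d_qk, d_v, d_v, n_heads, n_heads], dim=-1)
    return q, k, v, z, b, a
```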
Qwen3.5