We would like feedback from the community on this rough plan for Q4. This is of course a work in progress, and we welcome feedback at any time. Please add comments below or on any specific issues. We will edit this description as plans change.
Also, if you're interested in contributing, feel free to dive into any of the unassigned issues!
Soul
These are the broad areas of focus for the quarter. Items in the roadmap below are tagged by “soul item”.
- [Testing]: Better testing and CI infrastructure to prevent build breaks and accuracy issues at the framework level
- [Model Optimization]: DeepSeek-R1, GPT-OSS, Qwen3, Qwen3-Next, MiniCPM4.1-8B, and others
- [API Usability]: API cleanup and refactoring for better user experience
October
- [Model Optimization] DSR1 improvements (details TBD)
- [Model Optimization] Update the routing for TRTLLMGEN to support Kimi K2 and Qwen #1831
- [Model Optimization] [Feature Request] Gated Delta Net #1690
- [Model Optimization] GPT-OSS perf improvements for max throughput case
- [Model Optimization] Native Sparse Attention
- [Model Optimization] [Feature Request] TopK Sparse Attention #1691
- [API Usability] [Feature Request] "auto" backend for mm_fp4 #1722
- [Testing] Expanded CI coverage per-PR (ability to trigger tests on NVIDIA-internal test infrastructure, including various Blackwell devices)
- [Testing] Initial integration testing: e2e functional sanity checks
- [Testing] Add comprehensive test xfails tracking system and analysis report #1733
- [Model Optimization] Non-gated MoE with squared ReLU activation
- [Model Optimization] [Perf] FP4 MoE on B200 (latency) #1734
- [Model Optimization] [Perf] FP4 GEMM on B200 #1732
- [Model Optimization] Add RoPE, RoPE+Q, RoPE+Q+KVCacheUpdate fused kernels for MLA/GQA/MHA
- [API Usability] Uniform behavior when a (backend, target device, problem shape) is not supported
- [API Usability] Clearer specification of functional and performance support per SM architecture across interfaces/backends
- [API Usability] refactor: using tvm-ffi for multi-platform bindings #1641
- [API Usability] Unifying quantization related modules (fp4 quantize/quantize)
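For readers unfamiliar with the "Non-gated MoE with squared ReLU activation" item above, here is a minimal pure-Python sketch of what a single non-gated expert computes. It is illustrative only: the function names (`squared_relu`, `matvec`, `non_gated_expert`) and the tiny weights are hypothetical and are not FlashInfer APIs. A gated expert would additionally multiply the activated projection by a second "up" projection; the non-gated form has only one.

```python
def squared_relu(x: float) -> float:
    """Squared ReLU: max(x, 0) squared."""
    r = max(x, 0.0)
    return r * r


def matvec(w, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def non_gated_expert(x, w_up, w_down):
    """Non-gated expert FFN: down(act(up(x))), with no separate gate projection."""
    hidden = [squared_relu(h) for h in matvec(w_up, x)]
    return matvec(w_down, hidden)


# Hypothetical toy shapes: input dim 2, hidden dim 3, output dim 1.
x = [1.0, -2.0]
w_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_down = [[1.0, 1.0, 1.0]]
out = non_gated_expert(x, w_up, w_down)
```

In a real MoE layer this computation runs per expert on the tokens routed to it, and the kernels on this roadmap would fuse and quantize these steps; the sketch only shows the arithmetic.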
November
- [Testing] [API Usability] Minimal example of deploying an LLM with FlashInfer APIs (e.g. through gpt-fast) #1811
- [Model Optimization] Cosmos Reasoning 7B (details TBD)
- [Testing] Improved unit testing based on escape analysis
- [Testing] Improved integration testing based on escape analysis
- [Model Optimization] MXFP4 GEMM perf improvements
- [API Usability] Support FP8-qkv FP8/FP4-output trtllm-gen in FlashInfer prefill/decode wrapper
- [API Usability] Unify qk_scale and o_scale Behavior Between trtllm-gen Attention and flashinfer-jit Attention
- [API Usability] Fused MoE general improvements, including (but not limited to):
- [API Usability] Attention API consolidation
- [API Usability] Inaccurate API Docstrings for Attention Prefill #1709