What's Changed
- perf: improve sampling/mask/softmax performance (part 1/2) by @yzh119 in #2044
- misc: Add XQA decode to microbenchmark for sm90 and sm120 by @bkryu in #2055
- test: Skip unsupported SM Archs for newly added trtllm MoE test by @bkryu in #2060
- feat: suitable_auto_backends to prune auto backends, bmm_fp8 refactor, heuristic_func intake by @jimmyzho in #2029
- update trtllm cutlass moe by @nv-yunzheq in #2020
- perf: Optimize helper max/minmax function in sampling.cuh by @bkryu in #2058
- [DSV3] Optimized Router Gemm by @nvmbreughe in #2019
- Fix moe fp8 failure for sm121 by @yongwww in #2061
- perf: TRT-LLM MoE Block-FP8 activation optimization by @nekorobov in #2063
- [feat] Refactor trtllm-gen MoE and add BF16 trtllm-gen MoE by @jiahanc in #2014
- fix: several bugs in trtllm-gen attention kernels by @PerkzZheng in #2062
- refactor: remove MetaInfoHash class by @yzh119 in #2064
- chore: Update CODEOWNERS by @flashinfer-bot in #2067
- feat: add xqa mla backend by @qsang-nv in #2053
- Enable renormalize (naive) routing for fp8 per-tensor by @IwakuraRein in #2030
- unittest: improve the efficiency of xqa unittests by @yzh119 in #2075
- minor: canonicalize TFLOPS calculation by @Edenzzzz in #2069
- fix: test_trtllm_gen_attention when max_seq_len < page_size by @dongjiyingdjy in #2076
- enable xqa fp8 output by @qsang-nv in #2081
- chore: update requires-python in pyproject.toml by @raayandhar in #2080
- [Test] Optimize test_trtllm_gen_fused_moe.py by @jiahanc in #2072
- test: Change incorrect inputs in test_hopper.py by @bkryu in #2083
- [NVIDIA] Thor & Spark Support by @johnnynunez in #2028
- [API change] deprecate tile_token_dim in trtllm_moe by @jiahanc in #2086
- [Feature] Support batch prefill for POD Attention by @AKKamath in #2079
- Patch sm103 for 3xfp4 moe generation by @aleozlx in #2082
- MNNVL All Reduce for large numbers of tokens by @nvmbreughe in #2074
- perf: TRT-LLM Gen finalize kernel optimization by @nekorobov in #2092
- refactor: update dpsk fused_moe test [1] by @yyihuang in #2088
- chore: update thor cuda arch (from 110f to 110a) by @yzh119 in #2096
- perf: enable pdl for cutlass fp4 gemm by @yzh119 in #2095
- chore: Update CODEOWNERS by @flashinfer-bot in #2098
- feat: Add flashinfer.rope.rope_quantize_fp8_append_paged_kv_cache (fused RoPE + Q + KV cache, supports MLA/GQA/MHA) by @kahyunnam in #2037
- [API change] Allow using torch.Tensor for scales for trtllm-gen attention by @IwakuraRein in #2084
- refactor: update dpsk fused_moe test [2] by @yyihuang in #2097
- hotfix: rename moe/test_utils.py to moe/utils.py by @yzh119 in #2106
- [DSR1] Added MLA test by @nvmbreughe in #2100
- test: Enable testing for trtllm-gen decode bs1 by @bkryu in #2103
- [DSV3] Optimized routing kernels by @nv-yunzheq in #2099
- feat: make the LSE returned by MLA support base 2 or base e (#2113) by @staugust in #2114
- update xqa license by @qsang-nv in #2117
- add tensor scale input for xqa by @qsang-nv in #2110
- hotfix: add 9.0a to README and installation doc by @yzh119 in #2112
- ci/cd: add nvidia-ml-py to the build-system requirements of flashinfer-cubin by @yzh119 in #2123
New Contributors
- @nekorobov made their first contribution in #2063
- @dongjiyingdjy made their first contribution in #2076
- @johnnynunez made their first contribution in #2028
- @staugust made their first contribution in #2114
Full Changelog: v0.5.2...v0.5.3