What's Changed
- perf: improve sampling/mask/softmax performance (part 1/2) by @yzh119 in #2044
- misc: Add XQA decode to microbenchmark for sm90 and sm120 by @bkryu in #2055
- test: Skip unsupported SM Archs for newly added trtllm MoE test by @bkryu in #2060
- feat: suitable_auto_backends to prune auto backends, bmm_fp8 refactor, heuristic_func intake by @jimmyzho in #2029
- update trtllm cutlass moe by @nv-yunzheq in #2020
- perf: Optimize helper max/minmax function in sampling.cuh by @bkryu in #2058
- [DSV3] Optimized Router Gemm by @nvmbreughe in #2019
- Fix moe fp8 failure for sm121 by @yongwww in #2061
- perf: TRT-LLM MoE Block-FP8 activation optimization by @nekorobov in #2063
- [feat] Refactor trtllm-gen MoE and add BF16 trtllm-gen MoE by @jiahanc in #2014
- fix: several bugs in trtllm-gen attention kernels by @PerkzZheng in #2062
- refactor: remove MetaInfoHash class by @yzh119 in #2064
- chore: Update CODEOWNERS by @flashinfer-bot in #2067
- feat: add xqa mla backend by @qsang-nv in #2053
- Enable renormalize (naive) routing for fp8 per-tensor by @IwakuraRein in #2030
- unittest: improve the efficiency of xqa unittests by @yzh119 in #2075
- minor: canonicalize TFLOPS calculation by @Edenzzzz in #2069
- fix: test_trtllm_gen_attention when max_seq_len < page_size by @dongjiyingdjy in #2076
- enable xqa fp8 output by @qsang-nv in #2081
- chore: update requires-python in pyproject.toml by @raayandhar in #2080
- [Test] Optimize test_trtllm_gen_fused_moe.py by @jiahanc in #2072
- test: Change incorrect inputs in test_hopper.py by @bkryu in #2083
- [NVIDIA] Thor & Spark Support by @johnnynunez in #2028
- [API change] deprecate tile_token_dim in trtllm_moe by @jiahanc in #2086
- [Feature] Support batch prefill for POD Attention by @AKKamath in #2079
- Patch sm103 for 3xfp4 moe generation by @aleozlx in #2082
- MNNVL All Reduce for large numbers of tokens by @nvmbreughe in #2074
- perf: TRT-LLM Gen finalize kernel optimization by @nekorobov in #2092
- refactor: update dpsk fused_moe test [1] by @yyihuang in #2088
- chore: update thor cuda arch (from 110f to 110a) by @yzh119 in #2096
- perf: enable pdl for cutlass fp4 gemm by @yzh119 in #2095
- chore: Update CODEOWNERS by @flashinfer-bot in #2098
- feat: Add flashinfer.rope.rope_quantize_fp8_append_paged_kv_cache (fused RoPE + Q + KV cache, supports MLA/GQA/MHA) by @kahyunnam in #2037
- [API change] Allow using torch.Tensor for scales for trtllm-gen attention by @IwakuraRein in #2084
- refactor: update dpsk fused_moe test [2] by @yyihuang in #2097
- hotfix: rename moe/test_utils.py to moe/utils.py by @yzh119 in #2106
- [DSR1] Added MLA test by @nvmbreughe in #2100
- test: Enable testing for trtllm-gen decode bs1 by @bkryu in #2103
- [DSV3] Optimized routing kernels by @nv-yunzheq in #2099
- feat: make the LSE returned by MLA support base 2 or base e (#2113) by @staugust in #2114
- update xqa license by @qsang-nv in #2117
- add tensor scale input for xqa by @qsang-nv in #2110
- hotfix: add 9.0a to README and installation doc by @yzh119 in #2112
- ci/cd: add nvidia-ml-py to the build-system requirements of flashinfer-cubin by @yzh119 in #2123
New Contributors
- @nekorobov made their first contribution in #2063
- @dongjiyingdjy made their first contribution in #2076
- @johnnynunez made their first contribution in #2028
- @staugust made their first contribution in #2114
Full Changelog: v0.5.2...v0.5.3