Releases: sgl-project/sgl-kernel-npu
2026.02.01.post2
What's Changed
- Build the deepep package with the chip model included. by @oagniqgnat in #274
- fix:buffer control by @Yael-X in #361
- Revert " Build the deepep package with the chip model included." by @kaniel-outis in #363
- reset ci -- run test mixed running for experts on a2. by @zhuyutong332 in #365
- adapt ant moving to A2 single machine by @luanyundu in #362
- Fix the bug that total expert num greater than 256 or local expert num is less than 8 by @luanyundu in #364
- CI execution requirements for separating a2 and a3 by @zhuyutong332 in #367
- support qwen3.5 by @chenxu214 in #377
- Update layernorm_gated.py by @chenxu214 in #378
New Contributors
- @chenxu214 made their first contribution in #377
Full Changelog: 2026.02.01...2026.02.01.post2
2026.02.01.post1
What's Changed
- Build the deepep package with the chip model included. by @oagniqgnat in #274
- fix:buffer control by @Yael-X in #361
- Revert " Build the deepep package with the chip model included." by @kaniel-outis in #363
- reset ci -- run test mixed running for experts on a2. by @zhuyutong332 in #365
- adapt ant moving to A2 single machine by @luanyundu in #362
- Fix the bug that total expert num greater than 256 or local expert num is less than 8 by @luanyundu in #364
- CI execution requirements for separating a2 and a3 by @zhuyutong332 in #367
- 【WIP】GLM by @cen121212 in #373
New Contributors
- @cen121212 made their first contribution in #373
Full Changelog: 2026.02.01...2026.02.01.post1
2026.02.01
What's Changed
- Fix notify dispatch cann8.3 by @oagniqgnat in #245
- fix normal and low_latency layered rdma_data_size when mixed running by @zuje123 in #246
- fixing release ci by @BourneSun0527 in #248
- add a script for generalize test by @goosj in #131
- sgl-kernel-npu add release version by @iforgetmyname in #253
- modify md file by @BourneSun0527 in #255
- add long sequence feature for normal deep_ep by @oagniqgnat in #254
- [fix] fixup bug in conv1d_update_fn by @zhuyijie88 in #259
- qwen3-next op optimize by @shengzhaotian in #257
- [Bugfix] fix TorchNpuHelper rename bugs by @ltcs11 in #265
- Fixing Chinese character encoding issues by @oagniqgnat in #275
- Add the long-sequence ant migration feature for the prefill combine operator. by @oagniqgnat in #267
- add deepep a2 doc by @zuje123 in #277
- prepare build and release for a2 by @iforgetmyname in #273
- fix a2 deepep doc by @zuje123 in #279
- Fix the issue of HCCL buffer tiling verification failure during one round of testing. by @oagniqgnat in #280
- l2 norm const parameter change by @shengzhaotian in #276
- bump version to 2025.12.25 by @iforgetmyname in #281
- modify split_qkv_rmsnorm_rope by @Liwansi in #282
- Fix the performance degradation issue of the single-wheel operation in Ant Moving. by @oagniqgnat in #287
- Optimize prepare_lens by removing device transfer by @shengzhaotian in #289
- split_qkv_rmsnorm_rope bugfix by @Liwansi in #290
- fix notify magic auto-increment bug by @zuje123 in #291
- Resolving the UB out-of-bounds issue caused by A2 dual-machine mixed operation by @oagniqgnat in #288
- fix a2 single combine aclnn params by @zuje123 in #292
- LoRA: Optimization LoRA kernels and refactoring by @vlserov in #284
- Support build with cann 8.5 by @BourneSun0527 in #283
- Added an environment variable to control whether to enable the Combine Ant Migration feature. by @oagniqgnat in #304
- Supplement A2 doc, software and hardware compatibility info by @zuje123 in #294
- fix layout numTokensPerExpertTensor partial Initialization bug by @zuje123 in #303
- optimize gdn gating and fused_qkvzba_split_reshape_cat by @RuixuanZhang06 in #306
- support add_gemma_rms_norm by @RuixuanZhang06 in #310
- feat:add performance compare by @Yael-X in #311
- [chore] version bump to 2026.01.12 by @iforgetmyname in #312
- Add swiglu_oai_triton for GPTOSS by @Todobe in #270
- Optimize sinks attention for prefix cache by @Todobe in #260
- fix little batchsize and int8 quant on ci by @zhuyutong332 in #302
- fix bmm transpose in cann 8.5 by @randgun in #316
- Modify contribution guide by @BourneSun0527 in #315
- Integrate ccache for faster compilation by @randgun in #318
- add dfx for operator FusedDeepMoe by @wangyibo1005 in #317
- [Chore] CANN version bump to 8.5.0 by @iforgetmyname in #326
- Deepep adapt custom cann installation path by @BourneSun0527 in #327
- Optimize the performance of the Combine Ant Moving function and the use of HCCL buffer by @oagniqgnat in #314
- remove the limit that A2 internode only support topk 8 by @luanyundu in #323
- add deepep normal api doc by @zuje123 in #336
- 【Doc】add fused deep moe doc by @kaniel-outis in #335
- Document get_dispatch_layout API by @luanyundu in #338
- Added the verification of num_max_dispatch_tokens_per_rank to the decode operator adaptation layer. by @oagniqgnat in #330
- Added the low_latency operator API documentation. by @oagniqgnat in #337
- The environment variable DEEPEP_HCCL_BUFFSIZE is added by @zzx-study in #329
- chunk_gated_delta_rule_npu output final state by @RuixuanZhang06 in #341
- support the situation that topk maybe -1 on machine A3 by @luanyundu in #313
- Add AscendC triangular inverse by @zouzias in #332
- (test) add solve_tril from upstream by @zouzias in #339
- [Doc] Improved README.md content and English grammar and integrated the DeepWiki badge for Ask AI by @Mitchell-xiyunfeng in #345
- add function for deep-ep tests by @zhuyutong332 in #301
- Support x86_64 and aarch64 binary release by @iforgetmyname in #325
- Add scripts for building CMake files by @1329009851 in #344
- Revert "Add scripts for building CMake files" by @1329009851 in #353
- Modify the description of DeepEP in the README file. by @oagniqgnat in #348
- [Bugfix] Fix build script working with cann 8.5.0 by @iforgetmyname in #354
- fix the hanging bug by @luanyundu in #355
- Modify notifydispatch to support DEEPEP_NORMAL_LONG_SEQ_ROUND up to 128. by @WSEmma in #352
- release follows naming convention by @iforgetmyname in #356
- Cover the workflows cases on a3 by @zhuyutong332 in #321
- [Chore] Bump sgl-kernel-npu version to 2026.02.01 by @iforgetmyname in #359
- [Bugfix] Fix mismatched package building directory by @iforgetmyname in #360
New Contributors
- @zhuyijie88 made their first contribution in #259
- @zhuyutong332 made their first contribution in #302
- @zzx-study made their first contribution in #329
- @zouzias made their first contribution in #332
- @Mitchell-xiyunfeng made their first contribution in #345
- @1329009851 made their first contribution in #344
- @WSEmma made their first contribution in #352
Full Changelog: 2025120...2026.02.01
2026.02.01.rc1
What's Changed
- Add scripts for building CMake files by @1329009851 in #344
- Revert "Add scripts for building CMake files" by @1329009851 in #353
- Modify the description of DeepEP in the README file. by @oagniqgnat in #348
- [Bugfix] Fix build script working with cann 8.5.0 by @iforgetmyname in #354
- fix the hanging bug by @luanyundu in #355
- Modify notifydispatch to support DEEPEP_NORMAL_LONG_SEQ_ROUND up to 128. by @WSEmma in #352
- release follows naming convention by @iforgetmyname in #356
New Contributors
- @1329009851 made their first contribution in #344
- @WSEmma made their first contribution in #352
Full Changelog: 2026.01.28...2026.02.01.rc1
2026.01.28
What's Changed
- Added the low_latency operator API documentation. by @oagniqgnat in #337
- The environment variable DEEPEP_HCCL_BUFFSIZE is added by @zzx-study in #329
- chunk_gated_delta_rule_npu output final state by @RuixuanZhang06 in #341
- support the situation that topk maybe -1 on machine A3 by @luanyundu in #313
- Add AscendC triangular inverse by @zouzias in #332
- (test) add solve_tril from upstream by @zouzias in #339
- [Doc] Improved README.md content and English grammar and integrated the DeepWiki badge for Ask AI by @Mitchell-xiyunfeng in #345
- add function for deep-ep tests by @zhuyutong332 in #301
- Support x86_64 and aarch64 binary release by @iforgetmyname in #325
New Contributors
- @zzx-study made their first contribution in #329
- @zouzias made their first contribution in #332
- @Mitchell-xiyunfeng made their first contribution in #345
Full Changelog: 2026.01.21...2026.01.28
2026.01.21
Added the verification of num_max_dispatch_tokens_per_rank to the dec…
2026.01.19
What's Changed
- Add swiglu_oai_triton for GPTOSS by @Todobe in #270
- Optimize sinks attention for prefix cache by @Todobe in #260
- fix little batchsize and int8 quant on ci by @zhuyutong332 in #302
- fix bmm transpose in cann 8.5 by @randgun in #316
- Modify contribution guide by @BourneSun0527 in #315
- Integrate ccache for faster compilation by @randgun in #318
- add dfx for operator FusedDeepMoe by @wangyibo1005 in #317
- [Chore] CANN version bump to 8.5.0 by @iforgetmyname in #326
New Contributors
- @zhuyutong332 made their first contribution in #302
Full Changelog: 2026.01.12...2026.01.19
2026.01.12
What's Changed
- support add_gemma_rms_norm by @RuixuanZhang06 in #310
- feat:add performance compare by @Yael-X in #311
- [chore] version bump to 2026.01.12 by @iforgetmyname in #312
Full Changelog: 2026.01.09...2026.01.12
2026.01.09
What's Changed
- optimize gdn gating and fused_qkvzba_split_reshape_cat by @RuixuanZhang06 in #306
Full Changelog: 2026.01.07...2026.01.09
2026.01.07
What's Changed
- LoRA: Optimization LoRA kernels and refactoring by @vlserov in #284
- Support build with cann 8.5 by @BourneSun0527 in #283
- Added an environment variable to control whether to enable the Combine Ant Migration feature. by @oagniqgnat in #304
- Supplement A2 doc, software and hardware compatibility info by @zuje123 in #294
- fix layout numTokensPerExpertTensor partial Initialization bug by @zuje123 in #303
Full Changelog: 2025.12.31...2026.01.07