Releases · sgl-project/sgl-kernel-npu

15 Feb 17:00

ping1jing2

2026.02.01.post2

b7e88d6

2026.02.01.post2 Pre-release

What's Changed

Build the deepep package with the chip model included. by @oagniqgnat in #274
fix:buffer control by @Yael-X in #361
Revert "Build the deepep package with the chip model included." by @kaniel-outis in #363
reset ci -- run test mixed running for experts on a2. by @zhuyutong332 in #365
adapt ant moving to A2 single machine by @luanyundu in #362
Fix the bug when the total expert num is greater than 256 or the local expert num is less than 8 by @luanyundu in #364
CI execution requirements for separating a2 and a3 by @zhuyutong332 in #367
support qwen3.5 by @chenxu214 in #377
Update layernorm_gated.py by @chenxu214 in #378

New Contributors

@chenxu214 made their first contribution in #377

Full Changelog: 2026.02.01...2026.02.01.post2

Contributors

kaniel-outis, luanyundu, and 4 other contributors

Assets 10

11 Feb 15:01

iforgetmyname

2026.02.01.post1

c726cd8

2026.02.01.post1 Pre-release

What's Changed

Build the deepep package with the chip model included. by @oagniqgnat in #274
fix:buffer control by @Yael-X in #361
Revert "Build the deepep package with the chip model included." by @kaniel-outis in #363
reset ci -- run test mixed running for experts on a2. by @zhuyutong332 in #365
adapt ant moving to A2 single machine by @luanyundu in #362
Fix the bug when the total expert num is greater than 256 or the local expert num is less than 8 by @luanyundu in #364
CI execution requirements for separating a2 and a3 by @zhuyutong332 in #367
[WIP] GLM by @cen121212 in #373

New Contributors

@cen121212 made their first contribution in #373

Full Changelog: 2026.02.01...2026.02.01.post1

Contributors

kaniel-outis, luanyundu, and 4 other contributors

Assets 10

02 Feb 19:17

iforgetmyname

2026.02.01

ba46a30

2026.02.01 Latest

What's Changed

Fix notify dispatch cann8.3 by @oagniqgnat in #245
fix normal and low_latency layered rdma_data_size when mixed running by @zuje123 in #246
fixing release ci by @BourneSun0527 in #248
add a script for generalize test by @goosj in #131
sgl-kernel-npu add release version by @iforgetmyname in #253
modify md file by @BourneSun0527 in #255
add long sequence feature for normal deep_ep by @oagniqgnat in #254
[fix] fixup bug in conv1d_update_fn by @zhuyijie88 in #259
qwen3-next op optimize by @shengzhaotian in #257
[Bugfix] fix TorchNpuHelper rename bugs by @ltcs11 in #265
Fixing Chinese character encoding issues by @oagniqgnat in #275
Add the long-sequence ant migration feature for the prefill combine operator. by @oagniqgnat in #267
add deepep a2 doc by @zuje123 in #277
prepare build and release for a2 by @iforgetmyname in #273
fix a2 deepep doc by @zuje123 in #279
Fix the issue of HCCL buffer tiling verification failure during one round of testing. by @oagniqgnat in #280
l2 norm const parameter change by @shengzhaotian in #276
bump version to 2025.12.25 by @iforgetmyname in #281
modify split_qkv_rmsnorm_rope by @Liwansi in #282
Fix the performance degradation issue of the single-wheel operation in Ant Moving. by @oagniqgnat in #287
Optimize prepare_lens by removing device transfer by @shengzhaotian in #289
split_qkv_rmsnorm_rope bugfix by @Liwansi in #290
fix notify magic auto-increment bug by @zuje123 in #291
Resolving the UB out-of-bounds issue caused by A2 dual-machine mixed operation by @oagniqgnat in #288
fix a2 single combine aclnn params by @zuje123 in #292
LoRA: LoRA kernel optimization and refactoring by @vlserov in #284
Support build with cann 8.5 by @BourneSun0527 in #283
Added an environment variable to control whether to enable the Combine Ant Migration feature. by @oagniqgnat in #304
Supplement A2 doc, software and hardware compatibility info by @zuje123 in #294
fix layout numTokensPerExpertTensor partial Initialization bug by @zuje123 in #303
optimize gdn gating and fused_qkvzba_split_reshape_cat by @RuixuanZhang06 in #306
support add_gemma_rms_norm by @RuixuanZhang06 in #310
feat:add performance compare by @Yael-X in #311
[chore] version bump to 2026.01.12 by @iforgetmyname in #312
Add swiglu_oai_triton for GPTOSS by @Todobe in #270
Optimize sinks attention for prefix cache by @Todobe in #260
fix little batchsize and int8 quant on ci by @zhuyutong332 in #302
fix bmm transpose in cann 8.5 by @randgun in #316
Modify contribution guide by @BourneSun0527 in #315
Integrate ccache for faster compilation by @randgun in #318
add dfx for operator FusedDeepMoe by @wangyibo1005 in #317
[Chore] CANN version bump to 8.5.0 by @iforgetmyname in #326
Deepep adapt custom cann installation path by @BourneSun0527 in #327
Optimize the performance of the Combine Ant Moving function and the use of HCCL buffer by @oagniqgnat in #314
remove the limit that A2 internode only support topk 8 by @luanyundu in #323
add deepep normal api doc by @zuje123 in #336
[Doc] add fused deep moe doc by @kaniel-outis in #335
Document get_dispatch_layout API by @luanyundu in #338
Added the verification of num_max_dispatch_tokens_per_rank to the decode operator adaptation layer. by @oagniqgnat in #330
Added the low_latency operator API documentation. by @oagniqgnat in #337
The environment variable DEEPEP_HCCL_BUFFSIZE is added by @zzx-study in #329
chunk_gated_delta_rule_npu output final state by @RuixuanZhang06 in #341
support the situation that topk may be -1 on machine A3 by @luanyundu in #313
Add AscendC triangular inverse by @zouzias in #332
(test) add solve_tril from upstream by @zouzias in #339
[Doc] Improved README.md content and English grammar and integrated the DeepWiki badge for Ask AI by @Mitchell-xiyunfeng in #345
add function for deep-ep tests by @zhuyutong332 in #301
Support x86_64 and aarch64 binary release by @iforgetmyname in #325
Add scripts for building CMake files by @1329009851 in #344
Revert "Add scripts for building CMake files" by @1329009851 in #353
Modify the description of DeepEP in the README file. by @oagniqgnat in #348
[Bugfix] Fix build script working with cann 8.5.0 by @iforgetmyname in #354
fix the hanging bug by @luanyundu in #355
Modify notifydispatch to support DEEPEP_NORMAL_LONG_SEQ_ROUND up to 128. by @WSEmma in #352
release follows naming convention by @iforgetmyname in #356
Cover the workflows cases on a3 by @zhuyutong332 in #321
[Chore] Bump sgl-kernel-npu version to 2026.02.01 by @iforgetmyname in #359
[Bugfix] Fix mismatched package building directory by @iforgetmyname in #360

New Contributors

@zhuyijie88 made their first contribution in #259
@zhuyutong332 made their first contribution in #302
@zzx-study made their first contribution in #329
@zouzias made their first contribution in #332
@Mitchell-xiyunfeng made their first contribution in #345
@1329009851 made their first contribution in #344
@WSEmma made their first contribution in #352

Full Changelog: 2025120...2026.02.01

Contributors

zouzias, ltcs11, and 21 other contributors

Assets 10

31 Jan 07:59

iforgetmyname

2026.02.01.rc1

da4ec43

2026.02.01.rc1 Pre-release

What's Changed

Add scripts for building CMake files by @1329009851 in #344
Revert "Add scripts for building CMake files" by @1329009851 in #353
Modify the description of DeepEP in the README file. by @oagniqgnat in #348
[Bugfix] Fix build script working with cann 8.5.0 by @iforgetmyname in #354
fix the hanging bug by @luanyundu in #355
Modify notifydispatch to support DEEPEP_NORMAL_LONG_SEQ_ROUND up to 128. by @WSEmma in #352
release follows naming convention by @iforgetmyname in #356

New Contributors

@1329009851 made their first contribution in #344
@WSEmma made their first contribution in #352

Full Changelog: 2026.01.28...2026.02.01.rc1

Contributors

iforgetmyname, WSEmma, and 3 other contributors

Assets 2

28 Jan 03:29

iforgetmyname

2026.01.28

2c77463

2026.01.28 Pre-release

What's Changed

Added the low_latency operator API documentation. by @oagniqgnat in #337
The environment variable DEEPEP_HCCL_BUFFSIZE is added by @zzx-study in #329
chunk_gated_delta_rule_npu output final state by @RuixuanZhang06 in #341
support the situation that topk may be -1 on machine A3 by @luanyundu in #313
Add AscendC triangular inverse by @zouzias in #332
(test) add solve_tril from upstream by @zouzias in #339
[Doc] Improved README.md content and English grammar and integrated the DeepWiki badge for Ask AI by @Mitchell-xiyunfeng in #345
add function for deep-ep tests by @zhuyutong332 in #301
Support x86_64 and aarch64 binary release by @iforgetmyname in #325

New Contributors

@zzx-study made their first contribution in #329
@zouzias made their first contribution in #332
@Mitchell-xiyunfeng made their first contribution in #345

Full Changelog: 2026.01.21...2026.01.28

Contributors

zouzias, iforgetmyname, and 6 other contributors

Assets 10

21 Jan 08:06

iforgetmyname

2026.01.21

46b73de

2026.01.21 Pre-release

Added the verification of num_max_dispatch_tokens_per_rank to the dec…

Assets 4

19 Jan 04:00

iforgetmyname

2026.01.19

38ad69d

2026.01.19 Pre-release

What's Changed

Add swiglu_oai_triton for GPTOSS by @Todobe in #270
Optimize sinks attention for prefix cache by @Todobe in #260
fix little batchsize and int8 quant on ci by @zhuyutong332 in #302
fix bmm transpose in cann 8.5 by @randgun in #316
Modify contribution guide by @BourneSun0527 in #315
Integrate ccache for faster compilation by @randgun in #318
add dfx for operator FusedDeepMoe by @wangyibo1005 in #317
[Chore] CANN version bump to 8.5.0 by @iforgetmyname in #326

New Contributors

@zhuyutong332 made their first contribution in #302

Full Changelog: 2026.01.12...2026.01.19

Contributors

iforgetmyname, Todobe, and 4 other contributors

Assets 4

12 Jan 11:22

iforgetmyname

2026.01.12

25542f2

2026.01.12 Pre-release

What's Changed

support add_gemma_rms_norm by @RuixuanZhang06 in #310
feat:add performance compare by @Yael-X in #311
[chore] version bump to 2026.01.12 by @iforgetmyname in #312

Full Changelog: 2026.01.09...2026.01.12

Contributors

iforgetmyname, RuixuanZhang06, and Yael-X

Assets 4

09 Jan 02:59

iforgetmyname

2026.01.09

ea4949d

2026.01.09 Pre-release

What's Changed

optimize gdn gating and fused_qkvzba_split_reshape_cat by @RuixuanZhang06 in #306

Full Changelog: 2026.01.07...2026.01.09

Contributors

RuixuanZhang06

Assets 4

07 Jan 08:09

iforgetmyname

2026.01.07

bacee3f

2026.01.07 Pre-release

What's Changed

LoRA: LoRA kernel optimization and refactoring by @vlserov in #284
Support build with cann 8.5 by @BourneSun0527 in #283
Added an environment variable to control whether to enable the Combine Ant Migration feature. by @oagniqgnat in #304
Supplement A2 doc, software and hardware compatibility info by @zuje123 in #294
fix layout numTokensPerExpertTensor partial Initialization bug by @zuje123 in #303

Full Changelog: 2025.12.31...2026.01.07

Contributors

oagniqgnat, vlserov, and 2 other contributors

Assets 4
