SGLang-Kernel-NPU is the official kernel library of the SGLang framework for Ascend NPU. It delivers high-performance, production-ready compute primitives optimized for large language model (LLM) inference on Ascend hardware.
The library consists of two main components:
- DeepEP-Ascend: Ascend implementation of DeepEP, providing highly optimized Expert Parallelism (EP) communication kernels for Mixture-of-Experts (MoE) models.
- SGLang-Kernel-NPU: A comprehensive collection of optimized inference kernels including attention mechanisms, normalization, activation functions, LoRA adapters, and more.
For contribution guidelines, please refer to the Contribution Guide.
DeepEP-Ascend provides optimized all-to-all communication kernels for Expert Parallelism in MoE models.
Communication Modes:
- Normal Mode: High-throughput dispatch and combine operations for training and prefill phases (up to 65536 tokens/batch for A3 and 8192 tokens/batch for A2)
- Low-Latency Mode: Optimized for production inference with small batch sizes (128 tokens/batch), achieving sub-150us latency
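As a minimal sketch of a low-latency decode step, assuming DeepEP-Ascend keeps the upstream `deep_ep` Python interface (the exact signatures and return values here are an assumption, and `run_local_experts` is a hypothetical stand-in for the local expert computation):

```python
import torch
from deep_ep import Buffer  # assumes the Ascend build exposes the upstream module name

def low_latency_step(buffer: Buffer, x, topk_idx, topk_weights, num_experts):
    # x: [num_tokens, hidden]; topk_idx/topk_weights: [num_tokens, topk]
    # dispatch tokens to their experts across ranks (quantized packing happens inside)
    recv_x, recv_count, handle, _, _ = buffer.low_latency_dispatch(
        x, topk_idx, num_max_dispatch_tokens_per_rank=128, num_experts=num_experts)
    expert_out = run_local_experts(recv_x, recv_count)  # hypothetical expert MLPs
    # reverse path: gather expert outputs and reduce them with the routing weights
    combined, _, _ = buffer.low_latency_combine(
        expert_out, topk_idx, topk_weights, handle)
    return combined
```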
Key Capabilities:
- Token dispatch and combine with automatic load balancing
- Fused MoE computation (`fused_deep_moe`)
- A3 full-mesh HCCS communication and A2 intranode HCCS + internode RDMA communication
- INT8/FP8/BF16 quantization for reduced memory bandwidth
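A normal-mode sketch for large prefill batches, under the same assumption that the upstream `deep_ep` dispatch/combine interface carries over (`run_local_experts` is again a hypothetical placeholder):

```python
def normal_mode_step(buffer, x, topk_idx, topk_weights, num_experts):
    # compute how many tokens each rank/expert receives for this batch
    num_per_rank, num_per_rdma_rank, num_per_expert, in_rank, _ = \
        buffer.get_dispatch_layout(topk_idx, num_experts)
    # all-to-all dispatch: each rank receives the tokens routed to its local experts
    recv_x, recv_idx, recv_weights, recv_per_expert, handle, _ = buffer.dispatch(
        x, topk_idx=topk_idx, topk_weights=topk_weights,
        num_tokens_per_rank=num_per_rank, num_tokens_per_rdma_rank=num_per_rdma_rank,
        is_token_in_rank=in_rank, num_tokens_per_expert=num_per_expert)
    expert_out = run_local_experts(recv_x, recv_per_expert)  # hypothetical expert MLPs
    # reverse all-to-all: expert outputs are weighted-summed back at each token's home rank
    combined, _, _ = buffer.combine(expert_out, handle, topk_weights=recv_weights)
    return combined
```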
SGLang-Kernel-NPU provides a comprehensive set of optimized inference kernels:
Attention:
- Multi-Latent Attention (MLA) with Paged KV Cache support
- Grouped Query Attention (GQA)
- Decode Attention with optimized memory access patterns
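For reference, GQA simply shares each KV head across a group of query heads; a plain PyTorch equivalent of the computation (shapes are illustrative, and the kernel avoids materializing the repeated KV heads):

```python
import torch
import torch.nn.functional as F

def gqa_ref(q, k, v):
    # q: [num_heads, q_len, head_dim]; k, v: [num_kv_heads, kv_len, head_dim]
    groups = q.shape[0] // k.shape[0]        # query heads served by each KV head
    k = k.repeat_interleave(groups, dim=0)   # materialized copy; the kernel indexes instead
    v = v.repeat_interleave(groups, dim=0)
    return F.scaled_dot_product_attention(q, k, v)
```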
Flash Linear Attention (FLA):
- Gated Delta Rule implementation
- Chunk-based operations for efficient memory usage
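The chunked kernel is mathematically equivalent to a per-step recurrence. A sequential PyTorch reference of one common gated-delta-rule formulation, S_t = α_t (S_{t−1} − β_t k_t k_tᵀ S_{t−1}) + β_t k_t v_tᵀ, offered as a sketch rather than the kernel's exact semantics:

```python
import torch

def gated_delta_rule_ref(q, k, v, alpha, beta):
    # q, k: [T, d_k]; v: [T, d_v]; alpha (decay) and beta (write strength): [T]
    T, d_k = k.shape
    S = q.new_zeros(d_k, v.shape[1])   # recurrent associative state
    out = q.new_empty(T, v.shape[1])
    for t in range(T):
        kt, vt = k[t], v[t]
        # decay the delta-rule-corrected old state, then write the new key/value pair
        S = alpha[t] * (S - beta[t] * torch.outer(kt, kt @ S)) + beta[t] * torch.outer(kt, vt)
        out[t] = q[t] @ S
    return out
```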
Normalization:
- RMSNorm
- Fused Add + RMSNorm + Bias
- Split QKV + RMSNorm + RoPE fusion
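The unfused reference semantics in plain PyTorch (the fused variants compute the same result in a single memory pass; the bias placement below is an assumption):

```python
import torch

def rms_norm_ref(x, weight, eps=1e-6):
    # y = x / sqrt(mean(x^2) + eps) * weight
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight

def fused_add_rms_norm_bias_ref(x, residual, weight, bias, eps=1e-6):
    # residual add folded into the norm; returns the normed output and the new residual
    h = x + residual
    return rms_norm_ref(h, weight, eps) + bias, h  # bias placement assumed post-norm
```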
Activation Functions:
- SwiGLU
- Quantized SwiGLU (INT8)
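SwiGLU reference semantics; the INT8 variant additionally quantizes the output, and the per-token symmetric scheme below is an assumption for illustration:

```python
import torch
import torch.nn.functional as F

def swiglu_ref(gate_up):
    # input is the concatenated gate/up projection: [..., 2 * d]
    gate, up = gate_up.chunk(2, dim=-1)
    return F.silu(gate) * up

def swiglu_quant_ref(gate_up):
    # SwiGLU followed by per-token symmetric INT8 quantization (assumed scheme)
    y = swiglu_ref(gate_up)
    scale = (y.abs().amax(dim=-1, keepdim=True) / 127.0).clamp_min(1e-8)
    return torch.round(y / scale).to(torch.int8), scale
```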
LoRA Adapters:
- BGMV expand/shrink
- SGMV expand/shrink
- SGEMMV expand/shrink
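These follow the Punica-style batched gather GEMV/GEMM pattern: each token selects its own adapter through an index vector. A naive per-token reference (the shapes and the combined shrink+expand form are illustrative; the kernels run the gathered matmuls batched):

```python
import torch

def bgmv_shrink_expand_ref(x, lora_a, lora_b, indices, scale=1.0):
    # x: [N, d_in]; lora_a: [num_adapters, d_in, r]; lora_b: [num_adapters, r, d_out]
    # indices[i] picks the adapter for token i
    out = x.new_zeros(x.shape[0], lora_b.shape[-1])
    for i in range(x.shape[0]):
        a, b = lora_a[indices[i]], lora_b[indices[i]]
        out[i] = scale * (x[i] @ a) @ b  # shrink (d_in -> r), then expand (r -> d_out)
    return out
```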
Speculative Decoding:
- Efficient tree building
- Greedy tree verification
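To illustrate greedy verification (a sketch of the idea, not the kernel's interface): a draft node is accepted when its parent is accepted and its token equals the target model's greedy prediction at the parent, and the deepest accepted root-to-leaf prefix is kept:

```python
def greedy_tree_verify(parent, draft_tokens, target_greedy):
    # parent[i]: parent index of node i (-1 for the root); nodes in topological order
    # draft_tokens[i]: token proposed by the draft model at node i
    # target_greedy[i]: target model's argmax token at node i's position
    n = len(parent)
    accepted, depth, best = [False] * n, [0] * n, -1
    for i in range(n):
        p = parent[i]
        if p == -1:
            accepted[i] = True  # the root token is already committed
        elif accepted[p] and draft_tokens[i] == target_greedy[p]:
            accepted[i] = True
            depth[i] = depth[p] + 1
        if accepted[i] and (best == -1 or depth[i] > depth[best]):
            best = i
    path = []
    while best != -1:  # walk back up from the deepest accepted node
        path.append(best)
        best = parent[best]
    return path[::-1]  # accepted node indices, root first
```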
MLA Preprocessing:
- End-to-end fusion: RMSNorm → Dequant → MatMul → RoPE → ReshapeAndCache
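Unfused, the pipeline looks roughly like the following; the helper shapes, the dequant placement, and the cache layout are all assumptions for illustration, and the kernel performs every step in one launch:

```python
import torch

def rope_ref(x, cos, sin):
    # rotate-half rotary embedding on the last dimension
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((x1 * cos - x2 * sin, x2 * cos + x1 * sin), dim=-1)

def mla_preprocess_ref(x, gamma, w_int8, w_scale, cos, sin, cache, slots, eps=1e-6):
    h = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * gamma  # RMSNorm
    w = w_int8.float() * w_scale                                        # dequant (assumed: weights)
    y = h @ w                                                           # matmul
    y = rope_ref(y, cos, sin)                                           # RoPE
    cache[slots] = y                                                    # ReshapeAndCache
    return y
```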
Mamba Support:
- Causal Conv1D for state space models (SSM)
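Reference semantics of the depthwise causal convolution (the kernel additionally manages the rolling state needed for incremental decode):

```python
import torch
import torch.nn.functional as F

def causal_conv1d_ref(x, weight, bias=None):
    # x: [batch, channels, seqlen]; weight: [channels, kernel_size], one filter per channel
    # left-pad so the output at position t depends only on inputs at positions <= t
    channels, k = weight.shape
    x = F.pad(x, (k - 1, 0))
    return F.conv1d(x, weight.unsqueeze(1), bias, groups=channels)
```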
KV Cache Management:
- Paged Attention support
- Cache location assignment and update
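Cache-location assignment follows the usual paged-KV convention: a flat slot id splits into a block index and an in-block offset. A reference of the update step (the cache layout is illustrative):

```python
def write_kv_ref(key, value, k_cache, v_cache, slot_mapping, block_size):
    # key/value: [num_tokens, num_kv_heads, head_dim]
    # k_cache/v_cache: [num_blocks, block_size, num_kv_heads, head_dim]
    block = slot_mapping // block_size   # which page each token lands in
    offset = slot_mapping % block_size   # position inside the page
    k_cache[block, offset] = key
    v_cache[block, offset] = value
```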
Other Utilities:
- Lightning Indexer for sparse Top-K indexing
- Triangular matrix inverse
- Batch MatMul with transpose
- DeepEP-Ascend: Ascend implementation of DeepEP. README
- SGLang-Kernel-NPU: other SGLang kernels for Ascend NPU. README
We test the normal-mode kernels on an A3 384 SuperPOD, following the DeepSeek-V3/R1 pretraining setting (4096 tokens per batch, 7168 hidden size, top-8 experts, INT8 dispatching and BF16 combining).
| Type | Dispatch #EP | Bottleneck bandwidth | Combine #EP | Bottleneck bandwidth |
|---|---|---|---|---|
| Intranode | 8 | 146 GB/s (HCCS) | 8 | 125 GB/s (HCCS) |
| Intranode | 16 | 107 GB/s (HCCS) | 16 | 103 GB/s (HCCS) |
| Intranode | 32 | 102 GB/s (HCCS) | 32 | 95 GB/s (HCCS) |
| Intranode | 64 | 81 GB/s (HCCS) | 64 | 91 GB/s (HCCS) |
| Intranode | 128 | 57 GB/s (HCCS) | 128 | 81 GB/s (HCCS) |
We test the low-latency kernels on an A3 384 SuperPOD, following a typical DeepSeek-V3/R1 production setting (128 tokens per batch, 7168 hidden size, top-8 experts, INT8 dispatching and BF16 combining).
| Dispatch #EP | Latency | Bandwidth | Combine #EP | Latency | Bandwidth |
|---|---|---|---|---|---|
| 8 | 132 us | 58 GB/s (HCCS) | 8 | 126 us | 116 GB/s (HCCS) |
| 16 | 139 us | 55 GB/s (HCCS) | 16 | 135 us | 109 GB/s (HCCS) |
| 32 | 153 us | 49 GB/s (HCCS) | 32 | 151 us | 97 GB/s (HCCS) |
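Bandwidth here is payload over latency. A quick sanity check against the EP=8 combine row (the dispatch rows also move per-token routing metadata and scales, so a bare-payload estimate for them runs slightly low):

```python
tokens, hidden, topk = 128, 7168, 8
combine_bytes = tokens * topk * hidden * 2   # BF16 combine payload per rank
print(combine_bytes / 126e-6 / 1e9)          # ~116.5 GB/s, matching the table
```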