Commit eebe663 — Merge branch 'main' into feat/rollout-logprob-support
2 parents a0d4a56 + 35ef38c

553 files changed: +48,471 −19,391 lines
.claude/skills/add-jit-kernel/SKILL.md
440 additions & 157 deletions (large diff not rendered by default)

.claude/skills/add-sgl-kernel/SKILL.md
224 additions & 83 deletions (large diff not rendered by default)

New file: 111 additions & 0 deletions
---
name: use-efficient-diffusion-kernels
description: Guidance for using SGLang Diffusion fused kernels and fast CUDA paths. Use when mapping fusion patterns in diffusion inference, choosing fused ops or attention backends, handling RoPE/QK norm performance pitfalls, or integrating new diffusion models with kernel-aware constraints.
---

# Use Efficient Diffusion Kernels

**Overview**

This skill focuses on SGLang Diffusion (`sglang.multimodal_gen`) kernel fusion patterns and fast CUDA paths. Prefer existing fused ops (Triton, CuTe DSL, sgl-kernel). Make constraints and fallbacks explicit.

**Key Files**

- `python/sglang/multimodal_gen/runtime/layers/layernorm.py`
- `python/sglang/multimodal_gen/runtime/layers/elementwise.py`
- `python/sglang/multimodal_gen/runtime/layers/rotary_embedding/utils.py`
- `python/sglang/jit_kernel/diffusion/triton/scale_shift.py`
- `python/sglang/jit_kernel/diffusion/triton/norm.py`
- `python/sglang/jit_kernel/diffusion/triton/rmsnorm_onepass.py`
- `python/sglang/jit_kernel/diffusion/triton/rotary.py`
- `python/sglang/jit_kernel/diffusion/cutedsl/scale_residual_norm_scale_shift.py`
- `python/sglang/jit_kernel/norm.py`
- `python/sglang/multimodal_gen/runtime/platforms/cuda.py`
- `python/sglang/multimodal_gen/runtime/layers/attention/selector.py`
- `docs/diffusion/performance/attention_backends.md`

**Core Fusion Patterns**

1. Scale/Shift elementwise fusion (AdaLN modulation)
   - Kernels: `fuse_scale_shift_kernel`, `fuse_scale_shift_gate_select01_kernel`
   - Locations: `elementwise.py`, `layernorm.py`, `qwen_image.py`, `triton/scale_shift.py`
   - Use cases: `x * (1 + scale) + shift` and `a * (k + b) + c`
   - Constraints: `x` must be CUDA and contiguous. `scale`/`shift` support 0D/1D/2D/3D/4D broadcast. 4D `[B, F, 1, C]` requires `L % F == 0`.
   - NPU fallback: `scale_shift.py` swaps to the `npu_fallback` native path.
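As a semantics-only sketch (plain Python over lists, not the Triton kernel itself), the AdaLN modulation in pattern 1 computes the following; the real path is `fuse_scale_shift_kernel` on contiguous CUDA tensors:

```python
def scale_shift_reference(x, scale, shift):
    """Reference semantics of the fused AdaLN modulation:
    y = x * (1 + scale) + shift, elementwise over one row.
    Readable stand-in only; the fused Triton kernel avoids the
    intermediate tensors this expression would materialize."""
    return [xi * (1.0 + s) + t for xi, s, t in zip(x, scale, shift)]


def scale_shift_scalar(x, scale, shift):
    """The 0-D broadcast case: one scalar scale/shift for the row."""
    return [xi * (1.0 + scale) + shift for xi in x]
```

The point of the fusion is that the three elementwise ops run in a single kernel launch instead of three, with no temporaries.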
2. Norm + Scale/Shift fusion (CuTe DSL)
   - Kernels: `fused_norm_scale_shift`, `fused_scale_residual_norm_scale_shift`
   - Locations: `layernorm.py`, `cutedsl/scale_residual_norm_scale_shift.py`
   - Use cases:
     - `y = norm(x) * (1 + scale) + shift`
     - `y = norm(residual + gate * x) * (1 + scale) + shift`
   - Constraints: `D % 256 == 0` and `D <= 8192`. `x`/`residual`/`gate`/`scale`/`shift` must pass shape and stride validation. Dtypes limited to fp16/bf16/fp32.
   - Behavior: CuTe DSL compilation is cached by `(dtype, ndim, D, norm_type)`. `None` tensors are replaced by scalar placeholders. If constraints fail, `layernorm.py` warns and falls back to native PyTorch.
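For the second use case, a plain-Python reference (scalar `gate`/`scale`/`shift` for brevity; the fused kernel also accepts broadcastable tensors and enforces the `D % 256 == 0`, `D <= 8192` constraints) may help when validating a new model against the fused path:

```python
import math


def rms_norm(x, weight, eps=1e-6):
    # RMSNorm over one row: x / sqrt(mean(x^2) + eps) * weight.
    ms = sum(v * v for v in x) / len(x)
    inv = 1.0 / math.sqrt(ms + eps)
    return [v * inv * w for v, w in zip(x, weight)]


def scale_residual_norm_scale_shift_ref(x, residual, gate, weight, scale, shift):
    """Reference semantics of the CuTe DSL fusion:
    y = norm(residual + gate * x) * (1 + scale) + shift."""
    h = [r + gate * v for r, v in zip(residual, x)]
    n = rms_norm(h, weight)
    return [v * (1.0 + scale) + shift for v in n]
```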
3. Triton LayerNorm/RMSNorm fusion
   - Kernels: `rms_norm_fn`, `layer_norm_fn`, `norm_infer`
   - Locations: `triton/norm.py`, `layernorm.py`
   - Use cases: fp32 RMSNorm with residual/dropout/rowscale/x1 branches, and the inference-friendly `norm_infer`.
   - Constraints: the last dim must be contiguous, and `N * element_size < 64KB`.

4. Triton one-pass RMSNorm (small hidden size fast path)
   - Kernel: `triton_one_pass_rms_norm`
   - Locations: `triton/rmsnorm_onepass.py`, `layernorm.py`
   - Use case: `hidden_size <= 128` in `RMSNorm.forward_cuda`.

5. Triton RoPE fusion
   - Kernel: `apply_rotary_embedding`
   - Locations: `triton/rotary.py`, `rotary_embedding/utils.py`
   - Use case: GPT-J style RoPE when not Neox.
   - Constraints: `head_size` must be even.
   - NPU fallback: `npu_fallback.apply_rotary_embedding_native`.
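GPT-J style RoPE rotates adjacent (interleaved) pairs of a head vector, which is why `head_size` must be even. A readable stand-in for one head vector (not the fused Triton kernel):

```python
def apply_rotary_gptj_ref(x, cos, sin):
    """Reference for GPT-J-style (interleaved) RoPE on one head
    vector: each pair (x[2i], x[2i+1]) is rotated by the angle whose
    cos/sin are at index i. Neox style instead pairs x[i] with
    x[i + head_size // 2]."""
    out = []
    for i in range(0, len(x), 2):
        c, s = cos[i // 2], sin[i // 2]
        out.append(x[i] * c - x[i + 1] * s)
        out.append(x[i] * s + x[i + 1] * c)
    return out
```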
**Faster CUDA Kernel Usage Points**

1. sgl-kernel RMSNorm and fused add RMSNorm
   - Location: `layernorm.py`
   - Behavior: CUDA uses `sgl_kernel.fused_add_rmsnorm` and `sgl_kernel.rmsnorm`. `hidden_size <= 128` uses the Triton one-pass kernel. ROCm falls back to native.

2. Attention backend selection (FlashAttention, Sage, SDPA)
   - Locations: `platforms/cuda.py`, `attention/selector.py`, `docs/diffusion/performance/attention_backends.md`
   - Behavior: CUDA prefers FlashAttention (FA3/FA4) when supported, otherwise Torch SDPA. Force a backend via `--attention-backend` or `global_force_attn_backend`.

3. FlashInfer RoPE (Q/K inplace)
   - Location: `rotary_embedding/utils.py`
   - Behavior: `flashinfer.rope.apply_rope_with_cos_sin_cache_inplace` when available, otherwise the Triton RoPE fallback.

**QK Norm Optimization**

- Entry point: `apply_qk_norm` in `layernorm.py`.
- Fast path: JIT fused inplace QK norm from `python/sglang/jit_kernel/norm.py` via `fused_inplace_qknorm`.
- Preconditions for the fused path:
  - CUDA only.
  - `allow_inplace=True` and `q_eps == k_eps`.
  - `can_use_fused_inplace_qknorm(head_dim, dtype)` returns true.
  - Supported head dims: `64, 128, 256, 512, 1024`.
- Behavior: the fused path operates on `q` and `k` in place after reshaping to `[B, -1, head_dim]`. If preconditions fail, fall back to per-tensor RMSNorm.
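The dispatch above can be sketched as a small predicate. This is an illustrative, hypothetical helper, not SGLang code; the real checks live in `apply_qk_norm` and `can_use_fused_inplace_qknorm` (which also inspects the dtype):

```python
SUPPORTED_QKNORM_HEAD_DIMS = {64, 128, 256, 512, 1024}


def pick_qknorm_path(is_cuda, allow_inplace, q_eps, k_eps, head_dim):
    """Illustrative dispatch mirroring the preconditions listed above.
    Returns the name of the path the real code would take."""
    if (is_cuda and allow_inplace and q_eps == k_eps
            and head_dim in SUPPORTED_QKNORM_HEAD_DIMS):
        return "fused_inplace_qknorm"
    return "per_tensor_rmsnorm"
```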
**Common Entry Points in Diffusion Models**

- AdaLN modulation: `LayerNormScaleShift`, `RMSNormScaleShift`, `ScaleResidual*` in `layernorm.py`.
- Qwen-Image gating: `fuse_scale_shift_gate_select01_kernel` in `qwen_image.py`.
- QK norm: `apply_qk_norm` used in `flux.py`, `flux_2.py`, `qwen_image.py`, `zimage.py`, `wanvideo.py`, `ltx_2.py`, `hunyuanvideo.py`.
- RoPE: `_apply_rotary_emb` prefers Triton; Q/K RoPE prefers FlashInfer when present.

**Constraints and Fallbacks**

- The `scale_shift` Triton path requires CUDA and contiguous `x`. NPU swaps to native.
- CuTe DSL fused norms require `D % 256 == 0` and `D <= 8192`.
- Triton norm kernels error on feature size `>= 64KB`.
- FlashAttention requires fp16/bf16 and SM80+; otherwise SDPA.
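A quick way to sanity-check a new model's hidden size against the CuTe constraints is a guard like the following (hypothetical helper; the real validation, including shape/stride checks, lives in `layernorm.py`, which warns and falls back to native PyTorch on failure):

```python
def cute_fused_norm_ok(D, dtype):
    """Mirror of the CuTe DSL fused-norm constraints listed above:
    D divisible by 256, D at most 8192, dtype fp16/bf16/fp32."""
    return D % 256 == 0 and D <= 8192 and dtype in ("fp16", "bf16", "fp32")
```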
**Integration Checklist for New Models**

1. Reuse `LayerNormScaleShift` or `ScaleResidual*` modules instead of re-implementing fusion logic.
2. Keep tensors contiguous and satisfy the D alignment (`% 256`) and size (`<= 8192`) constraints for CuTe fused paths.
3. Use `fuse_scale_shift_kernel` for AdaLN modulation and keep a PyTorch fallback.
4. Use `apply_qk_norm` and ensure `head_dim` is in the supported list for fused QK norm.
5. If using FlashInfer RoPE, avoid packed QKV and ensure Q/K are contiguous.
6. For attention, follow the `selector.py` priority; override with the CLI only if needed.

**When Extending or Modifying Kernels**

- Add `torch.library.custom_op` and `register_fake` for compile and meta support.
- Keep CuTe compile cache keys aligned to `(dtype, ndim, D)`.
- Avoid implicit broadcasts that force hidden `contiguous()` copies.
- Preserve NPU and ROCm fallback paths.
New file: 248 additions & 0 deletions
---
name: write-sglang-test
description: Guide for writing SGLang CI/UT tests following project conventions. Covers CustomTestCase, CI registration, server fixtures, model selection, and test placement. Use when creating new tests, adding CI test cases, writing unit tests, or when the user asks to add tests for SGLang features.
---

# Writing SGLang CI / UT Tests

## Core Rules

1. **Always use `CustomTestCase`** — never raw `unittest.TestCase`
2. **Place tests in `test/registered/<category>/`** — only use `test/manual/` for debugging / non-CI tests
3. **Reuse server fixtures** — inherit from `DefaultServerBase` or write `setUpClass`/`tearDownClass` with `popen_launch_server`
4. **Smallest model for model-agnostic functionality** — use `DEFAULT_SMALL_MODEL_NAME_FOR_TEST` (Llama-3.2-1B-Instruct) for basic features that don't depend on model size
5. **8B for general performance** — use `DEFAULT_MODEL_NAME_FOR_TEST` (Llama-3.1-8B-Instruct, single-node) for performance tests that don't involve spec / DP / parallelism
6. **Bigger features → discuss case by case** — spec, DP attention, tensor/pipeline parallelism, etc. may need multi-GPU suites and specific models

---

## Test File Template

### Functional correctness test (small model)

```python
import unittest

import requests

from sglang.srt.utils import kill_process_tree
from sglang.test.ci.ci_register import register_cuda_ci
from sglang.test.test_utils import (
    DEFAULT_SMALL_MODEL_NAME_FOR_TEST,
    DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
    DEFAULT_URL_FOR_TEST,
    CustomTestCase,
    popen_launch_server,
)

register_cuda_ci(est_time=60, suite="stage-b-test-small-1-gpu")


class TestMyFeature(CustomTestCase):
    @classmethod
    def setUpClass(cls):
        cls.model = DEFAULT_SMALL_MODEL_NAME_FOR_TEST
        cls.base_url = DEFAULT_URL_FOR_TEST
        cls.process = popen_launch_server(
            cls.model,
            cls.base_url,
            timeout=DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
            other_args=["--arg1", "value1"],  # feature-specific args
        )

    @classmethod
    def tearDownClass(cls):
        kill_process_tree(cls.process.pid)

    def test_basic_functionality(self):
        response = requests.post(
            self.base_url + "/generate",
            json={"text": "Hello", "sampling_params": {"max_new_tokens": 32}},
        )
        self.assertEqual(response.status_code, 200)


if __name__ == "__main__":
    unittest.main(verbosity=3)
```

### General performance test (8B model, single node, no spec/DP/parallelism)

```python
import time
import unittest

import requests

from sglang.srt.utils import kill_process_tree
from sglang.test.ci.ci_register import register_cuda_ci
from sglang.test.test_utils import (
    DEFAULT_MODEL_NAME_FOR_TEST,
    DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
    DEFAULT_URL_FOR_TEST,
    CustomTestCase,
    popen_launch_server,
)

register_cuda_ci(est_time=300, suite="stage-b-test-large-1-gpu")


class TestMyFeaturePerf(CustomTestCase):
    @classmethod
    def setUpClass(cls):
        cls.model = DEFAULT_MODEL_NAME_FOR_TEST
        cls.base_url = DEFAULT_URL_FOR_TEST
        cls.process = popen_launch_server(
            cls.model,
            cls.base_url,
            timeout=DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
        )

    @classmethod
    def tearDownClass(cls):
        kill_process_tree(cls.process.pid)

    def test_latency(self):
        start = time.perf_counter()
        response = requests.post(
            self.base_url + "/generate",
            json={"text": "Hello", "sampling_params": {"max_new_tokens": 128}},
        )
        elapsed = time.perf_counter() - start
        self.assertEqual(response.status_code, 200)
        self.assertLess(elapsed, 5.0, "Latency exceeded threshold")


if __name__ == "__main__":
    unittest.main(verbosity=3)
```

---

## Server Fixture Reuse

For tests that only need a standard server, inherit from `DefaultServerBase` and override class attributes:

```python
from sglang.test.server_fixtures.default_fixture import DefaultServerBase
from sglang.test.test_utils import DEFAULT_SMALL_MODEL_NAME_FOR_TEST


class TestMyFeature(DefaultServerBase):
    model = DEFAULT_SMALL_MODEL_NAME_FOR_TEST
    other_args = ["--enable-my-feature"]

    def test_something(self):
        ...
```

Available fixtures in `python/sglang/test/server_fixtures/`:

| Fixture | Use case |
|---------|----------|
| `DefaultServerBase` | Standard single-server tests |
| `EagleServerBase` | EAGLE speculative decoding |
| `PDDisaggregationServerBase` | Disaggregated prefill/decode |
| `MMMUServerBase` | Multimodal VLM tests |

---

## CI Registration

Every test file in `test/registered/` **must** call a registration function at module level:

```python
from sglang.test.ci.ci_register import register_cuda_ci, register_amd_ci

register_cuda_ci(est_time=60, suite="stage-b-test-small-1-gpu")
register_amd_ci(est_time=60, suite="stage-b-test-small-1-gpu-amd")  # optional
```

Parameters:

- `est_time`: estimated runtime in seconds (used for CI partitioning)
- `suite`: which CI suite to run in (see below)
- `nightly=True`: for nightly-only tests (default `False` = per-commit)
- `disabled="reason"`: temporarily disable with an explanation

### Suite selection guide

**Default cases (1 GPU):**

| Scenario | Model | Suite |
|----------|-------|-------|
| Model-agnostic basic functionality | 1B (smallest) | `stage-b-test-small-1-gpu` |
| General performance (no spec/DP/parallelism) | 8B | `stage-b-test-large-1-gpu` |

**Bigger features (case by case):**

| Scenario | Suite |
|----------|-------|
| 2 GPU (e.g. TP=2) | `stage-b-test-large-2-gpu` |
| 4 GPU (H100) | `stage-c-test-4-gpu-h100` |
| 8 GPU (H200) | `stage-c-test-8-gpu-h200` |
| Nightly, 1 GPU | `nightly-1-gpu` |
| Nightly, 8 GPU | `nightly-8-gpu` |

For spec, DP attention, parallelism, disaggregation, etc., discuss with the team to determine the appropriate suite and GPU configuration.
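The default cases in the tables above can be summarized as a small mapping. This is a hypothetical helper for illustration only, not part of SGLang; real suite selection for multi-GPU features is a team decision:

```python
def pick_default_suite(needs_perf_8b, num_gpus=1, nightly=False):
    """Map the default (1- and 2-GPU) scenarios above to a suite name."""
    if nightly:
        return "nightly-8-gpu" if num_gpus == 8 else "nightly-1-gpu"
    if num_gpus == 1:
        return ("stage-b-test-large-1-gpu" if needs_perf_8b
                else "stage-b-test-small-1-gpu")
    if num_gpus == 2:
        return "stage-b-test-large-2-gpu"
    # 4-GPU (H100) and 8-GPU (H200) suites are chosen case by case.
    raise ValueError("discuss multi-GPU suite selection with the team")
```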
---

## Model Constants

All defined in `python/sglang/test/test_utils.py`:

| Constant | Model | When to use |
|----------|-------|-------------|
| `DEFAULT_SMALL_MODEL_NAME_FOR_TEST` | Llama-3.2-1B-Instruct | Model-agnostic basic functionality |
| `DEFAULT_SMALL_MODEL_NAME_FOR_TEST_BASE` | Llama-3.2-1B | Base (non-instruct) model tests |
| `DEFAULT_MODEL_NAME_FOR_TEST` | Llama-3.1-8B-Instruct | General performance (single node) |
| `DEFAULT_MOE_MODEL_NAME_FOR_TEST` | Mixtral-8x7B-Instruct | MoE-specific tests |
| `DEFAULT_SMALL_EMBEDDING_MODEL_NAME_FOR_TEST` | | Embedding tests |
| `DEFAULT_SMALL_VLM_MODEL_NAME_FOR_TEST` | | Vision-language tests |

---

## Test Placement

```
test/
├── registered/        # CI tests (auto-discovered by run_suite.py)
│   ├── sampling/      # test_penalty.py, test_sampling_params.py ...
│   ├── sessions/      # test_session_control.py ...
│   ├── openai_server/ # basic/, features/, validation/ ...
│   ├── spec/          # eagle/, utils/ ...
│   ├── models/        # model-specific accuracy tests
│   ├── perf/          # performance benchmarks
│   └── <category>/    # create a new category if needed
├── manual/            # Non-CI: debugging, one-off, manual verification
└── run_suite.py       # CI runner (scans registered/ only)
```

**Decision rule**: if the test should run in CI → `registered/`. If it is for local debugging or requires special hardware not in CI → `manual/`.

---

## Key Utilities

```python
from sglang.test.test_utils import (
    CustomTestCase,                     # base class with retry logic
    popen_launch_server,                # launch a server subprocess
    DEFAULT_URL_FOR_TEST,               # auto-configured base URL
    DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,  # 600s default
    run_bench_serving,                  # benchmark helper (launch + bench)
)
from sglang.srt.utils import kill_process_tree  # clean up the server
```

---

## Checklist

Before submitting a test:

- [ ] Inherits from `CustomTestCase` (not `unittest.TestCase`)
- [ ] Has a `register_*_ci(...)` call at module level
- [ ] Placed in `test/registered/<category>/`
- [ ] Model selection: smallest for model-agnostic features, 8B for general perf, case by case for other complex features
- [ ] `setUpClass` launches the server, `tearDownClass` kills it
- [ ] Has `if __name__ == "__main__": unittest.main(verbosity=3)`
- [ ] `est_time` is reasonable (measure locally)
