Commit eebe663 — Merge branch 'main' into feat/rollout-logprob-support
2 parents a0d4a56 + 35ef38c

553 files changed: +48,471 −19,391 lines
.claude/skills/add-jit-kernel/SKILL.md
440 additions & 157 deletions (large diff not rendered by default)

.claude/skills/add-sgl-kernel/SKILL.md
224 additions & 83 deletions (large diff not rendered by default)

New file: 111 additions & 0 deletions
---
name: use-efficient-diffusion-kernels
description: Guidance for using SGLang Diffusion fused kernels and fast CUDA paths. Use when mapping fusion patterns in diffusion inference, choosing fused ops or attention backends, handling RoPE/QK norm performance pitfalls, or integrating new diffusion models with kernel-aware constraints.
---

# Use Efficient Diffusion Kernels

**Overview**

This skill focuses on SGLang Diffusion (`sglang.multimodal_gen`) kernel fusion patterns and fast CUDA paths. Prefer existing fused ops (Triton, CuTe DSL, sgl-kernel). Make constraints and fallbacks explicit.

**Key Files**

- `python/sglang/multimodal_gen/runtime/layers/layernorm.py`
- `python/sglang/multimodal_gen/runtime/layers/elementwise.py`
- `python/sglang/multimodal_gen/runtime/layers/rotary_embedding/utils.py`
- `python/sglang/jit_kernel/diffusion/triton/scale_shift.py`
- `python/sglang/jit_kernel/diffusion/triton/norm.py`
- `python/sglang/jit_kernel/diffusion/triton/rmsnorm_onepass.py`
- `python/sglang/jit_kernel/diffusion/triton/rotary.py`
- `python/sglang/jit_kernel/diffusion/cutedsl/scale_residual_norm_scale_shift.py`
- `python/sglang/jit_kernel/norm.py`
- `python/sglang/multimodal_gen/runtime/platforms/cuda.py`
- `python/sglang/multimodal_gen/runtime/layers/attention/selector.py`
- `docs/diffusion/performance/attention_backends.md`

**Core Fusion Patterns**

1. Scale/Shift elementwise fusion (AdaLN modulation)
   - Kernels: `fuse_scale_shift_kernel`, `fuse_scale_shift_gate_select01_kernel`
   - Locations: `elementwise.py`, `layernorm.py`, `qwen_image.py`, `triton/scale_shift.py`
   - Use cases: `x * (1 + scale) + shift` and `a * (k + b) + c`
   - Constraints: `x` must be CUDA and contiguous. `scale`/`shift` support 0D/1D/2D/3D/4D broadcast. 4D `[B, F, 1, C]` requires `L % F == 0`.
   - NPU fallback: `scale_shift.py` swaps to the `npu_fallback` native path.
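As a semantics-only sketch (plain Python over lists, not the Triton kernel itself), the AdaLN modulation in pattern 1 computes the following; the real path is `fuse_scale_shift_kernel` on contiguous CUDA tensors:

```python
def scale_shift_reference(x, scale, shift):
    """Reference semantics of the fused AdaLN modulation:
    y = x * (1 + scale) + shift, elementwise over one row.
    Readable stand-in only; the fused Triton kernel avoids the
    intermediate tensors this expression would materialize."""
    return [xi * (1.0 + s) + t for xi, s, t in zip(x, scale, shift)]


def scale_shift_scalar(x, scale, shift):
    """The 0-D broadcast case: one scalar scale/shift for the row."""
    return [xi * (1.0 + scale) + shift for xi in x]
```

The point of the fusion is that the three elementwise ops run in a single kernel launch instead of three, with no temporaries.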
2. Norm + Scale/Shift fusion (CuTe DSL)
   - Kernels: `fused_norm_scale_shift`, `fused_scale_residual_norm_scale_shift`
   - Locations: `layernorm.py`, `cutedsl/scale_residual_norm_scale_shift.py`
   - Use cases:
     - `y = norm(x) * (1 + scale) + shift`
     - `y = norm(residual + gate * x) * (1 + scale) + shift`
   - Constraints: `D % 256 == 0` and `D <= 8192`. `x`/`residual`/`gate`/`scale`/`shift` must pass shape and stride validation. Dtypes limited to fp16/bf16/fp32.
   - Behavior: CuTe DSL compilation is cached by `(dtype, ndim, D, norm_type)`. `None` tensors are replaced by scalar placeholders. If constraints fail, `layernorm.py` warns and falls back to native PyTorch.
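For the second use case, a plain-Python reference (scalar `gate`/`scale`/`shift` for brevity; the fused kernel also accepts broadcastable tensors and enforces the `D % 256 == 0`, `D <= 8192` constraints) may help when validating a new model against the fused path:

```python
import math


def rms_norm(x, weight, eps=1e-6):
    # RMSNorm over one row: x / sqrt(mean(x^2) + eps) * weight.
    ms = sum(v * v for v in x) / len(x)
    inv = 1.0 / math.sqrt(ms + eps)
    return [v * inv * w for v, w in zip(x, weight)]


def scale_residual_norm_scale_shift_ref(x, residual, gate, weight, scale, shift):
    """Reference semantics of the CuTe DSL fusion:
    y = norm(residual + gate * x) * (1 + scale) + shift."""
    h = [r + gate * v for r, v in zip(residual, x)]
    n = rms_norm(h, weight)
    return [v * (1.0 + scale) + shift for v in n]
```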
3. Triton LayerNorm/RMSNorm fusion
   - Kernels: `rms_norm_fn`, `layer_norm_fn`, `norm_infer`
   - Locations: `triton/norm.py`, `layernorm.py`
   - Use cases: fp32 RMSNorm with residual/dropout/rowscale/x1 branches, and the inference-friendly `norm_infer`.
   - Constraints: the last dim must be contiguous, and `N * element_size < 64KB`.

4. Triton one-pass RMSNorm (small hidden size fast path)
   - Kernel: `triton_one_pass_rms_norm`
   - Locations: `triton/rmsnorm_onepass.py`, `layernorm.py`
   - Use case: `hidden_size <= 128` in `RMSNorm.forward_cuda`.

5. Triton RoPE fusion
   - Kernel: `apply_rotary_embedding`
   - Locations: `triton/rotary.py`, `rotary_embedding/utils.py`
   - Use case: GPT-J style RoPE when not Neox.
   - Constraints: `head_size` must be even.
   - NPU fallback: `npu_fallback.apply_rotary_embedding_native`.
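GPT-J style RoPE rotates adjacent (interleaved) pairs of a head vector, which is why `head_size` must be even. A readable stand-in for one head vector (not the fused Triton kernel):

```python
def apply_rotary_gptj_ref(x, cos, sin):
    """Reference for GPT-J-style (interleaved) RoPE on one head
    vector: each pair (x[2i], x[2i+1]) is rotated by the angle whose
    cos/sin are at index i. Neox style instead pairs x[i] with
    x[i + head_size // 2]."""
    out = []
    for i in range(0, len(x), 2):
        c, s = cos[i // 2], sin[i // 2]
        out.append(x[i] * c - x[i + 1] * s)
        out.append(x[i] * s + x[i + 1] * c)
    return out
```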
**Faster CUDA Kernel Usage Points**

1. sgl-kernel RMSNorm and fused add RMSNorm
   - Location: `layernorm.py`
   - Behavior: CUDA uses `sgl_kernel.fused_add_rmsnorm` and `sgl_kernel.rmsnorm`. `hidden_size <= 128` uses the Triton one-pass kernel. ROCm falls back to native.

2. Attention backend selection (FlashAttention, Sage, SDPA)
   - Locations: `platforms/cuda.py`, `attention/selector.py`, `docs/diffusion/performance/attention_backends.md`
   - Behavior: CUDA prefers FlashAttention (FA3/FA4) when supported, otherwise Torch SDPA. Force a backend via `--attention-backend` or `global_force_attn_backend`.

3. FlashInfer RoPE (Q/K inplace)
   - Location: `rotary_embedding/utils.py`
   - Behavior: `flashinfer.rope.apply_rope_with_cos_sin_cache_inplace` when available, otherwise the Triton RoPE fallback.

**QK Norm Optimization**

- Entry point: `apply_qk_norm` in `layernorm.py`.
- Fast path: JIT fused inplace QK norm from `python/sglang/jit_kernel/norm.py` via `fused_inplace_qknorm`.
- Preconditions for the fused path:
  - CUDA only.
  - `allow_inplace=True` and `q_eps == k_eps`.
  - `can_use_fused_inplace_qknorm(head_dim, dtype)` returns true.
  - Supported head dims: `64, 128, 256, 512, 1024`.
- Behavior: the fused path operates on `q` and `k` in place after reshaping to `[B, -1, head_dim]`. If preconditions fail, fall back to per-tensor RMSNorm.
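The dispatch above can be sketched as a small predicate. This is an illustrative, hypothetical helper, not SGLang code; the real checks live in `apply_qk_norm` and `can_use_fused_inplace_qknorm` (which also inspects the dtype):

```python
SUPPORTED_QKNORM_HEAD_DIMS = {64, 128, 256, 512, 1024}


def pick_qknorm_path(is_cuda, allow_inplace, q_eps, k_eps, head_dim):
    """Illustrative dispatch mirroring the preconditions listed above.
    Returns the name of the path the real code would take."""
    if (is_cuda and allow_inplace and q_eps == k_eps
            and head_dim in SUPPORTED_QKNORM_HEAD_DIMS):
        return "fused_inplace_qknorm"
    return "per_tensor_rmsnorm"
```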
**Common Entry Points in Diffusion Models**

- AdaLN modulation: `LayerNormScaleShift`, `RMSNormScaleShift`, `ScaleResidual*` in `layernorm.py`.
- Qwen-Image gating: `fuse_scale_shift_gate_select01_kernel` in `qwen_image.py`.
- QK norm: `apply_qk_norm` used in `flux.py`, `flux_2.py`, `qwen_image.py`, `zimage.py`, `wanvideo.py`, `ltx_2.py`, `hunyuanvideo.py`.
- RoPE: `_apply_rotary_emb` prefers Triton; Q/K RoPE prefers FlashInfer when present.

**Constraints and Fallbacks**

- The `scale_shift` Triton path requires CUDA and contiguous `x`. NPU swaps to native.
- CuTe DSL fused norms require `D % 256 == 0` and `D <= 8192`.
- Triton norm kernels error on feature size `>= 64KB`.
- FlashAttention requires fp16/bf16 and SM80+; otherwise SDPA.
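A quick way to sanity-check a new model's hidden size against the CuTe constraints is a guard like the following (hypothetical helper; the real validation, including shape/stride checks, lives in `layernorm.py`, which warns and falls back to native PyTorch on failure):

```python
def cute_fused_norm_ok(D, dtype):
    """Mirror of the CuTe DSL fused-norm constraints listed above:
    D divisible by 256, D at most 8192, dtype fp16/bf16/fp32."""
    return D % 256 == 0 and D <= 8192 and dtype in ("fp16", "bf16", "fp32")
```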
**Integration Checklist for New Models**

1. Reuse `LayerNormScaleShift` or `ScaleResidual*` modules instead of re-implementing fusion logic.
2. Keep tensors contiguous and satisfy the D alignment (`% 256`) and size (`<= 8192`) constraints for CuTe fused paths.
3. Use `fuse_scale_shift_kernel` for AdaLN modulation and keep a PyTorch fallback.
4. Use `apply_qk_norm` and ensure `head_dim` is in the supported list for fused QK norm.
5. If using FlashInfer RoPE, avoid packed QKV and ensure Q/K are contiguous.
6. For attention, follow the `selector.py` priority; override with the CLI only if needed.

**When Extending or Modifying Kernels**

- Add `torch.library.custom_op` and `register_fake` for compile and meta support.
- Keep CuTe compile cache keys aligned to `(dtype, ndim, D)`.
- Avoid implicit broadcasts that force hidden `contiguous()` copies.
- Preserve NPU and ROCm fallback paths.
New file: 248 additions & 0 deletions
---
name: write-sglang-test
description: Guide for writing SGLang CI/UT tests following project conventions. Covers CustomTestCase, CI registration, server fixtures, model selection, and test placement. Use when creating new tests, adding CI test cases, writing unit tests, or when the user asks to add tests for SGLang features.
---

# Writing SGLang CI / UT Tests

## Core Rules

1. **Always use `CustomTestCase`** — never raw `unittest.TestCase`
2. **Place tests in `test/registered/<category>/`** — only use `test/manual/` for debugging / non-CI tests
3. **Reuse server fixtures** — inherit from `DefaultServerBase` or write `setUpClass`/`tearDownClass` with `popen_launch_server`
4. **Smallest model for model-agnostic functionality** — use `DEFAULT_SMALL_MODEL_NAME_FOR_TEST` (Llama-3.2-1B-Instruct) for basic features that don't depend on model size
5. **8B for general performance** — use `DEFAULT_MODEL_NAME_FOR_TEST` (Llama-3.1-8B-Instruct, single-node) for performance tests that don't involve spec / DP / parallelism
6. **Bigger features → discuss case by case** — spec, DP attention, tensor/pipeline parallelism, etc. may need multi-GPU suites and specific models

---

## Test File Template

### Functional correctness test (small model)

```python
import unittest

import requests

from sglang.srt.utils import kill_process_tree
from sglang.test.ci.ci_register import register_cuda_ci
from sglang.test.test_utils import (
    DEFAULT_SMALL_MODEL_NAME_FOR_TEST,
    DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
    DEFAULT_URL_FOR_TEST,
    CustomTestCase,
    popen_launch_server,
)

register_cuda_ci(est_time=60, suite="stage-b-test-small-1-gpu")


class TestMyFeature(CustomTestCase):
    @classmethod
    def setUpClass(cls):
        cls.model = DEFAULT_SMALL_MODEL_NAME_FOR_TEST
        cls.base_url = DEFAULT_URL_FOR_TEST
        cls.process = popen_launch_server(
            cls.model,
            cls.base_url,
            timeout=DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
            other_args=["--arg1", "value1"],  # feature-specific args
        )

    @classmethod
    def tearDownClass(cls):
        kill_process_tree(cls.process.pid)

    def test_basic_functionality(self):
        response = requests.post(
            self.base_url + "/generate",
            json={"text": "Hello", "sampling_params": {"max_new_tokens": 32}},
        )
        self.assertEqual(response.status_code, 200)


if __name__ == "__main__":
    unittest.main(verbosity=3)
```

### General performance test (8B model, single node, no spec/DP/parallelism)

```python
import time
import unittest

import requests

from sglang.srt.utils import kill_process_tree
from sglang.test.ci.ci_register import register_cuda_ci
from sglang.test.test_utils import (
    DEFAULT_MODEL_NAME_FOR_TEST,
    DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
    DEFAULT_URL_FOR_TEST,
    CustomTestCase,
    popen_launch_server,
)

register_cuda_ci(est_time=300, suite="stage-b-test-large-1-gpu")


class TestMyFeaturePerf(CustomTestCase):
    @classmethod
    def setUpClass(cls):
        cls.model = DEFAULT_MODEL_NAME_FOR_TEST
        cls.base_url = DEFAULT_URL_FOR_TEST
        cls.process = popen_launch_server(
            cls.model,
            cls.base_url,
            timeout=DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
        )

    @classmethod
    def tearDownClass(cls):
        kill_process_tree(cls.process.pid)

    def test_latency(self):
        start = time.perf_counter()
        response = requests.post(
            self.base_url + "/generate",
            json={"text": "Hello", "sampling_params": {"max_new_tokens": 128}},
        )
        elapsed = time.perf_counter() - start
        self.assertEqual(response.status_code, 200)
        self.assertLess(elapsed, 5.0, "Latency exceeded threshold")


if __name__ == "__main__":
    unittest.main(verbosity=3)
```

---

## Server Fixture Reuse

For tests that only need a standard server, inherit from `DefaultServerBase` and override class attributes:

```python
from sglang.test.server_fixtures.default_fixture import DefaultServerBase
from sglang.test.test_utils import DEFAULT_SMALL_MODEL_NAME_FOR_TEST


class TestMyFeature(DefaultServerBase):
    model = DEFAULT_SMALL_MODEL_NAME_FOR_TEST
    other_args = ["--enable-my-feature"]

    def test_something(self):
        ...
```

Available fixtures in `python/sglang/test/server_fixtures/`:

| Fixture | Use case |
|---------|----------|
| `DefaultServerBase` | Standard single-server tests |
| `EagleServerBase` | EAGLE speculative decoding |
| `PDDisaggregationServerBase` | Disaggregated prefill/decode |
| `MMMUServerBase` | Multimodal VLM tests |

---

## CI Registration

Every test file in `test/registered/` **must** call a registration function at module level:

```python
from sglang.test.ci.ci_register import register_cuda_ci, register_amd_ci

register_cuda_ci(est_time=60, suite="stage-b-test-small-1-gpu")
register_amd_ci(est_time=60, suite="stage-b-test-small-1-gpu-amd")  # optional
```

Parameters:

- `est_time`: estimated runtime in seconds (used for CI partitioning)
- `suite`: which CI suite to run in (see below)
- `nightly=True`: for nightly-only tests (default `False` = per-commit)
- `disabled="reason"`: temporarily disable with an explanation

### Suite selection guide

**Default cases (1 GPU):**

| Scenario | Model | Suite |
|----------|-------|-------|
| Model-agnostic basic functionality | 1B (smallest) | `stage-b-test-small-1-gpu` |
| General performance (no spec/DP/parallelism) | 8B | `stage-b-test-large-1-gpu` |

**Bigger features (case by case):**

| Scenario | Suite |
|----------|-------|
| 2 GPU (e.g. TP=2) | `stage-b-test-large-2-gpu` |
| 4 GPU (H100) | `stage-c-test-4-gpu-h100` |
| 8 GPU (H200) | `stage-c-test-8-gpu-h200` |
| Nightly, 1 GPU | `nightly-1-gpu` |
| Nightly, 8 GPU | `nightly-8-gpu` |

For spec, DP attention, parallelism, disaggregation, etc., discuss with the team to determine the appropriate suite and GPU configuration.
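The default cases in the tables above can be summarized as a small mapping. This is a hypothetical helper for illustration only, not part of SGLang; real suite selection for multi-GPU features is a team decision:

```python
def pick_default_suite(needs_perf_8b, num_gpus=1, nightly=False):
    """Map the default (1- and 2-GPU) scenarios above to a suite name."""
    if nightly:
        return "nightly-8-gpu" if num_gpus == 8 else "nightly-1-gpu"
    if num_gpus == 1:
        return ("stage-b-test-large-1-gpu" if needs_perf_8b
                else "stage-b-test-small-1-gpu")
    if num_gpus == 2:
        return "stage-b-test-large-2-gpu"
    # 4-GPU (H100) and 8-GPU (H200) suites are chosen case by case.
    raise ValueError("discuss multi-GPU suite selection with the team")
```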
---

## Model Constants

All defined in `python/sglang/test/test_utils.py`:

| Constant | Model | When to use |
|----------|-------|-------------|
| `DEFAULT_SMALL_MODEL_NAME_FOR_TEST` | Llama-3.2-1B-Instruct | Model-agnostic basic functionality |
| `DEFAULT_SMALL_MODEL_NAME_FOR_TEST_BASE` | Llama-3.2-1B | Base (non-instruct) model tests |
| `DEFAULT_MODEL_NAME_FOR_TEST` | Llama-3.1-8B-Instruct | General performance (single node) |
| `DEFAULT_MOE_MODEL_NAME_FOR_TEST` | Mixtral-8x7B-Instruct | MoE-specific tests |
| `DEFAULT_SMALL_EMBEDDING_MODEL_NAME_FOR_TEST` | | Embedding tests |
| `DEFAULT_SMALL_VLM_MODEL_NAME_FOR_TEST` | | Vision-language tests |

---

## Test Placement

```
test/
├── registered/        # CI tests (auto-discovered by run_suite.py)
│   ├── sampling/      # test_penalty.py, test_sampling_params.py ...
│   ├── sessions/      # test_session_control.py ...
│   ├── openai_server/ # basic/, features/, validation/ ...
│   ├── spec/          # eagle/, utils/ ...
│   ├── models/        # model-specific accuracy tests
│   ├── perf/          # performance benchmarks
│   └── <category>/    # create a new category if needed
├── manual/            # Non-CI: debugging, one-off, manual verification
└── run_suite.py       # CI runner (scans registered/ only)
```

**Decision rule**: if the test should run in CI → `registered/`. If it is for local debugging or requires special hardware not in CI → `manual/`.

---

## Key Utilities

```python
from sglang.test.test_utils import (
    CustomTestCase,                     # base class with retry logic
    popen_launch_server,                # launch a server subprocess
    DEFAULT_URL_FOR_TEST,               # auto-configured base URL
    DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,  # 600s default
    run_bench_serving,                  # benchmark helper (launch + bench)
)
from sglang.srt.utils import kill_process_tree  # clean up the server
```

---

## Checklist

Before submitting a test:

- [ ] Inherits from `CustomTestCase` (not `unittest.TestCase`)
- [ ] Has a `register_*_ci(...)` call at module level
- [ ] Placed in `test/registered/<category>/`
- [ ] Model selection: smallest for model-agnostic features, 8B for general perf, case by case for other complex features
- [ ] `setUpClass` launches the server, `tearDownClass` kills it
- [ ] Has `if __name__ == "__main__": unittest.main(verbosity=3)`
- [ ] `est_time` is reasonable (measure locally)
