Commit 0e54614

Merge branch 'main' into chunked_grpo_streaming_origin_main
2 parents 0248d72 + d8d6630 commit 0e54614


46 files changed: +5381 −414 lines

README.md

Lines changed: 6 additions & 0 deletions
```diff
@@ -263,6 +263,7 @@ loss.backward()
 | Qwen2.5-VL | `liger_kernel.transformers.apply_liger_kernel_to_qwen2_5_vl` | RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
 | Qwen3 | `liger_kernel.transformers.apply_liger_kernel_to_qwen3` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
 | Qwen3 MoE | `liger_kernel.transformers.apply_liger_kernel_to_qwen3_moe` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
+| Qwen3.5 | `liger_kernel.transformers.apply_liger_kernel_to_qwen3_5` | RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
 | Phi3 & Phi3.5 | `liger_kernel.transformers.apply_liger_kernel_to_phi3` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
 | Granite 3.0 & 3.1 | `liger_kernel.transformers.apply_liger_kernel_to_granite` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss |
 | OLMo2 | `liger_kernel.transformers.apply_liger_kernel_to_olmo2` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
@@ -365,6 +366,11 @@ loss.backward()
 <img src="https://github.com/linkedin/Liger-Kernel/actions/workflows/intel-ci.yml/badge.svg?branch=main&event=push" alt="Build">
 </a>
 </div>
+<div style="display: block;">
+<a href="https://github.com/xuedinge233/Liger-Kernel/actions/workflows/ascend_npu_ci.yml">
+<img src="https://github.com/xuedinge233/Liger-Kernel/actions/workflows/ascend_npu_ci.yml/badge.svg?branch=main" alt="Build">
+</a>
+</div>
 </td>
 </tr>
 </table>
```

benchmark/BENCHMARK_GUIDELINES.md

Lines changed: 101 additions & 0 deletions
# Guideline for Adding Benchmark Scripts

This document describes how to add new benchmark scripts to Liger-Kernel in line with the shared framework.

## 1. Where and how to add a script

- **Location**: `benchmark/scripts/`
- **Naming**: `benchmark_<kernel_name>.py` (e.g. `benchmark_geglu.py`, `benchmark_swiglu.py`)

## 2. Use shared infrastructure

Do **not** hardcode batch size, sequence length, or model dimensions. Use:

| Need | Use |
|------|-----|
| Model dimensions (hidden_size, vocab_size, etc.) | `benchmark_model_configs.py`: `ModelConfig`, `get_benchmark_model_config()` |
| Safe sweep config (seq_len or hidden_size) | `compute_seq_len_sweep_config()` (returns `SeqLenSweepConfig`) or `compute_hidden_size_sweep_config()` (returns `HiddenSizeSweepConfig`), with optional `estimate_kernel_peak_memory()` |
| Speed / memory measurement | `utils.py`: `run_speed_benchmark()`, `run_memory_benchmark()` |
| CLI (overwrite, model choice) | `utils.py`: `parse_benchmark_script_args()` (includes `--model`) |
| Running the grid and writing CSV | `utils.py`: `run_benchmarks()` |

## 3. Script structure (three parts)

### 3.1 Setup factory

Define a single **setup function** that builds inputs and the layer (or callable) from `SingleBenchmarkRunInput`, so both speed and memory benchmarks reuse the same setup.

- **Signature**: `_setup_<kernel>(input: SingleBenchmarkRunInput) -> (tensors, layer_or_fn)`
- **Input**: `input.x` is the varying dimension (e.g. sequence length); `input.extra_benchmark_config` holds `bsz`, `hidden_size`, `dtype`, etc.; `input.kernel_provider` identifies the implementation variant (e.g. `"liger"`, `"huggingface"`, `"torch"`; values are kernel-specific).
- **Return**: Whatever the benchmark helpers need (e.g. `(x, layer)` for a single-tensor forward like GEGLU).

Example (conceptually):

```python
def _setup_geglu(input: SingleBenchmarkRunInput):
    cfg = input.extra_benchmark_config
    # Build config, create x tensor, instantiate LigerGEGLUMLP or LlamaMLP by provider
    return x, layer
```
### 3.2 Speed and memory benchmark functions

Each takes `SingleBenchmarkRunInput` and returns `SingleBenchmarkRunOutput` by calling the shared helpers.

- **Speed**: `run_speed_benchmark(fwd_fn, mode, input_tensors, rep=...)`
- **Memory**: `run_memory_benchmark(fwd_fn, mode)`
- **Modes**: Use `["full", "forward", "backward"]` for both speed and memory for consistency.

Example:

```python
def bench_speed_geglu(input: SingleBenchmarkRunInput) -> SingleBenchmarkRunOutput:
    x, layer = _setup_geglu(input)
    return run_speed_benchmark(lambda: layer(x), input.kernel_operation_mode, [x])


def bench_memory_geglu(input: SingleBenchmarkRunInput) -> SingleBenchmarkRunOutput:
    x, layer = _setup_geglu(input)
    return run_memory_benchmark(lambda: layer(x), input.kernel_operation_mode)
```

For **scalar output** (e.g. loss) or **multiple outputs** (e.g. RoPE), use the appropriate helpers from `utils.py` if available (e.g. loss or multi-output variants), or implement custom measurement and still use the same setup factory and `run_benchmarks()`.

### 3.3 `__main__`: model config, shape computation, run

1. Parse args: `args = parse_benchmark_script_args()` and resolve `model = get_benchmark_model_config(args.model)`.
2. (Recommended) Measure peak memory with a small probe using the **highest-memory baseline** implementation (e.g. `"huggingface"` or `"torch"`):
   - Define a `_probe()` function that creates tensors/layers, runs a forward pass, and returns the output tensor. `_probe()` owns setup; `estimate_kernel_peak_memory` handles memory-stat reset before the call, runs `.backward()`, and performs cleanup (gc + cache clear) afterward.
   - Call `peak_bytes = estimate_kernel_peak_memory(probe_fn=_probe)`.
3. Compute sweep config (device memory is obtained internally by both helpers):
   - **Sequence-length sweep** (e.g. GEGLU, SwiGLU): convert peak bytes to per-token (`kernel_bpt = peak_bytes // probe_seq_len`), then `config = compute_seq_len_sweep_config(model, kernel_bytes_per_token=kernel_bpt)`. The returned `SeqLenSweepConfig` has `batch_size` and `seq_len`.
   - **Hidden-size sweep** (e.g. DyT): pass total peak bytes directly: `config = compute_hidden_size_sweep_config(model, kernel_peak_bytes=peak_bytes, bt=BT)`. The returned `HiddenSizeSweepConfig` has `bt` and `max_hidden_size`.
4. Build `x_values` from `config.seq_len` (seq_len sweep) or `config.max_hidden_size` (hidden_size sweep).
5. Build `extra_benchmark_configs` from `model` and config:
   - Seq_len sweep: e.g. `bsz=config.batch_size`, `hidden_size=model.hidden_size`, `dtype=model.dtype`.
   - Hidden_size sweep: e.g. `BT=config.bt`, `dtype=model.dtype`.
6. Call `run_benchmarks(..., kernel_operation_modes=["full", "forward", "backward"], ...)` for both speed and memory.
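Steps 2–4 reduce to a little arithmetic. With hypothetical numbers (a 6 GiB probed peak, a 4096-token probe, and an assumed `max_hidden_size` of 8192 from the sweep helper), it looks like the following; the `x_values` grid mirrors the one this commit adds to `benchmark_dyt.py`:

```python
# Hypothetical probe result, standing in for estimate_kernel_peak_memory().
peak_bytes = 6 * 1024**3          # 6 GiB

# Seq-len sweep (step 3a): normalize the probed peak to bytes per token
# before handing it to compute_seq_len_sweep_config().
probe_seq_len = 4096
kernel_bpt = peak_bytes // probe_seq_len

# Hidden-size sweep (steps 3b-4): suppose the helper capped the sweep here.
max_hidden_size = 8192            # stand-in for HiddenSizeSweepConfig.max_hidden_size
model_hidden_size = 4096

# x_values grid as written in benchmark_dyt.py: multiples of 1024 up to the
# cap, falling back to the model's own hidden size if nothing fits.
x_values = [1024 * i for i in range(1, 17) if 1024 * i <= max_hidden_size] or [model_hidden_size]
```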
## 4. CLI

Scripts should support:

- `--overwrite`: overwrite existing rows in the benchmark CSV.
- `--model`: model profile name from `MODEL_REGISTRY` (e.g. `llama_2_7b`, `llama_3_8b`). Default when not set is `DEFAULT_MODEL_CONFIG` (e.g. `llama_3_8b`).

These are provided by `parse_benchmark_script_args()` in `utils.py`.
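The real parser lives in `utils.py`; as a sketch, a minimal argparse equivalent with the two documented flags might look like this (the help strings and the exact default-resolution are assumptions):

```python
import argparse

# Assumption: mirrors the DEFAULT_MODEL_CONFIG named above.
DEFAULT_MODEL_CONFIG = "llama_3_8b"


def parse_benchmark_script_args(argv=None):
    """Sketch of the shared CLI parser: flag names match the guideline,
    everything else is a guess at a minimal implementation."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--overwrite", action="store_true",
                        help="Overwrite existing rows in the benchmark CSV.")
    parser.add_argument("--model", default=DEFAULT_MODEL_CONFIG,
                        help="Model profile name from MODEL_REGISTRY.")
    return parser.parse_args(argv)
```

So `python benchmark/scripts/benchmark_geglu.py --model llama_2_7b --overwrite` would select the `llama_2_7b` profile and replace any existing CSV rows.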
## 5. Reference scripts

- **Element-wise (single tensor in/out, seq_len sweep)**: `benchmark_geglu.py`, `benchmark_swiglu.py` (use `compute_seq_len_sweep_config()`).
- **Element-wise (single tensor in/out, hidden_size sweep)**: `benchmark_dyt.py` (uses `compute_hidden_size_sweep_config()`).

## 6. Checklist for a new script

- [ ] Script under `benchmark/scripts/` named `benchmark_<kernel>.py`.
- [ ] Single `_setup_<kernel>(SingleBenchmarkRunInput)` used by both speed and memory.
- [ ] Speed/memory implemented via `run_speed_benchmark` / `run_memory_benchmark` (or the correct variant for loss / multi-output).
- [ ] `kernel_operation_modes=["full", "forward", "backward"]` for both speed and memory.
- [ ] No hardcoded batch size or sequence length; use `compute_seq_len_sweep_config()` or `compute_hidden_size_sweep_config()` (and optionally `estimate_kernel_peak_memory()`).
- [ ] Model dimensions and dtype from `ModelConfig` / `get_benchmark_model_config()` / `args.model`.
- [ ] CLI via `parse_benchmark_script_args()` (so `--model` and `--overwrite` work).
- [ ] Results written through `run_benchmarks()` so data goes to the shared CSV.

benchmark/scripts/benchmark_dyt.py

Lines changed: 43 additions & 89 deletions
```diff
@@ -2,14 +2,16 @@
 import sys
 
 import torch
-import triton
 
-from utils import QUANTILES
+from benchmark_model_configs import compute_hidden_size_sweep_config
+from benchmark_model_configs import estimate_kernel_peak_memory
+from benchmark_model_configs import get_benchmark_model_config
 from utils import SingleBenchmarkRunInput
 from utils import SingleBenchmarkRunOutput
-from utils import _test_memory
 from utils import parse_benchmark_script_args
 from utils import run_benchmarks
+from utils import run_memory_benchmark
+from utils import run_speed_benchmark
 
 from liger_kernel.utils import infer_device
 
@@ -18,124 +20,76 @@
 sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "../..")))
 
 
-def bench_speed_dyt(input: SingleBenchmarkRunInput) -> SingleBenchmarkRunOutput:
+def _setup_dyt(input: SingleBenchmarkRunInput):
+    """Create input tensor and DyT layer from benchmark config."""
     from test.transformers.test_dyt import LigerDyT
     from test.transformers.test_dyt import TorchDyT
 
+    cfg = input.extra_benchmark_config
     hidden_size = input.x
-    provider = input.kernel_provider
-    mode = input.kernel_operation_mode
-    extra_benchmark_config = input.extra_benchmark_config
-    BT = extra_benchmark_config["BT"]
-    beta = extra_benchmark_config["beta"]
-    dtype = extra_benchmark_config["dtype"]
-
-    x_shape = (BT, hidden_size)
-    torch_dyt = TorchDyT(hidden_size=hidden_size, beta=beta).to(device)
-    torch_compile_dyt = torch.compile(TorchDyT(hidden_size=hidden_size, beta=beta).to(device))
-    triton_dyt = LigerDyT(hidden_size=hidden_size, beta=beta).to(device)
-
-    x = torch.randn(x_shape, dtype=dtype, device=device)
-    dy = torch.randn_like(x)
-    x.requires_grad_(True)
-
-    def fwd():
-        if provider == "liger":
-            return triton_dyt(x)
-        elif provider == "torch":
-            return torch_dyt(x)
-        elif provider == "torch_compile":
-            return torch_compile_dyt(x)
-
-    if mode == "forward":
-        ms_50, ms_20, ms_80 = triton.testing.do_bench(fwd, quantiles=QUANTILES, grad_to_none=[x], rep=500)
-    elif mode == "backward":
-        y = fwd()
-        ms_50, ms_20, ms_80 = triton.testing.do_bench(
-            lambda: y.backward(dy, retain_graph=True),
-            quantiles=QUANTILES,
-            grad_to_none=[x],
-            rep=500,
-        )
-    elif mode == "full":
-
-        def full():
-            y = fwd()
-            y.backward(dy)
+    x = torch.randn(cfg["BT"], hidden_size, device=device, dtype=cfg["dtype"], requires_grad=True)
+    if input.kernel_provider == "liger":
+        layer = LigerDyT(hidden_size=hidden_size, beta=cfg["beta"]).to(device)
+    elif input.kernel_provider == "torch":
+        layer = TorchDyT(hidden_size=hidden_size, beta=cfg["beta"]).to(device)
+    elif input.kernel_provider == "torch_compile":
+        layer = torch.compile(TorchDyT(hidden_size=hidden_size, beta=cfg["beta"]).to(device))
+    else:
+        raise ValueError(f"Invalid provider: {input.kernel_provider} for DyT")
+    return x, layer
 
-        ms_50, ms_20, ms_80 = triton.testing.do_bench(full, quantiles=QUANTILES, grad_to_none=[x], rep=500)
 
-    return SingleBenchmarkRunOutput(
-        y_20=ms_20,
-        y_50=ms_50,
-        y_80=ms_80,
-    )
+def bench_speed_dyt(input: SingleBenchmarkRunInput) -> SingleBenchmarkRunOutput:
+    x, layer = _setup_dyt(input)
+    return run_speed_benchmark(lambda: layer(x), input.kernel_operation_mode, [x])
 
 
 def bench_memory_dyt(input: SingleBenchmarkRunInput) -> SingleBenchmarkRunOutput:
-    from test.transformers.test_dyt import LigerDyT
-    from test.transformers.test_dyt import TorchDyT
+    x, layer = _setup_dyt(input)
+    return run_memory_benchmark(lambda: layer(x), input.kernel_operation_mode)
 
-    hidden_size = input.x
-    provider = input.kernel_provider
-    extra_benchmark_config = input.extra_benchmark_config
-    BT = extra_benchmark_config["BT"]
-    beta = extra_benchmark_config["beta"]
-    dtype = extra_benchmark_config["dtype"]
-
-    x_shape = (BT, hidden_size)
-    torch_dyt = TorchDyT(hidden_size=hidden_size, beta=beta).to(device)
-    torch_compile_dyt = torch.compile(TorchDyT(hidden_size=hidden_size, beta=beta).to(device))
-    triton_dyt = LigerDyT(hidden_size=hidden_size, beta=beta).to(device)
-
-    x = torch.randn(x_shape, dtype=dtype, device=device)
-    dy = torch.randn_like(x)
-    x.requires_grad_(True)
-
-    def fwd():
-        if provider == "liger":
-            return triton_dyt(x)
-        elif provider == "torch":
-            return torch_dyt(x)
-        elif provider == "torch_compile":
-            return torch_compile_dyt(x)
-
-    def full():
-        y = fwd()
-        y.backward(dy, retain_graph=True)
-
-    mem_50, mem_20, mem_80 = _test_memory(full, quantiles=QUANTILES)
-    return SingleBenchmarkRunOutput(
-        y_20=mem_20,
-        y_50=mem_50,
-        y_80=mem_80,
-    )
 
+BT = 4096
 
 if __name__ == "__main__":
     args = parse_benchmark_script_args()
+    model = get_benchmark_model_config(args.model)
 
     for beta in [False, True]:
+
+        def _probe():
+            probe_input = SingleBenchmarkRunInput(
+                x=model.hidden_size,
+                kernel_provider="torch",
+                extra_benchmark_config={"BT": BT, "dtype": model.dtype, "beta": beta},
+            )
+            x, layer = _setup_dyt(probe_input)
+            return layer(x)
+
+        peak_bytes = estimate_kernel_peak_memory(probe_fn=_probe)
+        sweep_config = compute_hidden_size_sweep_config(model, peak_bytes, bt=BT)
+        x_values = [1024 * i for i in range(1, 17) if 1024 * i <= sweep_config.max_hidden_size] or [model.hidden_size]
+
         common_configs = {
             "kernel_name": f"dyt_beta={beta}",
             "x_name": "hidden_size",
             "x_label": "hidden_size",
-            "x_values": [1024 * i for i in range(1, 17)],
+            "x_values": x_values,
             "kernel_providers": ["liger", "torch", "torch_compile"],
-            "extra_benchmark_configs": [{"BT": 4096, "dtype": torch.bfloat16, "beta": beta}],
+            "extra_benchmark_configs": [{"BT": sweep_config.bt, "dtype": model.dtype, "beta": beta}],
             "overwrite": args.overwrite,
         }
 
         run_benchmarks(
             bench_test_fn=bench_speed_dyt,
-            kernel_operation_modes=["forward", "backward", "full"],
+            kernel_operation_modes=["full", "forward", "backward"],
             metric_name="speed",
             metric_unit="ms",
             **common_configs,
         )
         run_benchmarks(
             bench_test_fn=bench_memory_dyt,
-            kernel_operation_modes=["full"],
+            kernel_operation_modes=["full", "forward", "backward"],
             metric_name="memory",
             metric_unit="MB",
             **common_configs,
```
