benchmark/BENCHMARK_GUIDELINES.md
# Guideline for Adding Benchmark Scripts

This document describes how to add new benchmark scripts to Liger-Kernel in line with the shared framework.

## 1. Where to add a script

- **Location**: `benchmark/scripts/`
- **Naming**: `benchmark_<kernel_name>.py` (e.g. `benchmark_geglu.py`, `benchmark_dyt.py`)

> **Baseline implementations**: Import reference (non-Liger) kernels from the
> test suite (e.g. `test/transformers/test_<kernel>.py`) to use as baselines.
> This keeps benchmark and test implementations in sync and avoids duplicating
> reference code in benchmark scripts.

## 2. Shared infrastructure

Do **not** hardcode batch size, sequence length, or model dimensions. All benchmark scripts share the following:

| Need | Use |
|------|-----|
| Model dimensions (hidden_size, vocab_size, etc.) | `benchmark_model_configs.py`: `ModelConfig`, `MODEL_REGISTRY`, `get_benchmark_model_config()` |
| Memory probing | `benchmark_model_configs.py`: `estimate_kernel_peak_memory()` |
| Safe sweep configs | `compute_seq_len_sweep_config()`, `compute_model_config_sweep_config()` |
| Speed / memory measurement | `utils.py`: `run_speed_benchmark()`, `run_memory_benchmark()` |
| Running the grid and writing CSV | `utils.py`: `run_benchmarks()` |
| CLI arguments | `utils.py`: `parse_benchmark_script_args()` — provides `--model`, `--overwrite`, `--sweep-mode`, `--bt` |


### 2.1 Setup factory

Define a single **setup function** that builds inputs and the layer from `SingleBenchmarkRunInput`, so both speed and memory benchmarks reuse the same setup.

- **Signature**: `_setup_<kernel>(input: SingleBenchmarkRunInput) -> (tensors, layer_or_fn)`
- **Input**: `input.x` is the varying dimension (e.g. `seq_len` or `hidden_size`); `input.extra_benchmark_config` holds fixed params such as `bsz`, `hidden_size`, and `dtype`; `input.kernel_provider` identifies the implementation variant (`"liger"`, `"huggingface"`, `"torch"`, etc.).
- **Return**: whatever the benchmark helpers need (e.g. `(x, layer)` for a single-tensor forward like GEGLU).

For example:

```python
def _setup_geglu(input: SingleBenchmarkRunInput):
    cfg = input.extra_benchmark_config
    # Build the model config, create the x tensor, and instantiate the
    # layer (LigerGEGLUMLP vs. the baseline MLP) based on input.kernel_provider.
    return x, layer
```

### 2.2 Speed and memory benchmark functions

- **Speed**: `run_speed_benchmark(fwd_fn, mode, input_tensors, rep=...)`
- **Memory**: `run_memory_benchmark(fwd_fn, mode)`

Each takes `SingleBenchmarkRunInput` and returns `SingleBenchmarkRunOutput`:

```python
def bench_speed_geglu(input: SingleBenchmarkRunInput) -> SingleBenchmarkRunOutput:
    x, layer = _setup_geglu(input)
    return run_speed_benchmark(lambda: layer(x), input.kernel_operation_mode, [x])


def bench_memory_geglu(input: SingleBenchmarkRunInput) -> SingleBenchmarkRunOutput:
    x, layer = _setup_geglu(input)
    return run_memory_benchmark(lambda: layer(x), input.kernel_operation_mode)
```

- Use `kernel_operation_modes=["full", "forward", "backward"]` for both speed and memory.
- For **scalar output** (e.g. loss) or **multiple outputs** (e.g. RoPE), implement custom measurement logic but still use the same setup factory and `run_benchmarks()`.

### 2.3 Memory probing

Most scripts should probe peak memory before computing sweep configs:

1. Define a `_probe()` that creates tensors/layers at a small scale, runs a forward pass, and returns the output tensor. `_probe()` owns setup; `estimate_kernel_peak_memory()` resets memory stats before the call, runs `.backward()` on the result, and cleans up afterward (gc + cache clear).
2. Call `peak_bytes = estimate_kernel_peak_memory(probe_fn=_probe)`.
3. Use `peak_bytes` to derive safe sweep parameters (see sections 3 and 4).

Use the **highest-memory baseline** implementation for probing (e.g. `"huggingface"` or `"torch"`) to get a safe upper bound.
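The probing contract above can be sketched in pure Python. This is a mock, not the real Liger-Kernel helper: `FakeTensor` and the body of `estimate_kernel_peak_memory` are stand-ins that only illustrate the call shape (`_probe()` owns setup, the helper runs backward and reports peak bytes); the real implementation reads CUDA memory stats.

```python
class FakeTensor:
    """Stand-in for a tensor whose output the probe returns."""

    def __init__(self, nbytes):
        self.nbytes = nbytes

    def backward(self):
        pass  # the real helper calls .backward() on the probe's output


def estimate_kernel_peak_memory(probe_fn):
    # Real helper: reset CUDA memory stats, run probe_fn() + backward,
    # read the peak allocation, then clean up (gc + cache clear).
    out = probe_fn()
    out.backward()
    return out.nbytes  # mocked "peak" = output size


def _probe():
    # Build small-scale tensors/layers here and return the forward output.
    # Invented shape: fp32 (4 bytes), bsz * probe_seq_len = 1024, hidden = 2048.
    return FakeTensor(nbytes=4 * 1024 * 2048)


peak_bytes = estimate_kernel_peak_memory(probe_fn=_probe)
kernel_bpt = peak_bytes // 1024  # bytes per token at the probe's token count
```

With these made-up sizes, `peak_bytes` is 8388608 and `kernel_bpt` is 8192; in a real script `kernel_bpt` feeds into `compute_seq_len_sweep_config()`.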

## 3. D1 — Non-model dimension sweep

Sweep non-model dimensions (e.g. sequence length, BT) with a **fixed model config**. Use `--model` to select which model.

### 3.1 How to implement

In `__main__`, the `token_length` sweep mode (default) follows this pattern:

1. Parse args and resolve model: `args = parse_benchmark_script_args()`, `model = get_benchmark_model_config(args.model)`.
2. Probe and compute sweep config:
- **seq_len sweep** (GEGLU, SwiGLU, etc.): `kernel_bpt = peak_bytes // probe_seq_len`, then `config = compute_seq_len_sweep_config(model, kernel_bytes_per_token=kernel_bpt)`. Returns `SeqLenSweepConfig` with `batch_size` and `seq_len`.
   - **BT sweep** (other ops): sweep `BT` directly, or treat it as a fixed dimension when no sweep over it is needed.
3. Build `x_values` from `config.seq_len` (e.g. `[2**i for i in range(10, int(math.log2(config.seq_len)) + 1)]`).
4. Build `extra_benchmark_configs` with fixed model dimensions: `bsz=config.batch_size`, `hidden_size=model.hidden_size`, `dtype=model.dtype`, etc.
5. Call `run_benchmarks(...)` for both speed and memory.
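Step 3 can be sketched on its own; `seq_len` here is an invented value standing in for `config.seq_len` returned by `compute_seq_len_sweep_config()`:

```python
import math

# Power-of-two x_values from 1024 up to the safe maximum sequence length.
seq_len = 16384  # placeholder for config.seq_len
x_values = [2**i for i in range(10, int(math.log2(seq_len)) + 1)]
# -> [1024, 2048, 4096, 8192, 16384]
```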

### 3.2 How to run

```bash
# Default model (llama_3_8b)
python benchmark_geglu.py

# Specific model
python benchmark_geglu.py --model llama_2_7b

# Overwrite existing CSV entries
python benchmark_geglu.py --model llama_3_8b --overwrite
```

### 3.3 Reference scripts

- **seq_len sweep**: `benchmark_geglu.py`, `benchmark_swiglu.py` — `compute_seq_len_sweep_config()`

## 4. D2 — Model dimension sweep

Sweep across discrete model configs from `MODEL_REGISTRY` with a **fixed token count**. Use `--bt` to set the token count.

### 4.1 Discrete model-config sweep

Sweep across all `MODEL_REGISTRY` entries as discrete data points. Activated by `--sweep-mode model_config`.

**How to implement:**

1. Add a `_resolve_model_config_<kernel>` helper that maps `input.x` (model index) to a standard `SingleBenchmarkRunInput`:

```python
def _resolve_model_config_geglu(input: SingleBenchmarkRunInput):
"""Resolve model-config-sweep input into standard setup args."""
cfg = input.extra_benchmark_config
model_info = cfg["model_configs"][int(input.x)]
return _setup_geglu(SingleBenchmarkRunInput(
x=cfg["seq_len"],
kernel_provider=input.kernel_provider,
extra_benchmark_config={
"bsz": cfg["bsz"],
"hidden_size": model_info["hidden_size"],
"intermediate_size": model_info["intermediate_size"],
"hidden_act": cfg["hidden_act"],
"dtype": model_info["dtype"],
},
))
```

2. Add `bench_speed_<kernel>_model_config` and `bench_memory_<kernel>_model_config`:

```python
def bench_speed_geglu_model_config(input):
    x, layer = _resolve_model_config_geglu(input)
    return run_speed_benchmark(lambda: layer(x), input.kernel_operation_mode, [x])


def bench_memory_geglu_model_config(input):
    x, layer = _resolve_model_config_geglu(input)
    return run_memory_benchmark(lambda: layer(x), input.kernel_operation_mode)
```

3. In `__main__`, gate on `args.sweep_mode == "model_config"`:
- Build `_probe_factory(model_cfg, probe_seq_len)` that returns a probe callable.
- Call `sweep = compute_model_config_sweep_config(all_model_configs, probe_fn_factory=..., bt=args.bt)`.
- Build `model_configs_info` (list of dicts with each model's dimensions) and pass in `extra_benchmark_configs`.
- `x_values = list(range(len(sweep.model_configs)))` (model indices).
- Call `run_benchmarks(bench_test_fn=bench_speed_<kernel>_model_config, ...)`.
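The wiring in step 3 can be sketched in pure Python. The registry entries, the batch size, and the `hidden_act` value are invented placeholders, not the real `MODEL_REGISTRY` contents; the point is how `x_values` map to model indices and how the per-model dimensions travel through `extra_benchmark_configs`.

```python
# Placeholder registry; real dimensions come from MODEL_REGISTRY.
MODEL_REGISTRY = {
    "llama_2_7b": {"hidden_size": 4096, "intermediate_size": 11008},
    "llama_3_8b": {"hidden_size": 4096, "intermediate_size": 14336},
}

bt, bsz = 2048, 1  # --bt fixes the total token count; seq_len = bt // bsz

# One dict per model, consumed by _resolve_model_config_<kernel> via input.x.
model_configs_info = [{"name": name, **dims} for name, dims in MODEL_REGISTRY.items()]

# x is simply the model index into model_configs_info.
x_values = list(range(len(model_configs_info)))

extra_benchmark_config = {
    "model_configs": model_configs_info,
    "seq_len": bt // bsz,
    "bsz": bsz,
    "hidden_act": "gelu_pytorch_tanh",  # placeholder activation name
}
```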

**Reference**: `benchmark_geglu.py`, `benchmark_swiglu.py`, `benchmark_dyt.py` — all support `--sweep-mode model_config`.

### 4.2 How to run

```bash
# Discrete model-config sweep with default bt=2048
python benchmark_geglu.py --sweep-mode model_config

# With custom bt
python benchmark_geglu.py --sweep-mode model_config --bt 4096
```

## 5. Checklist

- [ ] Script under `benchmark/scripts/` named `benchmark_<kernel>.py`.
- [ ] Single `_setup_<kernel>(SingleBenchmarkRunInput)` used by both speed and memory.
- [ ] Speed/memory via `run_speed_benchmark` / `run_memory_benchmark` (or custom variant for loss/multi-output).
- [ ] `kernel_operation_modes=["full", "forward", "backward"]` for both speed and memory.
- [ ] No hardcoded batch size or sequence length; sweep configs from `compute_*_sweep_config()` + `estimate_kernel_peak_memory()`.
- [ ] Model dimensions and dtype from `ModelConfig` / `get_benchmark_model_config()` / `args.model`.
- [ ] CLI via `parse_benchmark_script_args()` (so `--model`, `--overwrite`, `--sweep-mode`, `--bt` all work).
- [ ] Results written through `run_benchmarks()` to the shared CSV.
- [ ] Model-config sweep: `_resolve_model_config_<kernel>`, `bench_speed_<kernel>_model_config`, `bench_memory_<kernel>_model_config`, and `__main__` model-config code path.
benchmark/README.md

Follow these steps to benchmark and visualize kernel performance:
3. Visualize results
- Use the visualization script with optional modes:

* `--sweep-mode`: Select which sweep data to plot.
- `token_length` (default): plots where x-axis is sequence length.
- `model_config`: plots where x-axis is model configuration.
* To target specific operation mode(s), pass `--kernel-operation-mode` one or more values.
* If you omit `--kernel-operation-mode`, the script will:
- For `speed` metrics: generate plots for all available modes (forward/backward/full).
- For `memory` metrics: generate only the `full` plot.

Examples:
1. Token-length sweep, specific modes (speed):
```bash
python benchmarks_visualizer.py \
--kernel-name kto_loss \
--metric-name speed \
--kernel-operation-mode forward backward
```
2. Token-length sweep, all modes (speed):
```bash
python benchmarks_visualizer.py \
--kernel-name kto_loss \
--metric-name speed
```
3. Model-config sweep (speed):
```bash
python benchmarks_visualizer.py \
--kernel-name geglu \
--metric-name speed \
--sweep-mode model_config
```
4. Memory (always full):
```bash
python benchmarks_visualizer.py \
--kernel-name kto_loss \
--metric-name memory
```

4. View results
- Generated plots will be saved in `benchmark/visualizations/`
- Filenames include the sweep mode when specified (e.g. `geglu_speed_full_model_config.png`)