benchmark/BENCHMARK_GUIDELINES.md
# Guideline for Adding Benchmark Scripts

This document describes how to add new benchmark scripts to Liger-Kernel in line with the shared framework.

## 1. Where to add a script

- **Location**: `benchmark/scripts/`
- **Naming**: `benchmark_<kernel_name>.py` (e.g. `benchmark_geglu.py`, `benchmark_dyt.py`)

> **Baseline implementations**: Import reference (non-Liger) kernels from the
> test suite (e.g. `test/transformers/test_<kernel>.py`) to use as baselines.
> This keeps benchmark and test implementations in sync and avoids duplicating
> reference code in benchmark scripts.

## 2. Shared infrastructure

Do **not** hardcode batch size, sequence length, or model dimensions. All benchmark scripts share the following:

| Need | Use |
|------|-----|
| Model dimensions (hidden_size, vocab_size, etc.) | `benchmark_model_configs.py`: `ModelConfig`, `MODEL_REGISTRY`, `get_benchmark_model_config()` |
| Memory probing | `benchmark_model_configs.py`: `estimate_kernel_peak_memory()` |
| Safe sweep configs | `compute_seq_len_sweep_config()`, `compute_model_config_sweep_config()` |
| Speed / memory measurement | `utils.py`: `run_speed_benchmark()`, `run_memory_benchmark()` |
| Running the grid and writing CSV | `utils.py`: `run_benchmarks()` |
| CLI arguments | `utils.py`: `parse_benchmark_script_args()` — provides `--model`, `--overwrite`, `--sweep-mode`, `--bt` |


### 2.1 Setup factory

Define a single **setup function** that builds inputs and the layer from `SingleBenchmarkRunInput`, so both speed and memory benchmarks reuse the same setup.

- **Signature**: `_setup_<kernel>(input: SingleBenchmarkRunInput) -> (tensors, layer_or_fn)`
- **Input**: `input.x` is the varying dimension (e.g. `seq_len` or `hidden_size`); `input.extra_benchmark_config` holds fixed params such as `bsz`, `hidden_size`, and `dtype`; `input.kernel_provider` identifies the implementation variant (`"liger"`, `"huggingface"`, `"torch"`, etc.).
- **Return**: whatever the benchmark helpers need (e.g. `(x, layer)` for a single-tensor forward like GEGLU).

For example:

```python
def _setup_geglu(input: SingleBenchmarkRunInput):
    cfg = input.extra_benchmark_config
    # Build the model config, create the x tensor, and instantiate the
    # layer (LigerGEGLUMLP vs. the baseline MLP) based on input.kernel_provider.
    return x, layer
```

### 2.2 Speed and memory benchmark functions

- **Speed**: `run_speed_benchmark(fwd_fn, mode, input_tensors, rep=...)`
- **Memory**: `run_memory_benchmark(fwd_fn, mode)`

Each takes `SingleBenchmarkRunInput` and returns `SingleBenchmarkRunOutput`:

```python
def bench_speed_geglu(input: SingleBenchmarkRunInput) -> SingleBenchmarkRunOutput:
    x, layer = _setup_geglu(input)
    return run_speed_benchmark(lambda: layer(x), input.kernel_operation_mode, [x])


def bench_memory_geglu(input: SingleBenchmarkRunInput) -> SingleBenchmarkRunOutput:
    x, layer = _setup_geglu(input)
    return run_memory_benchmark(lambda: layer(x), input.kernel_operation_mode)
```

- Use `kernel_operation_modes=["full", "forward", "backward"]` for both speed and memory.
- For **scalar output** (e.g. loss) or **multiple outputs** (e.g. RoPE), implement custom measurement logic but still use the same setup factory and `run_benchmarks()`.

### 2.3 Memory probing

Most scripts should probe peak memory before computing sweep configs:

1. Define a `_probe()` that creates tensors/layers at a small scale, runs a forward pass, and returns the output tensor. `_probe()` owns setup; `estimate_kernel_peak_memory()` resets memory stats before the call, runs `.backward()` on the result, and cleans up afterward (gc + cache clear).
2. Call `peak_bytes = estimate_kernel_peak_memory(probe_fn=_probe)`.
3. Use `peak_bytes` to derive safe sweep parameters (see sections 3 and 4).

Use the **highest-memory baseline** implementation for probing (e.g. `"huggingface"` or `"torch"`) to get a safe upper bound.
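The probing contract above can be sketched in pure Python. This is a mock, not the real Liger-Kernel helper: `FakeTensor` and the body of `estimate_kernel_peak_memory` are stand-ins that only illustrate the call shape (`_probe()` owns setup, the helper runs backward and reports peak bytes); the real implementation reads CUDA memory stats.

```python
class FakeTensor:
    """Stand-in for a tensor whose output the probe returns."""

    def __init__(self, nbytes):
        self.nbytes = nbytes

    def backward(self):
        pass  # the real helper calls .backward() on the probe's output


def estimate_kernel_peak_memory(probe_fn):
    # Real helper: reset CUDA memory stats, run probe_fn() + backward,
    # read the peak allocation, then clean up (gc + cache clear).
    out = probe_fn()
    out.backward()
    return out.nbytes  # mocked "peak" = output size


def _probe():
    # Build small-scale tensors/layers here and return the forward output.
    # Invented shape: fp32 (4 bytes), bsz * probe_seq_len = 1024, hidden = 2048.
    return FakeTensor(nbytes=4 * 1024 * 2048)


peak_bytes = estimate_kernel_peak_memory(probe_fn=_probe)
kernel_bpt = peak_bytes // 1024  # bytes per token at the probe's token count
```

With these made-up sizes, `peak_bytes` is 8388608 and `kernel_bpt` is 8192; in a real script `kernel_bpt` feeds into `compute_seq_len_sweep_config()`.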

## 3. D1 — Non-model dimension sweep

Sweep non-model dimensions (e.g. sequence length, BT) with a **fixed model config**. Use `--model` to select which model.

### 3.1 How to implement

In `__main__`, the `token_length` sweep mode (default) follows this pattern:

1. Parse args and resolve model: `args = parse_benchmark_script_args()`, `model = get_benchmark_model_config(args.model)`.
2. Probe and compute sweep config:
- **seq_len sweep** (GEGLU, SwiGLU, etc.): `kernel_bpt = peak_bytes // probe_seq_len`, then `config = compute_seq_len_sweep_config(model, kernel_bytes_per_token=kernel_bpt)`. Returns `SeqLenSweepConfig` with `batch_size` and `seq_len`.
   - **BT sweep** (other ops): sweep `BT` directly, or treat it as a fixed dimension when no sweep over it is needed.
3. Build `x_values` from `config.seq_len` (e.g. `[2**i for i in range(10, int(math.log2(config.seq_len)) + 1)]`).
4. Build `extra_benchmark_configs` with fixed model dimensions: `bsz=config.batch_size`, `hidden_size=model.hidden_size`, `dtype=model.dtype`, etc.
5. Call `run_benchmarks(...)` for both speed and memory.
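Step 3 can be sketched on its own; `seq_len` here is an invented value standing in for `config.seq_len` returned by `compute_seq_len_sweep_config()`:

```python
import math

# Power-of-two x_values from 1024 up to the safe maximum sequence length.
seq_len = 16384  # placeholder for config.seq_len
x_values = [2**i for i in range(10, int(math.log2(seq_len)) + 1)]
# -> [1024, 2048, 4096, 8192, 16384]
```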

### 3.2 How to run

```bash
# Default model (llama_3_8b)
python benchmark_geglu.py

# Specific model
python benchmark_geglu.py --model llama_2_7b

# Overwrite existing CSV entries
python benchmark_geglu.py --model llama_3_8b --overwrite
```

### 3.3 Reference scripts

- **seq_len sweep**: `benchmark_geglu.py`, `benchmark_swiglu.py` — `compute_seq_len_sweep_config()`

## 4. D2 — Model dimension sweep

Sweep across discrete model configs from `MODEL_REGISTRY` with a **fixed token count**. Use `--bt` to set the token count.

### 4.1 Discrete model-config sweep

Sweep across all `MODEL_REGISTRY` entries as discrete data points. Activated by `--sweep-mode model_config`.

**How to implement:**

1. Add a `_resolve_model_config_<kernel>` helper that maps `input.x` (model index) to a standard `SingleBenchmarkRunInput`:

```python
def _resolve_model_config_geglu(input: SingleBenchmarkRunInput):
"""Resolve model-config-sweep input into standard setup args."""
cfg = input.extra_benchmark_config
model_info = cfg["model_configs"][int(input.x)]
return _setup_geglu(SingleBenchmarkRunInput(
x=cfg["seq_len"],
kernel_provider=input.kernel_provider,
extra_benchmark_config={
"bsz": cfg["bsz"],
"hidden_size": model_info["hidden_size"],
"intermediate_size": model_info["intermediate_size"],
"hidden_act": cfg["hidden_act"],
"dtype": model_info["dtype"],
},
))
```

2. Add `bench_speed_<kernel>_model_config` and `bench_memory_<kernel>_model_config`:

```python
def bench_speed_geglu_model_config(input):
    x, layer = _resolve_model_config_geglu(input)
    return run_speed_benchmark(lambda: layer(x), input.kernel_operation_mode, [x])


def bench_memory_geglu_model_config(input):
    x, layer = _resolve_model_config_geglu(input)
    return run_memory_benchmark(lambda: layer(x), input.kernel_operation_mode)
```

3. In `__main__`, gate on `args.sweep_mode == "model_config"`:
- Build `_probe_factory(model_cfg, probe_seq_len)` that returns a probe callable.
- Call `sweep = compute_model_config_sweep_config(all_model_configs, probe_fn_factory=..., bt=args.bt)`.
- Build `model_configs_info` (list of dicts with each model's dimensions) and pass in `extra_benchmark_configs`.
- `x_values = list(range(len(sweep.model_configs)))` (model indices).
- Call `run_benchmarks(bench_test_fn=bench_speed_<kernel>_model_config, ...)`.
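The wiring in step 3 can be sketched in pure Python. The registry entries, the batch size, and the `hidden_act` value are invented placeholders, not the real `MODEL_REGISTRY` contents; the point is how `x_values` map to model indices and how the per-model dimensions travel through `extra_benchmark_configs`.

```python
# Placeholder registry; real dimensions come from MODEL_REGISTRY.
MODEL_REGISTRY = {
    "llama_2_7b": {"hidden_size": 4096, "intermediate_size": 11008},
    "llama_3_8b": {"hidden_size": 4096, "intermediate_size": 14336},
}

bt, bsz = 2048, 1  # --bt fixes the total token count; seq_len = bt // bsz

# One dict per model, consumed by _resolve_model_config_<kernel> via input.x.
model_configs_info = [{"name": name, **dims} for name, dims in MODEL_REGISTRY.items()]

# x is simply the model index into model_configs_info.
x_values = list(range(len(model_configs_info)))

extra_benchmark_config = {
    "model_configs": model_configs_info,
    "seq_len": bt // bsz,
    "bsz": bsz,
    "hidden_act": "gelu_pytorch_tanh",  # placeholder activation name
}
```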

**Reference**: `benchmark_geglu.py`, `benchmark_swiglu.py`, `benchmark_dyt.py` — all support `--sweep-mode model_config`.

### 4.2 How to run

```bash
# Discrete model-config sweep with default bt=2048
python benchmark_geglu.py --sweep-mode model_config

# With custom bt
python benchmark_geglu.py --sweep-mode model_config --bt 4096
```

## 5. Checklist

- [ ] Script under `benchmark/scripts/` named `benchmark_<kernel>.py`.
- [ ] Single `_setup_<kernel>(SingleBenchmarkRunInput)` used by both speed and memory.
- [ ] Speed/memory via `run_speed_benchmark` / `run_memory_benchmark` (or custom variant for loss/multi-output).
- [ ] `kernel_operation_modes=["full", "forward", "backward"]` for both speed and memory.
- [ ] No hardcoded batch size or sequence length; sweep configs from `compute_*_sweep_config()` + `estimate_kernel_peak_memory()`.
- [ ] Model dimensions and dtype from `ModelConfig` / `get_benchmark_model_config()` / `args.model`.
- [ ] CLI via `parse_benchmark_script_args()` (so `--model`, `--overwrite`, `--sweep-mode`, `--bt` all work).
- [ ] Results written through `run_benchmarks()` to the shared CSV.
- [ ] Model-config sweep: `_resolve_model_config_<kernel>`, `bench_speed_<kernel>_model_config`, `bench_memory_<kernel>_model_config`, and `__main__` model-config code path.
benchmark/README.md

Follow these steps to benchmark and visualize kernel performance:
3. Visualize results
- Use the visualization script with optional modes:

* `--sweep-mode`: Select which sweep data to plot.
- `token_length` (default): plots where x-axis is sequence length.
- `model_config`: plots where x-axis is model configuration.
* To target specific operation mode(s), pass `--kernel-operation-mode` one or more values.
* If you omit `--kernel-operation-mode`, the script will:
- For `speed` metrics: generate plots for all available modes (forward/backward/full).
- For `memory` metrics: generate only the `full` plot.

Examples:
1. Token-length sweep, specific modes (speed):
```bash
python benchmarks_visualizer.py \
--kernel-name kto_loss \
--metric-name speed \
--kernel-operation-mode forward backward
```
2. Token-length sweep, all modes (speed):
```bash
python benchmarks_visualizer.py \
--kernel-name kto_loss \
--metric-name speed
```
3. Model-config sweep (speed):
```bash
python benchmarks_visualizer.py \
--kernel-name geglu \
--metric-name speed \
--sweep-mode model_config
```
4. Memory (always full):
```bash
python benchmarks_visualizer.py \
--kernel-name kto_loss \
--metric-name memory
```

4. View results
- Generated plots will be saved in `benchmark/visualizations/`
- Filenames include the sweep mode when specified (e.g. `geglu_speed_full_model_config.png`)