Commit 0e54614

Merge branch 'main' into chunked_grpo_streaming_origin_main
2 parents 0248d72 + d8d6630 commit 0e54614


46 files changed: +5381 −414 lines

README.md

Lines changed: 6 additions & 0 deletions
```diff
@@ -263,6 +263,7 @@ loss.backward()
 | Qwen2.5-VL | `liger_kernel.transformers.apply_liger_kernel_to_qwen2_5_vl` | RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
 | Qwen3 | `liger_kernel.transformers.apply_liger_kernel_to_qwen3` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
 | Qwen3 MoE | `liger_kernel.transformers.apply_liger_kernel_to_qwen3_moe` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
+| Qwen3.5 | `liger_kernel.transformers.apply_liger_kernel_to_qwen3_5` | RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
 | Phi3 & Phi3.5 | `liger_kernel.transformers.apply_liger_kernel_to_phi3` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
 | Granite 3.0 & 3.1 | `liger_kernel.transformers.apply_liger_kernel_to_granite` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss |
 | OLMo2 | `liger_kernel.transformers.apply_liger_kernel_to_olmo2` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
@@ -365,6 +366,11 @@ loss.backward()
 <img src="https://github.com/linkedin/Liger-Kernel/actions/workflows/intel-ci.yml/badge.svg?branch=main&event=push" alt="Build">
 </a>
 </div>
+<div style="display: block;">
+<a href="https://github.com/xuedinge233/Liger-Kernel/actions/workflows/ascend_npu_ci.yml">
+<img src="https://github.com/xuedinge233/Liger-Kernel/actions/workflows/ascend_npu_ci.yml/badge.svg?branch=main" alt="Build">
+</a>
+</div>
 </td>
 </tr>
 </table>
```

benchmark/BENCHMARK_GUIDELINES.md

Lines changed: 101 additions & 0 deletions
# Guideline for Adding Benchmark Scripts

This document describes how to add new benchmark scripts to Liger-Kernel in line with the shared framework.

## 1. Where and how to add a script

- **Location**: `benchmark/scripts/`
- **Naming**: `benchmark_<kernel_name>.py` (e.g. `benchmark_geglu.py`, `benchmark_swiglu.py`)

## 2. Use shared infrastructure

Do **not** hardcode batch size, sequence length, or model dimensions. Use:

| Need | Use |
|------|-----|
| Model dimensions (hidden_size, vocab_size, etc.) | `benchmark_model_configs.py`: `ModelConfig`, `get_benchmark_model_config()` |
| Safe sweep config (seq_len or hidden_size) | `compute_seq_len_sweep_config()` (returns `SeqLenSweepConfig`) or `compute_hidden_size_sweep_config()` (returns `HiddenSizeSweepConfig`), with optional `estimate_kernel_peak_memory()` |
| Speed / memory measurement | `utils.py`: `run_speed_benchmark()`, `run_memory_benchmark()` |
| CLI (overwrite, model choice) | `utils.py`: `parse_benchmark_script_args()` (includes `--model`) |
| Running the grid and writing CSV | `utils.py`: `run_benchmarks()` |

## 3. Script structure (three parts)

### 3.1 Setup factory

Define a single **setup function** that builds inputs and the layer (or callable) from `SingleBenchmarkRunInput`, so both speed and memory benchmarks reuse the same setup.

- **Signature**: `_setup_<kernel>(input: SingleBenchmarkRunInput) -> (tensors, layer_or_fn)`
- **Input**: `input.x` is the varying dimension (e.g. sequence length); `input.extra_benchmark_config` holds `bsz`, `hidden_size`, `dtype`, etc.; `input.kernel_provider` identifies the implementation variant (e.g. `"liger"`, `"huggingface"`, `"torch"`; values are kernel-specific).
- **Return**: Whatever the benchmark helpers need (e.g. `(x, layer)` for a single-tensor forward like GEGLU).

Example (conceptually):

```python
def _setup_geglu(input: SingleBenchmarkRunInput):
    cfg = input.extra_benchmark_config
    # Build config, create x tensor, instantiate LigerGEGLUMLP or LlamaMLP by provider
    return x, layer
```
### 3.2 Speed and memory benchmark functions

Each takes `SingleBenchmarkRunInput` and returns `SingleBenchmarkRunOutput` by calling the shared helpers.

- **Speed**: `run_speed_benchmark(fwd_fn, mode, input_tensors, rep=...)`
- **Memory**: `run_memory_benchmark(fwd_fn, mode)`
- **Modes**: Use `["full", "forward", "backward"]` for both speed and memory for consistency.

Example:

```python
def bench_speed_geglu(input: SingleBenchmarkRunInput) -> SingleBenchmarkRunOutput:
    x, layer = _setup_geglu(input)
    return run_speed_benchmark(lambda: layer(x), input.kernel_operation_mode, [x])


def bench_memory_geglu(input: SingleBenchmarkRunInput) -> SingleBenchmarkRunOutput:
    x, layer = _setup_geglu(input)
    return run_memory_benchmark(lambda: layer(x), input.kernel_operation_mode)
```

For **scalar output** (e.g. loss) or **multiple outputs** (e.g. RoPE), use the appropriate helpers from `utils.py` if available (e.g. loss or multi-output variants), or implement custom measurement and still use the same setup factory and `run_benchmarks()`.

### 3.3 `__main__`: model config, shape computation, run

1. Parse args: `args = parse_benchmark_script_args()` and resolve `model = get_benchmark_model_config(args.model)`.
2. (Recommended) Measure peak memory with a small probe using the **highest-memory baseline** implementation (e.g. `"huggingface"` or `"torch"`):
   - Define a `_probe()` function that creates tensors/layers, runs a forward pass, and returns the output tensor. `_probe()` owns setup; `estimate_kernel_peak_memory` handles memory-stat reset before the call, runs `.backward()`, and performs cleanup (gc + cache clear) afterward.
   - Call `peak_bytes = estimate_kernel_peak_memory(probe_fn=_probe)`.
3. Compute sweep config (device memory is obtained internally by both helpers):
   - **Sequence-length sweep** (e.g. GEGLU, SwiGLU): convert peak bytes to per-token (`kernel_bpt = peak_bytes // probe_seq_len`), then `config = compute_seq_len_sweep_config(model, kernel_bytes_per_token=kernel_bpt)`. The returned `SeqLenSweepConfig` has `batch_size` and `seq_len`.
   - **Hidden-size sweep** (e.g. DyT): pass total peak bytes directly: `config = compute_hidden_size_sweep_config(model, kernel_peak_bytes=peak_bytes, bt=BT)`. The returned `HiddenSizeSweepConfig` has `bt` and `max_hidden_size`.
4. Build `x_values` from `config.seq_len` (seq_len sweep) or `config.max_hidden_size` (hidden_size sweep).
5. Build `extra_benchmark_configs` from `model` and config:
   - Seq_len sweep: e.g. `bsz=config.batch_size`, `hidden_size=model.hidden_size`, `dtype=model.dtype`.
   - Hidden_size sweep: e.g. `BT=config.bt`, `dtype=model.dtype`.
6. Call `run_benchmarks(..., kernel_operation_modes=["full", "forward", "backward"], ...)` for both speed and memory.
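Steps 2–4 reduce to a little arithmetic. With hypothetical numbers (a 6 GiB probed peak, a 4096-token probe, and an assumed `max_hidden_size` of 8192 from the sweep helper), it looks like the following; the `x_values` grid mirrors the one this commit adds to `benchmark_dyt.py`:

```python
# Hypothetical probe result, standing in for estimate_kernel_peak_memory().
peak_bytes = 6 * 1024**3          # 6 GiB

# Seq-len sweep (step 3a): normalize the probed peak to bytes per token
# before handing it to compute_seq_len_sweep_config().
probe_seq_len = 4096
kernel_bpt = peak_bytes // probe_seq_len

# Hidden-size sweep (steps 3b-4): suppose the helper capped the sweep here.
max_hidden_size = 8192            # stand-in for HiddenSizeSweepConfig.max_hidden_size
model_hidden_size = 4096

# x_values grid as written in benchmark_dyt.py: multiples of 1024 up to the
# cap, falling back to the model's own hidden size if nothing fits.
x_values = [1024 * i for i in range(1, 17) if 1024 * i <= max_hidden_size] or [model_hidden_size]
```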
## 4. CLI

Scripts should support:

- `--overwrite`: overwrite existing rows in the benchmark CSV.
- `--model`: model profile name from `MODEL_REGISTRY` (e.g. `llama_2_7b`, `llama_3_8b`). Default when not set is `DEFAULT_MODEL_CONFIG` (e.g. `llama_3_8b`).

These are provided by `parse_benchmark_script_args()` in `utils.py`.
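The real parser lives in `utils.py`; as a sketch, a minimal argparse equivalent with the two documented flags might look like this (the help strings and the exact default-resolution are assumptions):

```python
import argparse

# Assumption: mirrors the DEFAULT_MODEL_CONFIG named above.
DEFAULT_MODEL_CONFIG = "llama_3_8b"


def parse_benchmark_script_args(argv=None):
    """Sketch of the shared CLI parser: flag names match the guideline,
    everything else is a guess at a minimal implementation."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--overwrite", action="store_true",
                        help="Overwrite existing rows in the benchmark CSV.")
    parser.add_argument("--model", default=DEFAULT_MODEL_CONFIG,
                        help="Model profile name from MODEL_REGISTRY.")
    return parser.parse_args(argv)
```

So `python benchmark/scripts/benchmark_geglu.py --model llama_2_7b --overwrite` would select the `llama_2_7b` profile and replace any existing CSV rows.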
## 5. Reference scripts

- **Element-wise (single tensor in/out, seq_len sweep)**: `benchmark_geglu.py`, `benchmark_swiglu.py` (use `compute_seq_len_sweep_config()`).
- **Element-wise (single tensor in/out, hidden_size sweep)**: `benchmark_dyt.py` (uses `compute_hidden_size_sweep_config()`).

## 6. Checklist for a new script

- [ ] Script under `benchmark/scripts/` named `benchmark_<kernel>.py`.
- [ ] Single `_setup_<kernel>(SingleBenchmarkRunInput)` used by both speed and memory.
- [ ] Speed/memory implemented via `run_speed_benchmark` / `run_memory_benchmark` (or the correct variant for loss / multi-output).
- [ ] `kernel_operation_modes=["full", "forward", "backward"]` for both speed and memory.
- [ ] No hardcoded batch size or sequence length; use `compute_seq_len_sweep_config()` or `compute_hidden_size_sweep_config()` (and optionally `estimate_kernel_peak_memory()`).
- [ ] Model dimensions and dtype from `ModelConfig` / `get_benchmark_model_config()` / `args.model`.
- [ ] CLI via `parse_benchmark_script_args()` (so `--model` and `--overwrite` work).
- [ ] Results written through `run_benchmarks()` so data goes to the shared CSV.

benchmark/scripts/benchmark_dyt.py

Lines changed: 43 additions & 89 deletions
```diff
@@ -2,14 +2,16 @@
 import sys
 
 import torch
-import triton
 
-from utils import QUANTILES
+from benchmark_model_configs import compute_hidden_size_sweep_config
+from benchmark_model_configs import estimate_kernel_peak_memory
+from benchmark_model_configs import get_benchmark_model_config
 from utils import SingleBenchmarkRunInput
 from utils import SingleBenchmarkRunOutput
-from utils import _test_memory
 from utils import parse_benchmark_script_args
 from utils import run_benchmarks
+from utils import run_memory_benchmark
+from utils import run_speed_benchmark
 
 from liger_kernel.utils import infer_device
 
@@ -18,124 +20,76 @@
 sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "../..")))
 
 
-def bench_speed_dyt(input: SingleBenchmarkRunInput) -> SingleBenchmarkRunOutput:
+def _setup_dyt(input: SingleBenchmarkRunInput):
+    """Create input tensor and DyT layer from benchmark config."""
     from test.transformers.test_dyt import LigerDyT
     from test.transformers.test_dyt import TorchDyT
 
+    cfg = input.extra_benchmark_config
     hidden_size = input.x
-    provider = input.kernel_provider
-    mode = input.kernel_operation_mode
-    extra_benchmark_config = input.extra_benchmark_config
-    BT = extra_benchmark_config["BT"]
-    beta = extra_benchmark_config["beta"]
-    dtype = extra_benchmark_config["dtype"]
-
-    x_shape = (BT, hidden_size)
-    torch_dyt = TorchDyT(hidden_size=hidden_size, beta=beta).to(device)
-    torch_compile_dyt = torch.compile(TorchDyT(hidden_size=hidden_size, beta=beta).to(device))
-    triton_dyt = LigerDyT(hidden_size=hidden_size, beta=beta).to(device)
-
-    x = torch.randn(x_shape, dtype=dtype, device=device)
-    dy = torch.randn_like(x)
-    x.requires_grad_(True)
-
-    def fwd():
-        if provider == "liger":
-            return triton_dyt(x)
-        elif provider == "torch":
-            return torch_dyt(x)
-        elif provider == "torch_compile":
-            return torch_compile_dyt(x)
-
-    if mode == "forward":
-        ms_50, ms_20, ms_80 = triton.testing.do_bench(fwd, quantiles=QUANTILES, grad_to_none=[x], rep=500)
-    elif mode == "backward":
-        y = fwd()
-        ms_50, ms_20, ms_80 = triton.testing.do_bench(
-            lambda: y.backward(dy, retain_graph=True),
-            quantiles=QUANTILES,
-            grad_to_none=[x],
-            rep=500,
-        )
-    elif mode == "full":
-
-        def full():
-            y = fwd()
-            y.backward(dy)
+    x = torch.randn(cfg["BT"], hidden_size, device=device, dtype=cfg["dtype"], requires_grad=True)
+    if input.kernel_provider == "liger":
+        layer = LigerDyT(hidden_size=hidden_size, beta=cfg["beta"]).to(device)
+    elif input.kernel_provider == "torch":
+        layer = TorchDyT(hidden_size=hidden_size, beta=cfg["beta"]).to(device)
+    elif input.kernel_provider == "torch_compile":
+        layer = torch.compile(TorchDyT(hidden_size=hidden_size, beta=cfg["beta"]).to(device))
+    else:
+        raise ValueError(f"Invalid provider: {input.kernel_provider} for DyT")
+    return x, layer
 
-        ms_50, ms_20, ms_80 = triton.testing.do_bench(full, quantiles=QUANTILES, grad_to_none=[x], rep=500)
 
-    return SingleBenchmarkRunOutput(
-        y_20=ms_20,
-        y_50=ms_50,
-        y_80=ms_80,
-    )
+def bench_speed_dyt(input: SingleBenchmarkRunInput) -> SingleBenchmarkRunOutput:
+    x, layer = _setup_dyt(input)
+    return run_speed_benchmark(lambda: layer(x), input.kernel_operation_mode, [x])
 
 
 def bench_memory_dyt(input: SingleBenchmarkRunInput) -> SingleBenchmarkRunOutput:
-    from test.transformers.test_dyt import LigerDyT
-    from test.transformers.test_dyt import TorchDyT
+    x, layer = _setup_dyt(input)
+    return run_memory_benchmark(lambda: layer(x), input.kernel_operation_mode)
 
-    hidden_size = input.x
-    provider = input.kernel_provider
-    extra_benchmark_config = input.extra_benchmark_config
-    BT = extra_benchmark_config["BT"]
-    beta = extra_benchmark_config["beta"]
-    dtype = extra_benchmark_config["dtype"]
-
-    x_shape = (BT, hidden_size)
-    torch_dyt = TorchDyT(hidden_size=hidden_size, beta=beta).to(device)
-    torch_compile_dyt = torch.compile(TorchDyT(hidden_size=hidden_size, beta=beta).to(device))
-    triton_dyt = LigerDyT(hidden_size=hidden_size, beta=beta).to(device)
-
-    x = torch.randn(x_shape, dtype=dtype, device=device)
-    dy = torch.randn_like(x)
-    x.requires_grad_(True)
-
-    def fwd():
-        if provider == "liger":
-            return triton_dyt(x)
-        elif provider == "torch":
-            return torch_dyt(x)
-        elif provider == "torch_compile":
-            return torch_compile_dyt(x)
-
-    def full():
-        y = fwd()
-        y.backward(dy, retain_graph=True)
-
-    mem_50, mem_20, mem_80 = _test_memory(full, quantiles=QUANTILES)
-    return SingleBenchmarkRunOutput(
-        y_20=mem_20,
-        y_50=mem_50,
-        y_80=mem_80,
-    )
 
+BT = 4096
 
 if __name__ == "__main__":
     args = parse_benchmark_script_args()
+    model = get_benchmark_model_config(args.model)
 
     for beta in [False, True]:
+
+        def _probe():
+            probe_input = SingleBenchmarkRunInput(
+                x=model.hidden_size,
+                kernel_provider="torch",
+                extra_benchmark_config={"BT": BT, "dtype": model.dtype, "beta": beta},
+            )
+            x, layer = _setup_dyt(probe_input)
+            return layer(x)
+
+        peak_bytes = estimate_kernel_peak_memory(probe_fn=_probe)
+        sweep_config = compute_hidden_size_sweep_config(model, peak_bytes, bt=BT)
+        x_values = [1024 * i for i in range(1, 17) if 1024 * i <= sweep_config.max_hidden_size] or [model.hidden_size]
+
         common_configs = {
             "kernel_name": f"dyt_beta={beta}",
             "x_name": "hidden_size",
             "x_label": "hidden_size",
-            "x_values": [1024 * i for i in range(1, 17)],
+            "x_values": x_values,
             "kernel_providers": ["liger", "torch", "torch_compile"],
-            "extra_benchmark_configs": [{"BT": 4096, "dtype": torch.bfloat16, "beta": beta}],
+            "extra_benchmark_configs": [{"BT": sweep_config.bt, "dtype": model.dtype, "beta": beta}],
             "overwrite": args.overwrite,
         }
 
         run_benchmarks(
             bench_test_fn=bench_speed_dyt,
-            kernel_operation_modes=["forward", "backward", "full"],
+            kernel_operation_modes=["full", "forward", "backward"],
             metric_name="speed",
             metric_unit="ms",
             **common_configs,
         )
         run_benchmarks(
             bench_test_fn=bench_memory_dyt,
-            kernel_operation_modes=["full"],
+            kernel_operation_modes=["full", "forward", "backward"],
             metric_name="memory",
             metric_unit="MB",
             **common_configs,
```
