[Benchmark]: Add --sweep-mode and --bt to benchmark CLI.#1163

Open
noemotiovon wants to merge 3 commits into linkedin:main from noemotiovon:benchmark

Conversation

@noemotiovon (Contributor) commented Mar 24, 2026

This PR follows PR #1162 and implements Phase 2.

  • benchmark_model_configs: replace hidden-size sweep with compute_model_config_sweep_config / ModelConfigSweepConfig; probe each registry model to pick safe batch_size and seq_len for discrete sweeps.
  • benchmark_geglu / benchmark_swiglu: support model_config sweep across MODEL_REGISTRY via resolve_model_config* helpers.
  • benchmark_dyt: default path sweeps B*T with fixed model dimensions (compute_seq_len_sweep_config); optional model_config sweep; setup uses cfg hidden_size and input.x as BT.
  • utils: allow string x / x_values for model name indices; extend types.
  • benchmarks_visualizer: forward extra_config_filter to plotting.
  • BENCHMARK_GUIDELINES: document D1/D2 sweep patterns and model_config flow.

Hardware Type: Atlas 800I A2

  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence

@noemotiovon (Contributor, Author)

Benchmark Framework Design

This document describes the overall design of the Liger-Kernel benchmark suite, including its two benchmark dimensions, the shared infrastructure, and the phased implementation plan.

1. Benchmark Dimensions

Every operator should ideally be benchmarked along two orthogonal dimensions:

| Dimension | x-axis | Fixed | CLI | Goal |
| --- | --- | --- | --- | --- |
| D1: Non-model dimension sweep | sequence length, BT, etc. | model config | `--model` | Performance scaling across different input sizes |
| D2: Model dimension sweep | hidden_size, model configs, etc. | token count | `--bt` | Performance scaling across different model architectures |

D1: Non-model dimension sweep (implemented)

Sweep non-model dimensions (e.g. sequence length, BT) with a fixed model config selected via --model. This is the default behavior for all benchmark scripts.

x_values:  [1024, 2048, 4096, 8192, ...]     (seq_len or BT)
fixed:     model=llama_3_8b via --model       (hidden_size=4096, intermediate_size=14336, ...)
output:    line chart — speed/memory vs token length

D2: Model dimension sweep (implemented)

Sweep model architecture dimensions (e.g. hidden_size, or discrete model configs from MODEL_REGISTRY) with a fixed token count set via --bt. This reveals how kernel performance compares across different model architectures at the same input scale.

x_values:  [llama_2_7b, llama_3_8b, ...]     (discrete model configs)
fixed:     BxT via --bt                       (determined to be safe across all configs)
output:    speedup or throughput bar chart per model config

2. D2 Design Choices

Following the maintainer discussion, we evaluated three approaches:

| Approach | Description | Pros | Cons |
| --- | --- | --- | --- |
| A: Per-parameter sweep | Fix all but one model param and sweep it (e.g. sweep hidden_size with fixed intermediate_size, then vice versa) | Shows per-parameter scaling trend | Combinatorial; fixed values are arbitrary; model-dependent |
| B: N-dimensional scan | Vary all model parameters simultaneously | Most comprehensive data | Impractical runtime; data bloat |
| C: Discrete model configs | Run each MODEL_REGISTRY entry as one data point | Cost-efficient; realistic configs; easy to maintain | No continuous scaling trend |

Decision: C as the primary approach, with A as optional enrichment for ops where single-parameter scaling is important.

Rationale:

  • C uses real-world model architectures, making results directly meaningful.
  • C naturally extends the existing MODEL_REGISTRY infrastructure.
  • C produces clean bar charts (speedup/throughput) that align across ops.
  • A can be layered on later for specific ops where parameter-level scaling trends matter.

3. Universal Token Length for D2

For D2 benchmarks, we need a fixed token length that is safe (no OOM) across all model configs and all operators.

Strategy

  1. Coupled tokens: define a (batch_size, seq_len) pair per model config, e.g. (B=2, T=1024) for llama_2_7b, (B=1, T=1024) for llama_3_8b. This allows adapting to each model's memory footprint while keeping the comparison fair (same total token count or same seq_len).
  2. Safety via probe: before running D2 benchmarks, run estimate_kernel_peak_memory for each (model_config, token_config) pair. If any config would OOM, reduce token count automatically.
  3. Forward compatibility: when new ops are added, the probe mechanism ensures safety without manual tuning. If a new op cannot fit any reasonable token count for a given model, the framework skips that data point with a warning rather than crashing.
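The probe-and-skip logic in steps 2–3 could be sketched as follows. This is illustrative only: `pick_safe_bt`, the probe signature (returning estimated peak bytes), and the memory budget are assumptions for the sketch, not the real helpers in the PR.

```python
import warnings
from typing import Callable, Dict, List, Optional

def pick_safe_bt(
    model_names: List[str],
    probe_fn: Callable[[str, int], int],  # hypothetical: returns estimated peak bytes
    bt: int = 2048,
    budget_bytes: int = 8 * 1024**3,
    min_bt: int = 128,
) -> Dict[str, Optional[int]]:
    """Halve the token count per model until the probe fits the budget;
    skip the model with a warning if even min_bt does not fit (step 3)."""
    safe: Dict[str, Optional[int]] = {}
    for name in model_names:
        cur = bt
        while cur >= min_bt and probe_fn(name, cur) > budget_bytes:
            cur //= 2  # reduce token count automatically (step 2)
        if cur < min_bt:
            warnings.warn(f"{name}: no safe token count found, skipping")
            safe[name] = None
        else:
            safe[name] = cur
    return safe
```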

Proposed CLI

# D1 (existing): token-length sweep with fixed model
python benchmark_geglu.py --model llama_3_8b

# D2 (new): model-config sweep with fixed token length
python benchmark_geglu.py --sweep-mode model_config --bt 2048

The --sweep-mode flag selects the dimension. Default remains token_length (D1) for backward compatibility.
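A minimal `argparse` sketch of the two flags (defaults shown here are illustrative; the real flags are added to `parse_benchmark_script_args()` in the PR):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="kernel benchmark")
    parser.add_argument("--model", default="llama_3_8b",
                        help="fixed model config for D1 sweeps")
    parser.add_argument(
        "--sweep-mode",
        choices=["token_length", "model_config"],
        default="token_length",  # D1 stays the default for backward compatibility
        help="which dimension to sweep",
    )
    parser.add_argument("--bt", type=int, default=2048,
                        help="fixed batch * seq_len for D2 sweeps")
    return parser
```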

4. Infrastructure Changes

4.1 New config type

from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class ModelConfigSweepConfig:
    """Config for D2 benchmarks that sweep across model configs."""
    model_configs: List[ModelConfig]  # models to benchmark
    bt: int                           # fixed batch * seq_len
    batch_size: int                   # safe batch size
    seq_len: int                      # safe seq_len

4.2 New helper

from typing import Callable, List

import torch

def compute_model_config_sweep_config(
    model_configs: List[ModelConfig],
    probe_fn_factory: Callable[[ModelConfig, int], Callable[[], torch.Tensor]],
    bt: int = 2048,
    memory_utilization: float = 0.4,
) -> ModelConfigSweepConfig:
    """Find a safe (batch_size, seq_len) that works across all model configs.

    For each model config, runs probe_fn_factory(model_config, bt) to measure
    peak memory, then picks the most conservative batch_size / seq_len.
    """
    ...
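One way the "most conservative" pair could be derived, shown only as an illustration (`factor_bt` is a hypothetical helper, not part of the PR): enumerate `(batch_size, seq_len)` factorizations of `bt` with power-of-two `seq_len`, then probe each candidate against every model config and keep one that fits everywhere.

```python
from typing import List, Tuple

def factor_bt(bt: int, min_seq_len: int = 128) -> List[Tuple[int, int]]:
    """Enumerate (batch_size, seq_len) pairs with batch_size * seq_len == bt,
    restricted to power-of-two seq_len >= min_seq_len."""
    pairs = []
    seq_len = min_seq_len
    while seq_len <= bt:
        if bt % seq_len == 0:
            pairs.append((bt // seq_len, seq_len))
        seq_len *= 2
    return pairs
```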

4.3 Script-level changes

Each benchmark script gains a model-config sweep code path gated by --sweep-mode:

if args.sweep_mode == "model_config":
    configs = list(MODEL_REGISTRY.values())
    sweep = compute_model_config_sweep_config(configs, probe_fn_factory=..., bt=args.bt)
    # x_values = model config indices
    # extra_benchmark_configs = contains all model configs
    ...
else:
    # existing token-length sweep logic
    ...

4.4 Visualization

D2 results produce grouped bar charts (speedup or throughput) rather than line charts:

  • x-axis: model config names (e.g. llama_2_7b, llama_3_8b)
  • bars: kernel providers (liger vs huggingface/torch)
  • y-axis: speedup ratio or throughput (tokens/s)
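The y-axis numbers behind such a bar chart reduce to per-model speedup ratios. A hedged sketch of that reduction (the provider names and the timing-dict layout are assumptions, not the visualizer's actual data model):

```python
from typing import Dict

def speedup_per_model(
    timings_ms: Dict[str, Dict[str, float]],  # {model_name: {provider: median ms}}
    baseline: str = "huggingface",
    target: str = "liger",
) -> Dict[str, float]:
    """Speedup ratio (baseline time / target time) per model config;
    values > 1.0 mean the target kernel is faster."""
    return {
        model: t[baseline] / t[target]
        for model, t in timings_ms.items()
        if baseline in t and target in t
    }
```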

5. Phased Implementation Plan

Phase 1: Foundation (current PR)

Status: complete

  • ModelConfig and MODEL_REGISTRY with canonical model profiles
  • estimate_kernel_peak_memory() for runtime memory probing
  • compute_seq_len_sweep_config() for D1 seq_len sweeps (non-model dimension)
  • compute_hidden_size_sweep_config() for D2 hidden_size sweeps (model dimension)
  • run_speed_benchmark() / run_memory_benchmark() shared helpers
  • --model CLI argument
  • Refactored benchmark_geglu.py, benchmark_swiglu.py, benchmark_dyt.py
  • BENCHMARK_GUIDELINES.md contributor guide

Phase 2: Model-config sweep (D2)

Status: complete

  • Add ModelConfigSweepConfig dataclass
  • Implement compute_model_config_sweep_config() with cross-model probe
  • Add --sweep-mode and --bt CLI arguments to parse_benchmark_script_args()
  • Add model-config sweep code path to benchmark_geglu.py as reference implementation
  • Model-config sweep code path ported to benchmark_swiglu.py and benchmark_dyt.py
  • Validate on at least 2 devices (CUDA + NPU) to confirm OOM safety
  • Update BENCHMARK_GUIDELINES.md with D2 instructions

Phase 3: Rollout and visualization

Status: planned

  • Port D2 support to all existing benchmark scripts
  • Add bar chart / speedup visualization for D2 results
  • Integrate into CI workflows
  • Add more model profiles to MODEL_REGISTRY (e.g. Qwen-2.5-7B)

6. Directory Structure

benchmark/
├── data/
│   └── all_benchmark_data.csv
├── scripts/
│   ├── benchmark_model_configs.py      # ModelConfig, MODEL_REGISTRY, helpers
│   ├── utils.py                        # run_benchmarks, CSV, CLI
│   ├── benchmark_geglu.py              # D1 + D2
│   ├── benchmark_swiglu.py             # D1 + D2
│   ├── benchmark_dyt.py                # D1 + D2
│   └── ...
├── visualize/                          # (Phase 3) chart generation
│   └── ...
└── BENCHMARK_GUIDELINES.md             # contributor guide

@noemotiovon (Contributor, Author) commented Mar 25, 2026

GEGLU Test:

Script:

# D1
python scripts/benchmark_geglu.py --overwrite
python benchmarks_visualizer.py --kernel-name geglu --metric-name speed --overwrite
# D2
python scripts/benchmark_geglu.py --sweep-mode model_config --overwrite
python benchmarks_visualizer.py --kernel-name geglu --metric-name speed --overwrite --sweep-mode model_config

Result:
(benchmark charts attached)

@noemotiovon (Contributor, Author) commented Mar 25, 2026

SwiGLU Test:

Script:

# D1
python scripts/benchmark_swiglu.py --overwrite
python benchmarks_visualizer.py --kernel-name swiglu --metric-name speed --overwrite
# D2
python scripts/benchmark_swiglu.py --sweep-mode model_config --overwrite
python benchmarks_visualizer.py --kernel-name swiglu --metric-name speed --overwrite --sweep-mode model_config

Result:
(benchmark charts attached)

@noemotiovon (Contributor, Author) commented Mar 25, 2026

DyT Test:

Script:

export TORCH_COMPILE_DISABLE=1
# D1
python scripts/benchmark_dyt.py --overwrite
python benchmarks_visualizer.py --kernel-name dyt_beta=False --metric-name speed --overwrite
# D2
python scripts/benchmark_dyt.py --sweep-mode model_config --overwrite
python benchmarks_visualizer.py --kernel-name dyt_beta=False --metric-name speed --overwrite --sweep-mode model_config

Result:
(benchmark charts attached)

@noemotiovon noemotiovon marked this pull request as ready for review March 25, 2026 07:34
@noemotiovon (Contributor, Author)

Hi @Tcc0403, could you take a look at my code?

noemotiovon and others added 3 commits March 27, 2026 03:00
- benchmark_model_configs: replace hidden-size sweep with
  compute_model_config_sweep_config / ModelConfigSweepConfig; probe each
  registry model to pick safe batch_size and seq_len for discrete sweeps.
- benchmark_geglu / benchmark_swiglu: support model_config sweep across
  MODEL_REGISTRY via _resolve_model_config_* helpers.
- benchmark_dyt: default path sweeps B*T with fixed model dimensions
  (compute_seq_len_sweep_config); optional model_config sweep; setup uses
  cfg hidden_size and input.x as BT.
- utils: allow string x / x_values for model name indices; extend types.
- benchmarks_visualizer: forward extra_config_filter to plotting.
- BENCHMARK_GUIDELINES: document D1/D2 sweep patterns and model_config flow.
…aling

- Add --sweep-mode argument (token_length|model_config) to
  benchmarks_visualizer.py for filtering data by sweep type via the
  x_name column in CSV, defaulting to token_length
- Fix x-axis scaling: convert numeric x_values to proper numeric type
  so matplotlib plots them proportionally instead of equally spaced;
  string x_values (e.g. model names) remain categorical
- Set tick labels only at actual data points for numeric axes
- Include sweep_mode suffix in output PNG filenames to avoid overwriting
  when both sweep types exist for the same kernel
- Update README.md with --sweep-mode usage and examples
  benchmarks_visualizer.py:
  - Add `--gpu-filter` CLI flag to select a specific GPU when benchmark
    data contains results from multiple devices; falls back to the most
    recent device with a warning when omitted or unmatched.
  - Extract `gpu_name_filter()` and `extra_config_filter()` as standalone
    helpers; `load_data()` now applies filters in explicit order:
    kernel/metric/mode → sweep-mode → GPU → extra config.

  BENCHMARK_GUIDELINES.md:
  - Add guideline: import baseline kernels from the test suite instead
    of duplicating reference implementations in benchmark scripts.
  - Remove the continuous hidden-size sweep variant (D2.1) and
    `compute_hidden_size_sweep_config()` reference; D2 now covers only
    the discrete model-config sweep.

Co-authored-by: Tcc0403 <76503978+Tcc0403@users.noreply.github.com>
@noemotiovon (Contributor, Author)

Hi @Tcc0403, I’ve made updates according to the review comments. Happy to discuss further!
