[Benchmark]: Add --sweep-mode and --bt to benchmark CLI.#1163

Open
noemotiovon wants to merge 3 commits into linkedin:main from noemotiovon:benchmark

Conversation

@noemotiovon (Contributor) commented Mar 24, 2026

This PR follows PR #1162 and implements Phase 2.

  • benchmark_model_configs: replace hidden-size sweep with compute_model_config_sweep_config / ModelConfigSweepConfig; probe each registry model to pick safe batch_size and seq_len for discrete sweeps.
  • benchmark_geglu / benchmark_swiglu: support model_config sweep across MODEL_REGISTRY via resolve_model_config* helpers.
  • benchmark_dyt: default path sweeps B*T with fixed model dimensions (compute_seq_len_sweep_config); optional model_config sweep; setup uses cfg hidden_size and input.x as BT.
  • utils: allow string x / x_values for model name indices; extend types.
  • benchmarks_visualizer: forward extra_config_filter to plotting.
  • BENCHMARK_GUIDELINES: document D1/D2 sweep patterns and model_config flow.

Hardware Type: Atlas 800I A2

  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence

@noemotiovon (Contributor, Author)

Benchmark Framework Design

This document describes the overall design of the Liger-Kernel benchmark suite, including its two benchmark dimensions, the shared infrastructure, and the phased implementation plan.

1. Benchmark Dimensions

Every operator should ideally be benchmarked along two orthogonal dimensions:

| Dimension | x-axis | Fixed | CLI | Goal |
| --- | --- | --- | --- | --- |
| D1: Non-model dimension sweep | sequence length, BT, etc. | model config | `--model` | Performance scaling across different input sizes |
| D2: Model dimension sweep | hidden_size, model configs, etc. | token count | `--bt` | Performance scaling across different model architectures |

D1: Non-model dimension sweep (implemented)

Sweep non-model dimensions (e.g. sequence length, BT) with a fixed model config selected via --model. This is the default behavior for all benchmark scripts.

x_values:  [1024, 2048, 4096, 8192, ...]     (seq_len or BT)
fixed:     model=llama_3_8b via --model       (hidden_size=4096, intermediate_size=14336, ...)
output:    line chart — speed/memory vs token length

D2: Model dimension sweep (implemented)

Sweep model architecture dimensions (e.g. hidden_size, or discrete model configs from MODEL_REGISTRY) with a fixed token count set via --bt. This reveals how kernel performance compares across different model architectures at the same input scale.

x_values:  [llama_2_7b, llama_3_8b, ...]     (discrete model configs)
fixed:     BxT via --bt                       (determined to be safe across all configs)
output:    speedup or throughput bar chart per model config

2. D2 Design Choices

Following the maintainer discussion, we evaluated three approaches:

| Approach | Description | Pros | Cons |
| --- | --- | --- | --- |
| A: Per-parameter sweep | Fix all but one model param and sweep it (e.g. sweep hidden_size with fixed intermediate_size, then vice versa) | Shows per-parameter scaling trend | Combinatorial; fixed values are arbitrary; model-dependent |
| B: N-dimensional scan | Vary all model parameters simultaneously | Most comprehensive data | Impractical runtime; data bloat |
| C: Discrete model configs | Run each MODEL_REGISTRY entry as one data point | Cost-efficient; realistic configs; easy to maintain | No continuous scaling trend |

Decision: C as the primary approach, with A as optional enrichment for ops where single-parameter scaling is important.

Rationale:

  • C uses real-world model architectures, making results directly meaningful.
  • C naturally extends the existing MODEL_REGISTRY infrastructure.
  • C produces clean bar charts (speedup/throughput) that align across ops.
  • A can be layered on later for specific ops where parameter-level scaling trends matter.

3. Universal Token Length for D2

For D2 benchmarks, we need a fixed token length that is safe (no OOM) across all model configs and all operators.

Strategy

  1. Coupled tokens: define a (batch_size, seq_len) pair per model config, e.g. (B=2, T=1024) for llama_2_7b, (B=1, T=1024) for llama_3_8b. This allows adapting to each model's memory footprint while keeping the comparison fair (same total token count or same seq_len).
  2. Safety via probe: before running D2 benchmarks, run estimate_kernel_peak_memory for each (model_config, token_config) pair. If any config would OOM, reduce token count automatically.
  3. Forward compatibility: when new ops are added, the probe mechanism ensures safety without manual tuning. If a new op cannot fit any reasonable token count for a given model, the framework skips that data point with a warning rather than crashing.
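The probe-and-skip logic in steps 2–3 could be sketched as follows. This is illustrative only: `pick_safe_bt`, the probe signature (returning estimated peak bytes), and the memory budget are assumptions for the sketch, not the real helpers in the PR.

```python
import warnings
from typing import Callable, Dict, List, Optional

def pick_safe_bt(
    model_names: List[str],
    probe_fn: Callable[[str, int], int],  # hypothetical: returns estimated peak bytes
    bt: int = 2048,
    budget_bytes: int = 8 * 1024**3,
    min_bt: int = 128,
) -> Dict[str, Optional[int]]:
    """Halve the token count per model until the probe fits the budget;
    skip the model with a warning if even min_bt does not fit (step 3)."""
    safe: Dict[str, Optional[int]] = {}
    for name in model_names:
        cur = bt
        while cur >= min_bt and probe_fn(name, cur) > budget_bytes:
            cur //= 2  # reduce token count automatically (step 2)
        if cur < min_bt:
            warnings.warn(f"{name}: no safe token count found, skipping")
            safe[name] = None
        else:
            safe[name] = cur
    return safe
```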

Proposed CLI

# D1 (existing): token-length sweep with fixed model
python benchmark_geglu.py --model llama_3_8b

# D2 (new): model-config sweep with fixed token length
python benchmark_geglu.py --sweep-mode model_config --bt 2048

The --sweep-mode flag selects the dimension. Default remains token_length (D1) for backward compatibility.
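A minimal `argparse` sketch of the two flags (defaults shown here are illustrative; the real flags are added to `parse_benchmark_script_args()` in the PR):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="kernel benchmark")
    parser.add_argument("--model", default="llama_3_8b",
                        help="fixed model config for D1 sweeps")
    parser.add_argument(
        "--sweep-mode",
        choices=["token_length", "model_config"],
        default="token_length",  # D1 stays the default for backward compatibility
        help="which dimension to sweep",
    )
    parser.add_argument("--bt", type=int, default=2048,
                        help="fixed batch * seq_len for D2 sweeps")
    return parser
```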

4. Infrastructure Changes

4.1 New config type

from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class ModelConfigSweepConfig:
    """Config for D2 benchmarks that sweep across model configs."""
    model_configs: List[ModelConfig]  # models to benchmark
    bt: int                           # fixed batch * seq_len
    batch_size: int                   # safe batch size
    seq_len: int                      # safe seq_len

4.2 New helper

from typing import Callable, List

import torch

def compute_model_config_sweep_config(
    model_configs: List[ModelConfig],
    probe_fn_factory: Callable[[ModelConfig, int], Callable[[], torch.Tensor]],
    bt: int = 2048,
    memory_utilization: float = 0.4,
) -> ModelConfigSweepConfig:
    """Find a safe (batch_size, seq_len) that works across all model configs.

    For each model config, runs probe_fn_factory(model_config, bt) to measure
    peak memory, then picks the most conservative batch_size / seq_len.
    """
    ...
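One way the "most conservative" pair could be derived, shown only as an illustration (`factor_bt` is a hypothetical helper, not part of the PR): enumerate `(batch_size, seq_len)` factorizations of `bt` with power-of-two `seq_len`, then probe each candidate against every model config and keep one that fits everywhere.

```python
from typing import List, Tuple

def factor_bt(bt: int, min_seq_len: int = 128) -> List[Tuple[int, int]]:
    """Enumerate (batch_size, seq_len) pairs with batch_size * seq_len == bt,
    restricted to power-of-two seq_len >= min_seq_len."""
    pairs = []
    seq_len = min_seq_len
    while seq_len <= bt:
        if bt % seq_len == 0:
            pairs.append((bt // seq_len, seq_len))
        seq_len *= 2
    return pairs
```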

4.3 Script-level changes

Each benchmark script gains a model-config sweep code path gated by --sweep-mode:

if args.sweep_mode == "model_config":
    configs = list(MODEL_REGISTRY.values())
    sweep = compute_model_config_sweep_config(configs, probe_fn_factory=..., bt=args.bt)
    # x_values = model config indices
    # extra_benchmark_configs = contains all model configs
    ...
else:
    # existing token-length sweep logic
    ...

4.4 Visualization

D2 results produce grouped bar charts (speedup or throughput) rather than line charts:

  • x-axis: model config names (e.g. llama_2_7b, llama_3_8b)
  • bars: kernel providers (liger vs huggingface/torch)
  • y-axis: speedup ratio or throughput (tokens/s)
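The y-axis numbers behind such a bar chart reduce to per-model speedup ratios. A hedged sketch of that reduction (the provider names and the timing-dict layout are assumptions, not the visualizer's actual data model):

```python
from typing import Dict

def speedup_per_model(
    timings_ms: Dict[str, Dict[str, float]],  # {model_name: {provider: median ms}}
    baseline: str = "huggingface",
    target: str = "liger",
) -> Dict[str, float]:
    """Speedup ratio (baseline time / target time) per model config;
    values > 1.0 mean the target kernel is faster."""
    return {
        model: t[baseline] / t[target]
        for model, t in timings_ms.items()
        if baseline in t and target in t
    }
```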

5. Phased Implementation Plan

Phase 1: Foundation (current PR)

Status: complete

  • ModelConfig and MODEL_REGISTRY with canonical model profiles
  • estimate_kernel_peak_memory() for runtime memory probing
  • compute_seq_len_sweep_config() for D1 seq_len sweeps (non-model dimension)
  • compute_hidden_size_sweep_config() for D2 hidden_size sweeps (model dimension)
  • run_speed_benchmark() / run_memory_benchmark() shared helpers
  • --model CLI argument
  • Refactored benchmark_geglu.py, benchmark_swiglu.py, benchmark_dyt.py
  • BENCHMARK_GUIDELINES.md contributor guide

Phase 2: Model-config sweep (D2)

Status: complete

  • Add ModelConfigSweepConfig dataclass
  • Implement compute_model_config_sweep_config() with cross-model probe
  • Add --sweep-mode and --bt CLI arguments to parse_benchmark_script_args()
  • Add model-config sweep code path to benchmark_geglu.py as reference implementation
  • Model-config sweep code path ported to benchmark_swiglu.py and benchmark_dyt.py
  • Validate on at least 2 devices (CUDA + NPU) to confirm OOM safety
  • Update BENCHMARK_GUIDELINES.md with D2 instructions

Phase 3: Rollout and visualization

Status: planned

  • Port D2 support to all existing benchmark scripts
  • Add bar chart / speedup visualization for D2 results
  • Integrate into CI workflows
  • Add more model profiles to MODEL_REGISTRY (e.g. Qwen-2.5-7B)

6. Directory Structure

benchmark/
├── data/
│   └── all_benchmark_data.csv
├── scripts/
│   ├── benchmark_model_configs.py      # ModelConfig, MODEL_REGISTRY, helpers
│   ├── utils.py                        # run_benchmarks, CSV, CLI
│   ├── benchmark_geglu.py              # D1 + D2
│   ├── benchmark_swiglu.py             # D1 + D2
│   ├── benchmark_dyt.py                # D1 + D2
│   └── ...
├── visualize/                          # (Phase 3) chart generation
│   └── ...
└── BENCHMARK_GUIDELINES.md             # contributor guide

@noemotiovon (Contributor, Author) commented Mar 25, 2026

GEGLU Test:

Script:

# D1
python scripts/benchmark_geglu.py --overwrite
python benchmarks_visualizer.py --kernel-name geglu --metric-name speed --overwrite
# D2
python scripts/benchmark_geglu.py --sweep-mode model_config --overwrite
python benchmarks_visualizer.py --kernel-name geglu --metric-name speed --overwrite --sweep-mode model_config

Result:
(benchmark charts attached)

@noemotiovon (Contributor, Author) commented Mar 25, 2026

SwiGLU Test:

Script:

# D1
python scripts/benchmark_swiglu.py --overwrite
python benchmarks_visualizer.py --kernel-name swiglu --metric-name speed --overwrite
# D2
python scripts/benchmark_swiglu.py --sweep-mode model_config --overwrite
python benchmarks_visualizer.py --kernel-name swiglu --metric-name speed --overwrite --sweep-mode model_config

Result:
(benchmark charts attached)

@noemotiovon (Contributor, Author) commented Mar 25, 2026

DyT Test:

Script:

export TORCH_COMPILE_DISABLE=1
# D1
python scripts/benchmark_dyt.py --overwrite
python benchmarks_visualizer.py --kernel-name dyt_beta=False --metric-name speed --overwrite
# D2
python scripts/benchmark_dyt.py --sweep-mode model_config --overwrite
python benchmarks_visualizer.py --kernel-name dyt_beta=False --metric-name speed --overwrite --sweep-mode model_config

Result:
(benchmark charts attached)

@noemotiovon noemotiovon marked this pull request as ready for review March 25, 2026 07:34
@noemotiovon (Contributor, Author)

Hi @Tcc0403, could you take a look at my code?

noemotiovon and others added 3 commits March 27, 2026 03:00
- benchmark_model_configs: replace hidden-size sweep with
  compute_model_config_sweep_config / ModelConfigSweepConfig; probe each
  registry model to pick safe batch_size and seq_len for discrete sweeps.
- benchmark_geglu / benchmark_swiglu: support model_config sweep across
  MODEL_REGISTRY via _resolve_model_config_* helpers.
- benchmark_dyt: default path sweeps B*T with fixed model dimensions
  (compute_seq_len_sweep_config); optional model_config sweep; setup uses
  cfg hidden_size and input.x as BT.
- utils: allow string x / x_values for model name indices; extend types.
- benchmarks_visualizer: forward extra_config_filter to plotting.
- BENCHMARK_GUIDELINES: document D1/D2 sweep patterns and model_config flow.
…aling

- Add --sweep-mode argument (token_length|model_config) to
  benchmarks_visualizer.py for filtering data by sweep type via the
  x_name column in CSV, defaulting to token_length
- Fix x-axis scaling: convert numeric x_values to proper numeric type
  so matplotlib plots them proportionally instead of equally spaced;
  string x_values (e.g. model names) remain categorical
- Set tick labels only at actual data points for numeric axes
- Include sweep_mode suffix in output PNG filenames to avoid overwriting
  when both sweep types exist for the same kernel
- Update README.md with --sweep-mode usage and examples
  benchmarks_visualizer.py:
  - Add `--gpu-filter` CLI flag to select a specific GPU when benchmark
    data contains results from multiple devices; falls back to the most
    recent device with a warning when omitted or unmatched.
  - Extract `gpu_name_filter()` and `extra_config_filter()` as standalone
    helpers; `load_data()` now applies filters in explicit order:
    kernel/metric/mode → sweep-mode → GPU → extra config.

  BENCHMARK_GUIDELINES.md:
  - Add guideline: import baseline kernels from the test suite instead
    of duplicating reference implementations in benchmark scripts.
  - Remove the continuous hidden-size sweep variant (D2.1) and
    `compute_hidden_size_sweep_config()` reference; D2 now covers only
    the discrete model-config sweep.

Co-authored-by: Tcc0403 <76503978+Tcc0403@users.noreply.github.com>
@noemotiovon (Contributor, Author)

Hi @Tcc0403, I’ve made updates according to the review comments. Happy to discuss further!
