[SGLANG] [Benchmarks] Initial integration of sglang kernels to benchmarks #6789

Open
chengjunlu wants to merge 5 commits into main from chengjun/init_sglang_benchmark

Conversation

@chengjunlu
Contributor

This PR continues the work in #3796.

It provides the initial enablement of the SGLang benchmarks: the SGLang prefill, decode, and extended attention kernels and the FP8 quantized GEMM are added to the third-party benchmarks.

Contributor

Copilot AI left a comment

Pull request overview

Integrates SGLang Triton kernels into the repo’s benchmark harness and wires them into the “third party benchmarks” GitHub Actions workflow so their performance can be captured and reported alongside existing benchmarks.

Changes:

  • Add new benchmark entrypoints for SGLang attention (prefill/decode/extended) under benchmarks/triton_kernels_benchmark/.
  • Add a standalone Triton FP8 block GEMM benchmark derived from SGLang’s FP8 kernel.
  • Extend .github/workflows/third-party-benchmarks.yml to install SGLang and run the new benchmarks, producing CSV reports.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 15 comments.

| File | Description |
| --- | --- |
| benchmarks/triton_kernels_benchmark/prefill_attention_benchmark.py | Adds a prefill attention benchmark driver for SGLang’s context_attention_fwd. |
| benchmarks/triton_kernels_benchmark/decode_attention_benchmark.py | Adds a decode attention benchmark driver for SGLang’s decode_attention_fwd. |
| benchmarks/triton_kernels_benchmark/extended_attention_benchmark.py | Adds an “extended/append” attention benchmark driver for SGLang’s extend_attention_fwd. |
| benchmarks/triton_kernels_benchmark/block_fp8_gemm_benchmark.py | Adds a Triton FP8 block GEMM benchmark (and a native Torch reference) for correctness/perf. |
| .github/workflows/third-party-benchmarks.yml | Installs dependencies (PTI), installs SGLang, and runs the new benchmarks + report generation. |

As: The per-token-group quantization scale for `A`.
Bs: The per-block quantization scale for `B`.
block_size: The block size for per-block quantization. It should be 2-dim, e.g., [128, 128].
output_dytpe: The dtype of the returned tensor.

Copilot AI Apr 30, 2026

Docstring: output_dytpe is misspelled and doesn’t match the actual argument name output_dtype, which can confuse users of this helper.

Suggested change
- output_dytpe: The dtype of the returned tensor.
+ output_dtype: The dtype of the returned tensor.

Copilot uses AI. Check for mistakes.
C = A.new_empty(C_shape, dtype=output_dtype)

# Default config
# Block-wise quant: BLOCK_SIZE_K must be divisable by block_size[1]

Copilot AI Apr 30, 2026

Comment typo: “divisable” should be “divisible”.

Suggested change
- # Block-wise quant: BLOCK_SIZE_K must be divisable by block_size[1]
+ # Block-wise quant: BLOCK_SIZE_K must be divisible by block_size[1]

run: |
git clone https://github.com/sgl-project/sglang.git
cd sglang
git apply ../benchmarks/third_party/sglang/sglang-fix.patch

Copilot AI Apr 30, 2026

This step applies ../benchmarks/third_party/sglang/sglang-fix.patch, but that file doesn’t exist in benchmarks/third_party/sglang/ (only sglang-bench-fix.patch / sglang-test-fix.patch are present). CI will fail at git apply; update the filename(s) or add the missing patch.

Suggested change
- git apply ../benchmarks/third_party/sglang/sglang-fix.patch
+ git apply ../benchmarks/third_party/sglang/sglang-bench-fix.patch

Comment on lines +270 to +272

source ../../../scripts/capture-hw-details.sh
python ../../triton_kernels_benchmark/build_report.py $REPORTS/sglang-extended-attn-performance.csv $REPORTS/sglang-append-attn-triton-report.csv --benchmark sglang-extended-attn --compiler triton --param_cols "B,Q_LEN,PREFIX_LEN,KV_LEN,H_Q,H_KV,D" --tflops_col Triton-TFlops --hbm_col "Triton-GB/s" --tag $TAG

Copilot AI Apr 30, 2026

The build_report.py path is incorrect here as well: from benchmarks/triton_kernels_benchmark, ../../triton_kernels_benchmark/build_report.py does not resolve to an existing file. Additionally, --param_cols "B,Q_LEN,PREFIX_LEN,KV_LEN,H_Q,H_KV,D" doesn’t match the columns produced by extended_attention_benchmark.py (it uses EXTEND_LEN/PREFIX_LEN and has no Q_LEN/KV_LEN). build_report.py will raise due to the missing columns, so update --param_cols to match the benchmark’s CSV headers.
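
One way to catch this kind of mismatch before the report step runs is to compare the requested --param_cols against the CSV header. The sketch below is only an illustration: `missing_columns` is a hypothetical helper, and the sample header is an assumption based on the column names mentioned in this comment, not the benchmark's actual output.

```python
import csv
import io


def missing_columns(csv_text, param_cols):
    """Return the requested parameter columns that are absent from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)))
    requested = [col.strip() for col in param_cols.split(",")]
    return [col for col in requested if col not in header]


# Hypothetical CSV: the header is assumed from the names in the review comment.
sample_csv = ("B,EXTEND_LEN,PREFIX_LEN,H_Q,H_KV,D,Triton-TFlops,Triton-GB/s\n"
              "2,1024,512,32,8,128,10.0,100.0\n")

# The value passed in the workflow asks for Q_LEN and KV_LEN, which are not in the header.
print(missing_columns(sample_csv, "B,Q_LEN,PREFIX_LEN,KV_LEN,H_Q,H_KV,D"))

# A value built from the header's own names has nothing missing.
print(missing_columns(sample_csv, "B,EXTEND_LEN,PREFIX_LEN,H_Q,H_KV,D"))
```

A check like this could run in the workflow right before report generation, so a renamed column fails with a clear message rather than inside the report script.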

Comment on lines +281 to +282
source ../../../scripts/capture-hw-details.sh
python ../../triton_kernels_benchmark/build_report.py $REPORTS/sglang-fp8-gemm-performance.csv $REPORTS/sglang-fp8-gemm-triton-report.csv --benchmark sglang-block-fp8-gemm --compiler triton --param_cols "M,N,K" --tflops_col Triton-TFlops --hbm_col "Triton-GB/s" --tag $TAG

Copilot AI Apr 30, 2026

The build_report.py path is incorrect here as well: from benchmarks/triton_kernels_benchmark, ../../triton_kernels_benchmark/build_report.py does not resolve to an existing file, so this step will fail.

Comment on lines +261 to +263
source ../../../scripts/capture-hw-details.sh
python ../../triton_kernels_benchmark/build_report.py $REPORTS/sglang-decode-attn-performance.csv $REPORTS/sglang-decode-attn-triton-report.csv --benchmark sglang-decode-attn --compiler triton --param_cols "B,SEQ_LENS,H_Q,H_KV,D" --tflops_col Triton-TFlops --hbm_col "Triton-GB/s" --tag $TAG

Copilot AI Apr 30, 2026

Same issue here: from benchmarks/triton_kernels_benchmark, ../../triton_kernels_benchmark/build_report.py resolves to a non-existent path and will break report generation.

Comment on lines +14 to +15
import triton_kernels_benchmark as benchmark_suit

Copilot AI Apr 30, 2026

Minor naming consistency: this file uses benchmark_suit as the alias for triton_kernels_benchmark, while the rest of the benchmarks typically use benchmark_suite. Consider renaming to match the established alias.

Suggested change
- import triton_kernels_benchmark as benchmark_suit
+ import triton_kernels_benchmark as benchmark_suite
+ benchmark_suit = benchmark_suite

raise NotImplementedError(f"Unsupported provider {provider}")

tflops = lambda ms: 2 * M * N * K * (1e-12) / (ms * 1e-3)
gbps = lambda ms: (M * K + K * N) + 2.0 * (M * N) * (1e-9) / (ms * 1e-3)

Copilot AI Apr 30, 2026

The GB/s calculation is missing parentheses/time normalization: (M*K + K*N) is being added as a raw element count instead of being converted to GB and divided by runtime. This will produce incorrect bandwidth numbers in the report.

Suggested change
- gbps = lambda ms: (M * K + K * N) + 2.0 * (M * N) * (1e-9) / (ms * 1e-3)
+ gbps = lambda ms: ((M * K + K * N) + 2.0 * (M * N)) * (1e-9) / (ms * 1e-3)
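
To make the precedence issue concrete, here is a small sketch that evaluates both forms of the expression. The shape and the 1 ms runtime are made-up values for illustration, and reading the formula as "1 byte per element for A and B, 2 bytes per element for C" is an assumption about the intent, not something stated in the PR.

```python
# Hypothetical problem size and runtime, chosen only for illustration.
M, N, K = 1024, 4096, 4096


def gbps_as_written(ms):
    # Only the 2.0 * (M * N) term is scaled to GB and divided by the runtime;
    # (M * K + K * N) is added as a raw element count.
    return (M * K + K * N) + 2.0 * (M * N) * (1e-9) / (ms * 1e-3)


def gbps_fixed(ms):
    # All traffic is summed first, then converted to GB and divided by seconds.
    return ((M * K + K * N) + 2.0 * (M * N)) * (1e-9) / (ms * 1e-3)


for ms in (1.0, 2.0):
    print(f"ms={ms}: as written={gbps_as_written(ms):.3f}, fixed={gbps_fixed(ms):.3f}")
```

With the original expression the result stays in the tens of millions and barely changes when the runtime doubles, which is a quick sign that most of the value is not being normalized by time.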

Comment on lines +251 to +253
source ../../../scripts/capture-hw-details.sh
python ../../triton_kernels_benchmark/build_report.py $REPORTS/sglang-prefill-attn-performance.csv $REPORTS/sglang-prefill-attn-triton-report.csv --benchmark sglang-prefill-attn --compiler triton --param_cols "B,SEQ_LENS,H_Q,H_KV,D,CAUSAL" --tflops_col Triton-TFlops --hbm_col "Triton-GB/s" --tag $TAG

Copilot AI Apr 30, 2026

build_report.py is in benchmarks/triton_kernels_benchmark/. From this step’s working directory (benchmarks/triton_kernels_benchmark), the path ../../triton_kernels_benchmark/build_report.py points to a non-existent top-level triton_kernels_benchmark/ directory and will fail. Use python build_report.py ... (or the correct relative path) instead.
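
The path arithmetic can be checked lexically without touching the filesystem. This is a minimal sketch using the standard library; `resolve` is a hypothetical helper, and the working directory is taken from this comment rather than verified against the workflow.

```python
import posixpath


def resolve(cwd, relative):
    """Join a relative path onto cwd and collapse any '..' segments (purely lexical)."""
    return posixpath.normpath(posixpath.join(cwd, relative))


# Working directory of the step, as described in the review comment.
cwd = "benchmarks/triton_kernels_benchmark"

# Path used in the workflow: two levels up lands at the repo root.
as_written = resolve(cwd, "../../triton_kernels_benchmark/build_report.py")

# Path suggested in the review: the script sits in the working directory itself.
suggested = resolve(cwd, "build_report.py")

print(as_written)
print(suggested)
```

The first result points at a top-level triton_kernels_benchmark/ directory, while the second stays under benchmarks/, which matches where the comment says build_report.py lives.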

import torch
from sglang.srt.layers.attention.triton_ops.extend_attention import (
extend_attention_fwd, )
import triton_kernels_benchmark as benchmark_suit

Copilot AI Apr 30, 2026

Minor naming consistency: other benchmarks typically use benchmark_suite as the alias for triton_kernels_benchmark, but this file uses benchmark_suit (missing “e”). Renaming would align with the rest of the repo.

Suggested change
- import triton_kernels_benchmark as benchmark_suit
+ import triton_kernels_benchmark as benchmark_suite
+ benchmark_suit = benchmark_suite

leonling-ll and others added 5 commits April 30, 2026 09:20
Port prefill attn and decode attn from sglang

Add validation

temp add extend attention

disable debug ir dump

Update three stage attention benchmark

Add sglang kernel benchmark to action

use 1e-3 atol

remove sglang benchmark from triton-benchmarks

Fix setup bdist_wheel

Add sglang to thirdparty test

Address review comments

Remove sglang from tests

Fix CI

Address review comments

Integrate sglang prefill/decode/extend kernel to benchmarks

Port prefill attn and decode attn from sglang

Add validation

temp add extend attention

disable debug ir dump

Update three stage attention benchmark

Add sglang kernel benchmark to action

use 1e-3 atol

remove sglang benchmark from triton-benchmarks

Fix setup bdist_wheel

Add sglang to thirdparty test

Address review comments

Remove sglang from tests

Adjust params term

Adjust tflops computation
fix bugs

rtol

atol

Move fp8 gemm to sglang benchmark
Address review comments

Fix CI XPU not found
# o will have the same shape as q
o = torch.zeros(B, H_Q, D, dtype=dtype, device=device)

b_seq_len = torch.full((B, ), N_CTX, device=device)
Contributor

b_seq_len should be created with dtype=torch.int32 explicitly: torch.full with an integer fill value defaults to int64, SGLang's decode_attention_fwd expects int32, and the later cumsum result is silently cast into the int32 kv_indptr.

quantiles = [0.5, 0.0, 1.0]
if provider == 'triton' and MODE == 'fwd':
triton_fn = lambda: context_attention_fwd(q, k, v, o, b_start_loc, b_seq_len, max_seq_len, is_causal=CAUSAL)
_, min_ms, max_ms, mean_ms, cv = benchmark_suit.do_bench(triton_fn, n_warmup=10, n_repeat=10,
Contributor

There is no numerical validation against a torch reference before timing. Unlike block_fp8_gemm_benchmark.py, the three attention benchmarks (prefill/decode/extended) can silently report plausible numbers for a broken kernel.
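
A minimal sketch of the validate-before-timing pattern, written in plain Python so it runs without a GPU. Everything here is illustrative: `reference_attention` and `validate_then_time` are hypothetical names, the tolerances are assumptions (the commit log mentions a 1e-3 atol), and the real benchmarks would compare the SGLang kernel output against a torch reference rather than these list-based helpers.

```python
import math
import time


def reference_attention(q, keys, values):
    """Single-query attention: softmax(q . K^T / sqrt(d)) applied to V."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]


def validate_then_time(candidate_fn, reference_fn, atol=1e-3, rtol=1e-3, n_repeat=10):
    """Check the candidate against the reference first; only time it if it matches."""
    got, want = candidate_fn(), reference_fn()
    if len(got) != len(want):
        raise ValueError("output length mismatch")
    for g, w in zip(got, want):
        if not math.isclose(g, w, rel_tol=rtol, abs_tol=atol):
            raise ValueError(f"mismatch: got {g}, expected {w}")
    start = time.perf_counter()
    for _ in range(n_repeat):
        candidate_fn()
    return (time.perf_counter() - start) / n_repeat * 1e3  # mean milliseconds


# Tiny made-up inputs: one query, two keys/values of dimension 2.
q = [0.1, 0.2]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]


def reference():
    return reference_attention(q, keys, values)


def broken_kernel():
    # Stand-in for a kernel that runs fine but returns wrong values.
    return [x + 1.0 for x in reference()]


print(f"mean ms: {validate_then_time(reference, reference):.4f}")
```

Calling `validate_then_time(broken_kernel, reference)` raises before any timing happens, which is the behavior the comment asks for in the attention benchmarks.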

Development

Successfully merging this pull request may close these issues.

[SGLANG] add sglang block fp8 gemm kernels into benchmark

5 participants