[SGLANG] [Benchmarks] Initial integration of sglang kernels to benchmarks #6789
chengjunlu wants to merge 5 commits into main from
Conversation
Pull request overview
Integrates SGLang Triton kernels into the repo’s benchmark harness and wires them into the “third party benchmarks” GitHub Actions workflow so their performance can be captured and reported alongside existing benchmarks.
Changes:
- Add new benchmark entrypoints for SGLang attention (prefill/decode/extended) under benchmarks/triton_kernels_benchmark/.
- Add a standalone Triton FP8 block GEMM benchmark derived from SGLang’s FP8 kernel.
- Extend .github/workflows/third-party-benchmarks.yml to install SGLang and run the new benchmarks, producing CSV reports.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| benchmarks/triton_kernels_benchmark/prefill_attention_benchmark.py | Adds a prefill attention benchmark driver for SGLang’s context_attention_fwd. |
| benchmarks/triton_kernels_benchmark/decode_attention_benchmark.py | Adds a decode attention benchmark driver for SGLang’s decode_attention_fwd. |
| benchmarks/triton_kernels_benchmark/extended_attention_benchmark.py | Adds an “extended/append” attention benchmark driver for SGLang’s extend_attention_fwd. |
| benchmarks/triton_kernels_benchmark/block_fp8_gemm_benchmark.py | Adds a Triton FP8 block GEMM benchmark (and a native Torch reference) for correctness/perf. |
| .github/workflows/third-party-benchmarks.yml | Installs dependencies (PTI), installs SGLang, and runs the new benchmarks + report generation. |
        As: The per-token-group quantization scale for `A`.
        Bs: The per-block quantization scale for `B`.
        block_size: The block size for per-block quantization. It should be 2-dim, e.g., [128, 128].
        output_dytpe: The dtype of the returned tensor.
Docstring: output_dytpe is misspelled and doesn’t match the actual argument name output_dtype, which can confuse users of this helper.
-        output_dytpe: The dtype of the returned tensor.
+        output_dtype: The dtype of the returned tensor.
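For context, the helper being documented dequantizes block-quantized operands before a plain matmul. A minimal torch sketch of such a reference (the name `native_w8a8_block_matmul` is hypothetical, float32 stand-ins are used for the fp8 tensors, and the scale layouts are assumptions: one scale per (token, K-block) in `As`, one per (N-block, K-block) in `Bs`):

```python
import torch

def native_w8a8_block_matmul(A, As, B, Bs, block_size, output_dtype=torch.float32):
    # A: (M, K), As: (M, ceil(K / block_k)) per-token-group scales.
    # B: (N, K), Bs: (ceil(N / block_n), ceil(K / block_k)) per-block scales.
    block_n, block_k = block_size
    M, K = A.shape
    N, _ = B.shape
    # Expand each scale over its block, then dequantize elementwise.
    A_deq = A.to(torch.float32) * As.repeat_interleave(block_k, dim=1)[:, :K]
    scale_b = Bs.repeat_interleave(block_n, dim=0)[:N].repeat_interleave(block_k, dim=1)[:, :K]
    B_deq = B.to(torch.float32) * scale_b
    return (A_deq @ B_deq.T).to(output_dtype)
```

Under these assumptions the result has shape (M, N); the actual SGLang reference may differ in scale layout and rounding.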
    C = A.new_empty(C_shape, dtype=output_dtype)

    # Default config
    # Block-wise quant: BLOCK_SIZE_K must be divisable by block_size[1]
Comment typo: “divisable” should be “divisible”.
-    # Block-wise quant: BLOCK_SIZE_K must be divisable by block_size[1]
+    # Block-wise quant: BLOCK_SIZE_K must be divisible by block_size[1]
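The comment's constraint could also be enforced rather than just stated. A hypothetical guard (names illustrative), so a bad tile/block combination fails loudly instead of producing wrong scales:

```python
# Hypothetical guard: the kernel's K-dimension tile must cover an integer
# number of quantization blocks, so each tile reads whole scale columns.
def check_block_config(BLOCK_SIZE_K, block_size):
    block_n, block_k = block_size
    assert BLOCK_SIZE_K % block_k == 0, (
        f"BLOCK_SIZE_K={BLOCK_SIZE_K} must be divisible by block_size[1]={block_k}")

check_block_config(BLOCK_SIZE_K=128, block_size=[128, 128])  # OK
```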
      run: |
        git clone https://github.com/sgl-project/sglang.git
        cd sglang
        git apply ../benchmarks/third_party/sglang/sglang-fix.patch
This step applies ../benchmarks/third_party/sglang/sglang-fix.patch, but that file doesn’t exist in benchmarks/third_party/sglang/ (only sglang-bench-fix.patch / sglang-test-fix.patch are present). CI will fail at git apply; update the filename(s) or add the missing patch.
-        git apply ../benchmarks/third_party/sglang/sglang-fix.patch
+        git apply ../benchmarks/third_party/sglang/sglang-bench-fix.patch
        source ../../../scripts/capture-hw-details.sh
        python ../../triton_kernels_benchmark/build_report.py $REPORTS/sglang-extended-attn-performance.csv $REPORTS/sglang-append-attn-triton-report.csv --benchmark sglang-extended-attn --compiler triton --param_cols "B,Q_LEN,PREFIX_LEN,KV_LEN,H_Q,H_KV,D" --tflops_col Triton-TFlops --hbm_col "Triton-GB/s" --tag $TAG
Same incorrect build_report.py path as above; additionally, --param_cols "B,Q_LEN,PREFIX_LEN,KV_LEN,H_Q,H_KV,D" doesn’t match the columns produced by extended_attention_benchmark.py (it uses EXTEND_LEN/PREFIX_LEN and has no Q_LEN/KV_LEN). build_report.py will raise due to missing columns—update --param_cols to match the benchmark’s CSV headers.
        source ../../../scripts/capture-hw-details.sh
        python ../../triton_kernels_benchmark/build_report.py $REPORTS/sglang-fp8-gemm-performance.csv $REPORTS/sglang-fp8-gemm-triton-report.csv --benchmark sglang-block-fp8-gemm --compiler triton --param_cols "M,N,K" --tflops_col Triton-TFlops --hbm_col "Triton-GB/s" --tag $TAG
Same incorrect build_report.py path as above (this will fail from benchmarks/triton_kernels_benchmark).
        source ../../../scripts/capture-hw-details.sh
        python ../../triton_kernels_benchmark/build_report.py $REPORTS/sglang-decode-attn-performance.csv $REPORTS/sglang-decode-attn-triton-report.csv --benchmark sglang-decode-attn --compiler triton --param_cols "B,SEQ_LENS,H_Q,H_KV,D" --tflops_col Triton-TFlops --hbm_col "Triton-GB/s" --tag $TAG
Same issue here: from benchmarks/triton_kernels_benchmark, ../../triton_kernels_benchmark/build_report.py resolves to a non-existent path and will break report generation.
import triton_kernels_benchmark as benchmark_suit
Minor naming consistency: this file uses benchmark_suit as the alias for triton_kernels_benchmark, while the rest of the benchmarks typically use benchmark_suite. Consider renaming to match the established alias.
-import triton_kernels_benchmark as benchmark_suit
+import triton_kernels_benchmark as benchmark_suite
+benchmark_suit = benchmark_suite
        raise NotImplementedError(f"Unsupported provider {provider}")

    tflops = lambda ms: 2 * M * N * K * (1e-12) / (ms * 1e-3)
    gbps = lambda ms: (M * K + K * N) + 2.0 * (M * N) * (1e-9) / (ms * 1e-3)
The GB/s calculation is missing parentheses/time normalization: (M*K + K*N) is being added as a raw element count instead of being converted to GB and divided by runtime. This will produce incorrect bandwidth numbers in the report.
-    gbps = lambda ms: (M * K + K * N) + 2.0 * (M * N) * (1e-9) / (ms * 1e-3)
+    gbps = lambda ms: ((M * K + K * N) + 2.0 * (M * N)) * (1e-9) / (ms * 1e-3)
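A quick sanity check of the corrected formulas, pulled out as plain functions. The byte model is an assumption inferred from the code: 1-byte fp8 operands for A (M×K) and B (K×N), and a 2-byte (bf16/fp16) output, which is what the `2.0 * M * N` term implies.

```python
def tflops(ms, M, N, K):
    # 2*M*N*K FLOPs for a GEMM, converted to TFLOP and divided by seconds.
    return 2 * M * N * K * 1e-12 / (ms * 1e-3)

def gbps(ms, M, N, K):
    # Bytes moved: 1-byte A and B reads plus a 2-byte C write (assumed).
    return ((M * K + K * N) + 2.0 * (M * N)) * 1e-9 / (ms * 1e-3)

# Example: a 4096x4096x4096 GEMM that takes 1 ms.
print(round(tflops(1.0, 4096, 4096, 4096), 1))  # 137.4 TFLOPS
print(round(gbps(1.0, 4096, 4096, 4096), 1))    # 67.1 GB/s
```

With the original (unparenthesized) expression the `M * K + K * N` element count would be added to the GB/s value raw, inflating the reported bandwidth by tens of millions.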
        source ../../../scripts/capture-hw-details.sh
        python ../../triton_kernels_benchmark/build_report.py $REPORTS/sglang-prefill-attn-performance.csv $REPORTS/sglang-prefill-attn-triton-report.csv --benchmark sglang-prefill-attn --compiler triton --param_cols "B,SEQ_LENS,H_Q,H_KV,D,CAUSAL" --tflops_col Triton-TFlops --hbm_col "Triton-GB/s" --tag $TAG
build_report.py is in benchmarks/triton_kernels_benchmark/. From this step’s working directory (benchmarks/triton_kernels_benchmark), the path ../../triton_kernels_benchmark/build_report.py points to a non-existent top-level triton_kernels_benchmark/ directory and will fail. Use python build_report.py ... (or the correct relative path) instead.
import torch
from sglang.srt.layers.attention.triton_ops.extend_attention import (
    extend_attention_fwd, )
import triton_kernels_benchmark as benchmark_suit
Minor naming consistency: other benchmarks typically use benchmark_suite as the alias for triton_kernels_benchmark, but this file uses benchmark_suit (missing “e”). Renaming would align with the rest of the repo.
-import triton_kernels_benchmark as benchmark_suit
+import triton_kernels_benchmark as benchmark_suite
+benchmark_suit = benchmark_suite
Commits:
- Port prefill attn and decode attn from sglang
- Add validation
- temp add extend attention
- disable debug ir dump
- Update three stage attention benchmark
- Add sglang kernel benchmark to action
- use 1e-3 atol
- remove sglang benchmark from triton-benchmarks
- Fix setup bdist_wheel
- Add sglang to thirdparty test
- Address review comments
- Remove sglang from tests
- Fix CI
- Integrate sglang prefill/decode/extend kernel to benchmarks
- Adjust params term
- Adjust tflops computation
- fix bugs rtol atol
- Move fp8 gemm to sglang benchmark
- Fix CI XPU not found
    # o will have the same shape as q
    o = torch.zeros(B, H_Q, D, dtype=dtype, device=device)

    b_seq_len = torch.full((B, ), N_CTX, device=device)
b_seq_len should be dtype=torch.int32 explicitly — SGLang's decode_attention_fwd expects int32 and the later cumsum result gets silently cast into the int32 kv_indptr.
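A minimal sketch of the suggested fix (shapes illustrative, CPU tensors for brevity). Creating `b_seq_len` as int32 keeps the downstream index tensors consistent; note that assigning a cumsum result into a preallocated int32 tensor casts silently, which is exactly the behavior the comment warns about:

```python
import torch

B, N_CTX = 4, 1024

# Explicit int32, as decode_attention_fwd expects; without the dtype
# argument, torch.full with an int fill value defaults to int64.
b_seq_len = torch.full((B,), N_CTX, dtype=torch.int32)

# kv_indptr built from cumulative sequence lengths; the assignment into
# an int32 tensor casts the cumsum result without warning.
kv_indptr = torch.zeros(B + 1, dtype=torch.int32)
kv_indptr[1:] = torch.cumsum(b_seq_len, dim=0)
```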
    quantiles = [0.5, 0.0, 1.0]
    if provider == 'triton' and MODE == 'fwd':
        triton_fn = lambda: context_attention_fwd(q, k, v, o, b_start_loc, b_seq_len, max_seq_len, is_causal=CAUSAL)
        _, min_ms, max_ms, mean_ms, cv = benchmark_suit.do_bench(triton_fn, n_warmup=10, n_repeat=10,
No numerical validation against a torch reference before timing — unlike block_fp8_gemm_benchmark.py, the three attention benchmarks (prefill/decode/extended) can silently report plausible numbers on a broken kernel.
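A hedged sketch of such a check for the prefill case, using a naive per-sequence torch SDPA reference. This is illustrative only: it assumes H_Q == H_KV, an unpaged (total_tokens, H, D) packed layout, and does not replicate SGLang's exact kernel semantics.

```python
import torch

def naive_prefill_reference(q, k, v, b_start_loc, b_seq_len, causal):
    """Naive varlen prefill attention reference (illustrative sketch)."""
    out = torch.empty_like(q)
    for start, length in zip(b_start_loc.tolist(), b_seq_len.tolist()):
        # Slice one sequence and move heads to the batch dimension: (H, L, D).
        qi = q[start:start + length].transpose(0, 1)
        ki = k[start:start + length].transpose(0, 1)
        vi = v[start:start + length].transpose(0, 1)
        oi = torch.nn.functional.scaled_dot_product_attention(qi, ki, vi, is_causal=causal)
        out[start:start + length] = oi.transpose(0, 1)
    return out
```

One could then run `triton_fn()` once before timing and compare with `torch.testing.assert_close(o, naive_prefill_reference(q, k, v, b_start_loc, b_seq_len, CAUSAL), atol=1e-3, rtol=1e-3)`, mirroring the 1e-3 atol already used elsewhere in this PR.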
This PR continues the work in #3796.
Initial enabling of the SGLang benchmarks: it adds the SGLang prefill/decode/extended attention kernels and the FP8 block-quantized GEMM to the third-party benchmark suite.