[Feat] profiler support cudagraph backend #1658
Conversation
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run the project's pre-submission checks. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
📝 Walkthrough: This PR introduces a CUDA Graph-based profiling backend ("cudagraph") in tilelang/profiler and threads a backend selection parameter through the profiler and autotuner APIs, with the GEMM autotune example updated to expose it.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tilelang/profiler/bench.py (1)
209-259: CUDA graph benchmarking implementation - cache flushing semantics differ from event backend.

The implementation looks functional, but there's a semantic difference from the event backend that could affect timing accuracy:

- Event backend: clears the L2 cache before each `fn()` call (line 154)
- Cudagraph backend: clears the L2 cache once before replaying a graph containing `n_repeat` iterations (line 240)

This means cudagraph measurements may include cache hits for iterations 2 through `n_repeat`, potentially reporting faster times than the event backend for cache-sensitive kernels.

Additionally, consider:

- `n_retries=10` is hardcoded - could be a parameter for consistency with other configurability
- No error handling for graph capture failures (some operations aren't CUDA graph-capturable)

Optional: Add error handling for graph capture

```diff
 def _bench_with_cudagraph(
     fn: Callable,
     cache: torch.Tensor,
     n_repeat: int,
     quantiles: list[float] | None,
     return_mode: str,
 ) -> float | list[float]:
     """Benchmark using CUDA graph for minimal launch overhead. ... """
-    with torch.cuda.stream(torch.cuda.Stream()):
-        # Construct a CUDA graph with n_repeat unrolled function calls
-        g = torch.cuda.CUDAGraph()
-        with torch.cuda.graph(g):
-            for _ in range(n_repeat):
-                fn()
+    try:
+        with torch.cuda.stream(torch.cuda.Stream()):
+            # Construct a CUDA graph with n_repeat unrolled function calls
+            g = torch.cuda.CUDAGraph()
+            with torch.cuda.graph(g):
+                for _ in range(n_repeat):
+                    fn()
+    except Exception as e:
+        raise RuntimeError(
+            f"CUDA graph capture failed. The function may contain operations "
+            f"that are not graph-capturable. Consider using 'event' backend. "
+            f"Original error: {e}"
+        ) from e
```
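One way to preserve the per-call cache flush under the cudagraph backend, as noted in the comment above, is to capture the flush itself inside the graph. This is a minimal sketch only, assuming a preallocated L2-sized `cache` buffer like the one the event backend zeroes; note that the flush kernels then contribute to the replay time, so the per-call figure becomes an upper bound unless the flush cost is subtracted.

```python
import torch
from typing import Callable

def capture_with_per_call_flush(fn: Callable[[], None], cache: torch.Tensor,
                                n_repeat: int) -> torch.cuda.CUDAGraph:
    """Sketch: capture n_repeat iterations, zeroing the cache buffer before each call.

    The zeroing kernel is captured too, so every replayed iteration starts with a
    cold L2, mirroring the event backend's per-call flush (flush time is included).
    """
    g = torch.cuda.CUDAGraph()
    with torch.cuda.stream(torch.cuda.Stream()):
        with torch.cuda.graph(g):
            for _ in range(n_repeat):
                cache.zero_()  # overwrite a large buffer to evict L2 contents
                fn()
    return g
```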
examples/gemm/example_gemm_autotune.py (1)

248-248: Consider validating the `--profile_backend` CLI argument.

The CLI argument accepts any string, but only `"event"`, `"cupti"`, and `"cudagraph"` are valid backends. Invalid values will cause a runtime error deep in the profiler.

Add choices validation

```diff
-parser.add_argument("--profile_backend", type=str, default="event", help="Profiler backend")
+parser.add_argument("--profile_backend", type=str, default="event", choices=["event", "cupti", "cudagraph"], help="Profiler backend")
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- examples/gemm/example_gemm_autotune.py
- tilelang/autotuner/param.py
- tilelang/autotuner/tuner.py
- tilelang/profiler/__init__.py
- tilelang/profiler/bench.py
🧰 Additional context used
🧬 Code graph analysis (2)
examples/gemm/example_gemm_autotune.py (2)
- tilelang/profiler/__init__.py (1): do_bench (209-269)
- tilelang/profiler/bench.py (1): do_bench (64-137)

tilelang/autotuner/tuner.py (2)
- tilelang/profiler/__init__.py (1): do_bench (209-269)
- tilelang/profiler/bench.py (1): do_bench (64-137)
🪛 Ruff (0.14.10)
tilelang/profiler/__init__.py
216-216: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
🔇 Additional comments (7)
tilelang/autotuner/param.py (1)

108-108: LGTM! Backend field properly added to ProfileArgs.

The `backend` field is correctly added with appropriate type annotation and default value, and properly included in the hash computation to ensure cache invalidation when the profiling backend changes.

Also applies to: 119-131

tilelang/profiler/__init__.py (1)

242-251: Backend parameter correctly propagated.

The backend parameter is properly threaded through to the underlying `do_bench` call from `tilelang.profiler.bench`.

tilelang/autotuner/tuner.py (2)
454-460: Potential inconsistency in benchmarking parameters between kernel and reference.

The kernel benchmarking at line 454 uses `warmup=warmup, rep=rep` (time-based auto-calculation of iterations), while the reference benchmarking at lines 458-459 uses `n_warmup=warmup, n_repeat=rep` (fixed iteration counts). This means with `warmup=3, rep=20`:

- Kernel: aims for ~3 ms of warmup time and ~20 ms of total benchmark time
- Reference: runs exactly 3 warmup iterations and 20 benchmark iterations

This asymmetry may produce inconsistent timing comparisons between the kernel and its reference implementation.

Was this intentional? If both should use the same benchmarking strategy, consider aligning the parameter usage:

Option: Use consistent time-based approach for both

```diff
 latency = profiler.do_bench(warmup=warmup, rep=rep, input_tensors=self.jit_input_tensors, backend=backend)
 if self.ref_latency_cache is None and ref_prog is not None:
     self.ref_input_tensors = ref_input_tensors_supply()
     self.ref_latency_cache = profiler.do_bench(
-        ref_prog, n_warmup=warmup, n_repeat=rep, input_tensors=self.ref_input_tensors, backend=backend
+        ref_prog, warmup=warmup, rep=rep, input_tensors=self.ref_input_tensors, backend=backend
     )
```
211-212: Backend parameter properly added to set_profile_args.

The backend parameter is correctly added to the method signature and propagated to the `ProfileArgs` construction.

Also applies to: 252-252
tilelang/profiler/bench.py (1)

134-135: Backend dispatch correctly added for cudagraph.

The dispatch logic properly routes to `_bench_with_cudagraph` when `backend == "cudagraph"`.
79-81: Verify:block_Ndepends onMinstead ofN.Both
block_Mandblock_Nuse the conditionM > 32. While this might be intentional (using M as a proxy for "small matrix"), it could also be an oversight whereblock_Nshould depend onN:block_M = [64, 128, 256] if M > 32 else [16, 32] block_N = [64, 128, 256] if M > 32 else [16, 32] # Should this check N instead?If this is intentional (e.g., for small square-ish matrices), consider adding a comment to clarify the reasoning.
110-162: Backend parameter threading looks correct throughout the example.The
profile_backendparameter is properly threaded throughget_best_config,main, and bothdo_benchcalls for consistent profiling behavior.Also applies to: 211-226
```diff
 n_warmup: int = 0,
 n_repeat: int = 0,
 input_tensors: list[torch.Tensor] = None,
-backend: Literal["event", "cupti"] = "event",
+backend: Literal["event", "cupti", "cudagraph"] = "event",
```
Type annotation issue: implicit Optional for input_tensors.
As flagged by static analysis, the input_tensors parameter has None as default but lacks Optional in the type hint. PEP 484 prohibits implicit Optional.
🔧 Suggested fix
```diff
 def do_bench(
     self,
     func: Callable | None = None,
     warmup: int = 25,
     rep: int = 100,
     n_warmup: int = 0,
     n_repeat: int = 0,
-    input_tensors: list[torch.Tensor] = None,
+    input_tensors: list[torch.Tensor] | None = None,
     backend: Literal["event", "cupti", "cudagraph"] = "event",
```

📝 Committable suggestion
```python
    n_warmup: int = 0,
    n_repeat: int = 0,
    input_tensors: list[torch.Tensor] | None = None,
    backend: Literal["event", "cupti", "cudagraph"] = "event",
```
🤖 Prompt for AI Agents
In @tilelang/profiler/__init__.py around lines 214 - 217, The parameter
`input_tensors` in the profiler signature is annotated as `list[torch.Tensor] =
None` which implicitly uses None without Optional; change the annotation to
explicitly allow None (e.g. `Optional[list[torch.Tensor]] = None` or
`list[torch.Tensor] | None = None`) and add the necessary import (`from typing
import Optional`) if using Optional; update the `__init__` (or the function)
signature where `input_tensors` is declared to use the explicit optional type.
LeiWang1999 left a comment:
Thanks for your contribution! Left some comments :)
```diff
 rep: int = 100,
-n_warmup: int = 1,
-n_repeat: int = 1,
+n_warmup: int = 0,
+n_repeat: int = 0,
```
Why do we need to change the default value from 1 to 0?
For the "cudagraph" backend, n_repeat = 0 avoids measuring a single-iteration graph where launch overhead is still present.
We also align n_repeat / n_warmup with bench.py::do_bench so the default behavior is consistent.
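For context, when the explicit counts default to 0, the iteration counts can instead be derived from the `warmup`/`rep` millisecond budgets. Below is a minimal sketch of that time-based fallback, following the `triton.testing.do_bench` convention; the helper name is illustrative and the exact tilelang logic may differ.

```python
import torch
from typing import Callable

def derive_iteration_counts(fn: Callable[[], None], warmup_ms: float = 25, rep_ms: float = 100,
                            n_warmup: int = 0, n_repeat: int = 0) -> tuple[int, int]:
    """If explicit counts are 0, size them so total runtime roughly fills the ms budgets."""
    if n_warmup > 0 and n_repeat > 0:
        return n_warmup, n_repeat
    # Estimate per-call latency from a handful of timed calls.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(5):
        fn()
    end.record()
    torch.cuda.synchronize()
    estimate_ms = max(start.elapsed_time(end) / 5, 1e-6)
    # Fill the warmup/rep time budgets with whole iterations (at least one each).
    return (n_warmup or max(1, int(warmup_ms / estimate_ms)),
            n_repeat or max(1, int(rep_ms / estimate_ms)))
```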
```python
    return kernel_time_us * 1e-3  # Convert microseconds to milliseconds


def _bench_with_cudagraph(
```
I just want to better understand the advantages of using CUDA Graphs for benchmarking. While I understand it likely reduces CPU overhead and kernel launch times, could we not simply use the CUPTI profiler if we only need to measure the kernel execution time?
You’re right that, functionally, CUPTI could be used to measure kernel execution time here. The main reason we prefer CUDA Graphs is that they better reflect inference-style execution, where the same kernels are executed repeatedly in a steady-state loop rather than as isolated launches.
This distinction can also matter for autotuning, since different measurement contexts may bias the tuning process differently. While we haven’t observed divergent optimal configurations in our current tests, using CUDA Graphs helps ensure that the benchmarking setup is aligned with the intended inference execution pattern.
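To make the measurement pattern concrete, here is a minimal sketch of the capture-and-replay approach the cudagraph backend follows (modeled on the `triton.testing.do_bench_cudagraph` pattern; the function name and structure are illustrative, not the exact tilelang implementation):

```python
import torch
from typing import Callable

def bench_with_cudagraph_sketch(fn: Callable[[], None], n_repeat: int = 100) -> float:
    """Capture n_repeat calls once, replay them as a single graph, and time the replay."""
    # Run once outside the graph so lazy initialization and allocations are not captured.
    fn()
    torch.cuda.synchronize()

    # Capture n_repeat unrolled calls into one CUDA graph on a side stream.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.stream(torch.cuda.Stream()):
        with torch.cuda.graph(g):
            for _ in range(n_repeat):
                fn()

    # Replay the whole graph and time it with CUDA events; per-call latency is the
    # total divided by n_repeat, with launch overhead amortized across the graph.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    g.replay()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / n_repeat  # milliseconds per call
```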
This PR is related to #1633; it adds a cudagraph backend to the profiler and allows the AutoTuner to specify the profiling backend.
Changes
1. Profiler CUDA Graph Backend
- tilelang/profiler/bench.py:
  - Added a `_bench_with_cudagraph()` function implementing CUDA Graph-based benchmarking
  - Extended `do_bench()` to support a `"cudagraph"` backend option
  - Follows the `triton.testing.do_bench_cudagraph` pattern
- tilelang/profiler/__init__.py:
  - Extended the `backend` parameter to include the `"cudagraph"` option
  - Changed the defaults of `n_warmup` and `n_repeat` from 1 to 0
tilelang/autotuner/param.py:backendparameter toProfileArgsclass with typeLiteral["event", "cupti", "cudagraph"]tilelang/autotuner/tuner.py:backendparameter toset_profile_args()method_profile()to use specified backend for both kernel and reference program benchmarking3. Example Updates
- examples/gemm/example_gemm_autotune.py: Added a `profile_backend` parameter to `get_best_config()` and `main()`, plus a `--profile_backend` CLI argument

Usage
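The backend can be selected on the example script via `--profile_backend cudagraph`, or directly through the profiler API. A minimal sketch, where `profiler` and `kernel_fn` are illustrative placeholders for a tilelang profiler instance and a compiled kernel callable:

```python
def profile_with_cudagraph(profiler, kernel_fn):
    """Benchmark `kernel_fn` with the CUDA graph backend (objects are placeholders)."""
    # Equivalent CLI usage for the example:
    #   python examples/gemm/example_gemm_autotune.py --profile_backend cudagraph
    return profiler.do_bench(kernel_fn, backend="cudagraph")
```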
Bench Results