Background
TraceLens's core philosophy (as outlined in trace2tree_motivation.md) is to analyze kernels through the lens of top-level CPU operation names. In some cases, however, we need kernel-name analysis: CPU op information may be unavailable, not yet implemented for certain frameworks, or kernel-level categorization may simply provide additional insight.
For example:
- Graph Launch Operations: `cudaGraphLaunch`/`hipGraphLaunch` may contain heterogeneous kernels (GEMM, attention, elementwise ops, etc.)
- JAX/XLA Analysis: kernel-level categorization helps distinguish operation types
Existing Implementation
TraceLens already has kernel name categorization for JAX analysis in util.py#L333:
```python
class JaxOpKeys:
    # keywords for splitting jax events
    GemmKeys = ["Cijk", "gemm", "nvjet", "cublasLt"]
    FABwdKeys = ["FmhaBwd", "flash_bprop", "ck_fused_attn::dk_dv_reduce_thd", "fmha_bwd"]
    FAFwdKeys = ["FmhaFwd", "flash_fprop", "fmha_fwd"]
    ConvKeys = ["FillBuffer", "conv_", "conv.", "conv-"]
    CommunicationKeys = ["rccl", "nccl"]
    # ... more categories
```

This categorization is currently specific to JAX workflows and not exposed as a general feature.
Proposal
To maintain TraceLens's charter as a one-stop solution for trace analysis, we should provide optional kernel name categorization in the GPU event analysis pipeline, specifically in TreePerfAnalyzer's kernel summary tables. This would add a kernel_category column based on built-in categorization patterns, enabling better grouping and analysis.
Proposed API
```python
# Simple opt-in via boolean flag
perf_analyzer = TreePerfAnalyzer(
    trace_file,
    use_kernel_name_categories=True,  # optional, defaults to False
)

# Also available in reporting
generate_perf_report_pytorch(
    trace_file,
    use_kernel_name_categories=True,
)
```

Example: Before and After
Before (current behavior)
Kernel summary table for cudaGraphLaunch operations:
| cpu_op | kernel_name | count | total_time_ms | avg_time_us |
|---|---|---|---|---|
| graph | kernel_mha | 1 | 0.15 | 150 |
| graph | void tensorrt_llm::common::scaleMatrix<...> | 1 | 0.08 | 80 |
| graph | nvjet_tst_320x128_64x3_1x2_h_bz_coopB_TNT | 1 | 0.22 | 220 |
Issue: All kernels are grouped under the generic "graph" CPU op name, making it difficult to see which operation types dominate.
After (with kernel categorization enabled)
Kernel summary table with kernel_category column:
| cpu_op | kernel_category | kernel_name | count | total_time_ms | avg_time_us |
|---|---|---|---|---|---|
| graph | Flash Attention Forward | kernel_mha | 1 | 0.15 | 150 |
| graph | Memory Ops | void tensorrt_llm::common::scaleMatrix<...> | 1 | 0.08 | 80 |
| graph | GEMM | nvjet_tst_320x128_64x3_1x2_h_bz_coopB_TNT | 1 | 0.22 | 220 |
Grouped by kernel_category:
| cpu_op | kernel_category | count | total_time_ms | % of graph time |
|---|---|---|---|---|
| graph | GEMM | 1 | 0.22 | 48.9% |
| graph | Flash Attention Forward | 1 | 0.15 | 33.3% |
| graph | Memory Ops | 1 | 0.08 | 17.8% |
Benefit: Immediately see that GEMM operations dominate the graph launch, even though all have the same CPU op parent.
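To make the aggregation concrete, here is a minimal pandas sketch of how the grouped view could be computed from the per-kernel summary. The DataFrame columns mirror the tables above and are not a committed TraceLens schema.

```python
import pandas as pd

# Per-kernel summary rows, as in the "After" table above (illustrative data).
df = pd.DataFrame({
    "cpu_op": ["graph"] * 3,
    "kernel_category": ["Flash Attention Forward", "Memory Ops", "GEMM"],
    "total_time_ms": [0.15, 0.08, 0.22],
})

# Aggregate per (cpu_op, kernel_category) and compute share of total time.
grouped = (
    df.groupby(["cpu_op", "kernel_category"], as_index=False)
      .agg(count=("kernel_category", "size"),
           total_time_ms=("total_time_ms", "sum"))
)
grouped["% of graph time"] = (
    100 * grouped["total_time_ms"] / grouped["total_time_ms"].sum()
).round(1)
print(grouped.sort_values("total_time_ms", ascending=False))
```

Running this reproduces the grouped table: GEMM at 48.9%, Flash Attention Forward at 33.3%, Memory Ops at 17.8%.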
Implementation Approach
Incremental implementation:

1. This issue: add kernel name categorization directly in the GPU event analyzer, then use it in TreePerfAnalyzer and the generated perf report.
2. Future PRs:
   - Refactor the categorization logic from the JAX-specific `JaxOpKeys` (in `util.py`) to use the new general categorizer.
   - Let other analysis modules leverage the same categorizer as needed.
   - Add an extension point so users can customize categories further (see the sketch after this list).
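One possible shape for that user-facing extension point, purely as an assumption: the `extra_kernel_categories` parameter below does not exist today, and both its name and its merge semantics are hypothetical.

```python
# Hypothetical extension: let users merge their own keyword patterns into
# the built-in categories (parameter name is assumed, not existing API).
perf_analyzer = TreePerfAnalyzer(
    trace_file,
    use_kernel_name_categories=True,
    extra_kernel_categories={
        # User-supplied category -> keyword list, matched like the built-ins.
        "Quantization": ["scaleMatrix", "dequant"],
    },
)
```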