
Implement Optional Kernel Name Categorizer in GPU Event Analysis #428

@ajassani

Description


Background

TraceLens's core philosophy (as outlined in trace2tree_motivation.md) is to analyze kernels through the lens of top-level CPU operation names. In some cases, however, we need kernel-name analysis: CPU op information may be unavailable, may not yet be implemented for certain frameworks, or kernel-level categorization may simply provide additional insight.

For example:

  • Graph Launch Operations: cudaGraphLaunch/hipGraphLaunch may contain heterogeneous kernels (GEMM, attention, elementwise ops, etc.)
  • JAX/XLA Analysis: Kernel-level categorization helps distinguish operation types

Existing Implementation

TraceLens already has kernel name categorization for JAX analysis in util.py#L333:

class JaxOpKeys:
    # keywords for splitting jax events
    GemmKeys = ["Cijk", "gemm", "nvjet", "cublasLt"]
    FABwdKeys = ["FmhaBwd", "flash_bprop", "ck_fused_attn::dk_dv_reduce_thd", "fmha_bwd"]
    FAFwdKeys = ["FmhaFwd", "flash_fprop", "fmha_fwd"]
    ConvKeys = ["FillBuffer", "conv_", "conv.", "conv-"]
    CommunicationKeys = ["rccl", "nccl"]
    # ... more categories

This categorization is currently specific to JAX workflows and not exposed as a general feature.
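A general-purpose version of this categorization could be a simple first-substring-match over keyword lists like those in `JaxOpKeys`. The sketch below is illustrative only: the function name `categorize_kernel_name`, the pattern dictionary, and the category labels are hypothetical, not existing TraceLens API.

```python
# Hypothetical sketch of a general kernel-name categorizer built on
# substring matching, mirroring the JaxOpKeys keyword lists above.
KERNEL_CATEGORY_PATTERNS = {
    "GEMM": ["Cijk", "gemm", "nvjet", "cublasLt"],
    "Flash Attention Backward": ["FmhaBwd", "flash_bprop", "fmha_bwd"],
    "Flash Attention Forward": ["FmhaFwd", "flash_fprop", "fmha_fwd"],
    "Communication": ["rccl", "nccl"],
    # ... more categories
}

def categorize_kernel_name(kernel_name: str, default: str = "Other") -> str:
    """Return the first category whose keyword appears in the kernel name."""
    for category, keywords in KERNEL_CATEGORY_PATTERNS.items():
        if any(key in kernel_name for key in keywords):
            return category
    return default
```

Note that category order matters: backward-attention keys are checked before forward ones so that names containing both `fmha` and `bwd` are not misclassified.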

Proposal

To maintain TraceLens's charter as a one-stop solution for trace analysis, we should provide optional kernel name categorization in the GPU event analysis pipeline, specifically in TreePerfAnalyzer's kernel summary tables. This would add a kernel_category column based on built-in categorization patterns, enabling better grouping and analysis.

Proposed API

# Simple opt-in via boolean flag
perf_analyzer = TreePerfAnalyzer(
    trace_file,
    use_kernel_name_categories=True  # optional, defaults to False
)

# Also available in reporting
generate_perf_report_pytorch(
    trace_file,
    use_kernel_name_categories=True,
)

Example: Before and After

Before (current behavior)

Kernel summary table for cudaGraphLaunch operations:

| cpu_op | kernel_name | count | total_time_ms | avg_time_us |
|--------|-------------|-------|---------------|-------------|
| graph | kernel_mha | 1 | 0.15 | 150 |
| graph | void tensorrt_llm::common::scaleMatrix<...> | 1 | 0.08 | 80 |
| graph | nvjet_tst_320x128_64x3_1x2_h_bz_coopB_TNT | 1 | 0.22 | 220 |

Issue: All kernels are grouped under the generic "graph" CPU op name, making it difficult to see what types of operations dominate.

After (with kernel categorization enabled)

Kernel summary table with kernel_category column:

| cpu_op | kernel_category | kernel_name | count | total_time_ms | avg_time_us |
|--------|-----------------|-------------|-------|---------------|-------------|
| graph | Flash Attention Forward | kernel_mha | 1 | 0.15 | 150 |
| graph | Memory Ops | void tensorrt_llm::common::scaleMatrix<...> | 1 | 0.08 | 80 |
| graph | GEMM | nvjet_tst_320x128_64x3_1x2_h_bz_coopB_TNT | 1 | 0.22 | 220 |

Grouped by kernel_category:

| cpu_op | kernel_category | count | total_time_ms | % of graph time |
|--------|-----------------|-------|---------------|-----------------|
| graph | GEMM | 1 | 0.22 | 48.9% |
| graph | Flash Attention Forward | 1 | 0.15 | 33.3% |
| graph | Memory Ops | 1 | 0.08 | 17.8% |

Benefit: It is immediately clear that GEMM operations dominate the graph launch, even though all kernels share the same CPU op parent.
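The grouped view above can be reproduced with a short aggregation. This is a minimal pure-Python sketch using the example numbers from the tables, not TraceLens code:

```python
from collections import defaultdict

# Rows taken from the example kernel summary table above, not a real trace:
# (cpu_op, kernel_category, total_time_ms)
rows = [
    ("graph", "Flash Attention Forward", 0.15),
    ("graph", "Memory Ops", 0.08),
    ("graph", "GEMM", 0.22),
]

# Sum kernel time per (cpu_op, category) pair.
totals = defaultdict(float)
for cpu_op, category, time_ms in rows:
    totals[(cpu_op, category)] += time_ms

# Express each category as a share of the total graph launch time.
graph_total = sum(totals.values())
for (cpu_op, category), time_ms in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{cpu_op}  {category:<24} {time_ms:.2f}  {100 * time_ms / graph_total:.1f}%")
```

Running this reproduces the percentages in the grouped table (GEMM 48.9%, Flash Attention Forward 33.3%, Memory Ops 17.8%).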

Implementation Approach

Incremental implementation:

  1. This Issue: Add kernel name categorization directly in the GPU event analyzer, then use it in TreePerfAnalyzer and generate_perf_report_pytorch.

  2. Future PRs:

    • Refactor the JAX-specific categorization logic in JaxOpKeys (in util.py) to use the new general categorizer
    • Let other analysis modules leverage the same categorizer as needed
    • Add an extension point so users can customize categories further
