Background
TraceLens's core philosophy (as outlined in trace2tree_motivation.md) is to analyze kernels through the lens of top-level CPU operation names. In some cases, however, we need kernel-name analysis: CPU op information may be unavailable, not yet implemented for certain frameworks, or kernel-level categorization may simply provide additional insight.
For example:
- Graph Launch Operations: `cudaGraphLaunch`/`hipGraphLaunch` may contain heterogeneous kernels (GEMM, attention, elementwise ops, etc.)
- JAX/XLA Analysis: kernel-level categorization helps distinguish operation types
Existing Implementation
TraceLens already has kernel name categorization for JAX analysis in util.py#L333:
```python
class JaxOpKeys:
    # keywords for splitting jax events
    GemmKeys = ["Cijk", "gemm", "nvjet", "cublasLt"]
    FABwdKeys = ["FmhaBwd", "flash_bprop", "ck_fused_attn::dk_dv_reduce_thd", "fmha_bwd"]
    FAFwdKeys = ["FmhaFwd", "flash_fprop", "fmha_fwd"]
    ConvKeys = ["FillBuffer", "conv_", "conv.", "conv-"]
    CommunicationKeys = ["rccl", "nccl"]
    # ... more categories
```

This categorization is currently specific to JAX workflows and not exposed as a general feature.
Proposal
To maintain TraceLens's charter as a one-stop solution for trace analysis, we should provide optional kernel name categorization in the GPU event analysis pipeline, specifically in TreePerfAnalyzer's kernel summary tables. This would add a kernel_category column based on built-in categorization patterns, enabling better grouping and analysis.
Proposed API
```python
# Simple opt-in via boolean flag
perf_analyzer = TreePerfAnalyzer(
    trace_file,
    use_kernel_name_categories=True,  # optional, defaults to False
)

# Also available in reporting
generate_perf_report_pytorch(
    trace_file,
    use_kernel_name_categories=True,
)
```

Example: Before and After
Before (current behavior)
Kernel summary table for cudaGraphLaunch operations:
| cpu_op | kernel_name | count | total_time_ms | avg_time_us |
|---|---|---|---|---|
| graph | kernel_mha | 1 | 0.15 | 150 |
| graph | void tensorrt_llm::common::scaleMatrix<...> | 1 | 0.08 | 80 |
| graph | nvjet_tst_320x128_64x3_1x2_h_bz_coopB_TNT | 1 | 0.22 | 220 |
Issue: All kernels are grouped under the generic "graph" CPU op name, making it difficult to see which operation types dominate.
After (with kernel categorization enabled)
Kernel summary table with kernel_category column:
| cpu_op | kernel_category | kernel_name | count | total_time_ms | avg_time_us |
|---|---|---|---|---|---|
| graph | Flash Attention Forward | kernel_mha | 1 | 0.15 | 150 |
| graph | Memory Ops | void tensorrt_llm::common::scaleMatrix<...> | 1 | 0.08 | 80 |
| graph | GEMM | nvjet_tst_320x128_64x3_1x2_h_bz_coopB_TNT | 1 | 0.22 | 220 |
Grouped by kernel_category:
| cpu_op | kernel_category | count | total_time_ms | % of graph time |
|---|---|---|---|---|
| graph | GEMM | 1 | 0.22 | 48.9% |
| graph | Flash Attention Forward | 1 | 0.15 | 33.3% |
| graph | Memory Ops | 1 | 0.08 | 17.8% |
Benefit: Immediately see that GEMM operations dominate the graph launch, even though all have the same CPU op parent.
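To make the aggregation concrete, here is a minimal pandas sketch of how the grouped view could be computed from the per-kernel summary. The DataFrame columns mirror the tables above and are not a committed TraceLens schema.

```python
import pandas as pd

# Per-kernel summary rows, as in the "After" table above (illustrative data).
df = pd.DataFrame({
    "cpu_op": ["graph"] * 3,
    "kernel_category": ["Flash Attention Forward", "Memory Ops", "GEMM"],
    "total_time_ms": [0.15, 0.08, 0.22],
})

# Aggregate per (cpu_op, kernel_category) and compute share of total time.
grouped = (
    df.groupby(["cpu_op", "kernel_category"], as_index=False)
      .agg(count=("kernel_category", "size"),
           total_time_ms=("total_time_ms", "sum"))
)
grouped["% of graph time"] = (
    100 * grouped["total_time_ms"] / grouped["total_time_ms"].sum()
).round(1)
print(grouped.sort_values("total_time_ms", ascending=False))
```

Running this reproduces the grouped table: GEMM at 48.9%, Flash Attention Forward at 33.3%, Memory Ops at 17.8%.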
Implementation Approach
Incremental implementation:

1. This issue: add kernel name categorization directly in the GPU event analyzer, then use it in TreePerfAnalyzer and the generated perf report.
2. Future PRs:
   - Refactor the categorization logic from the JAX-specific `JaxOpKeys` (in `util.py`) to use the new general categorizer.
   - Let other analysis modules leverage the same categorizer as needed.
   - Add an extension point so users can customize categories further (see the sketch after this list).
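One possible shape for that user-facing extension point, purely as an assumption: the `extra_kernel_categories` parameter below does not exist today, and both its name and its merge semantics are hypothetical.

```python
# Hypothetical extension: let users merge their own keyword patterns into
# the built-in categories (parameter name is assumed, not existing API).
perf_analyzer = TreePerfAnalyzer(
    trace_file,
    use_kernel_name_categories=True,
    extra_kernel_categories={
        # User-supplied category -> keyword list, matched like the built-ins.
        "Quantization": ["scaleMatrix", "dequant"],
    },
)
```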