1. Remove legacy functions (multi-backend sessions)
2. Refactor the tool comparison into a table
3. Add new features, including NVTX compatibility and control knobs
4. Other minor fixes
`third_party/proton/README.md`: 44 additions & 46 deletions
@@ -2,7 +2,7 @@

## Introduction

-Proton is a lightweight profiler for Triton, designed to be used for code written in Python and to invoke underlying GPU kernels. Proton provides insightful information about the program context, metadata, and hardware performance metrics of the GPU kernels invoked.
+Proton is a lightweight profiler for Triton that captures rich information about program context, metadata, and GPU kernel performance metrics, while keeping both runtime overhead and profile size minimal.

## Installation
@@ -85,6 +85,12 @@ with proton.scope("test2", {"bytes": 3000}):
    foo[1,](x, y)
```

+#### NVTX compatibility
+
+Proton scopes coexist with NVTX ranges.
+NVTX pushes and pops (for example, `torch.cuda.nvtx.range_push`) appear as nested scopes in the Proton profile, letting you correlate custom NVTX annotations with Proton's aggregated metrics.
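To make this concrete, here is a hedged sketch of mixing the two (assumes `triton` with Proton support and `torch` with CUDA; the profile name and scope labels are invented). The function is only defined here, not executed:

```python
def profile_with_nvtx():
    """Illustrative sketch: NVTX ranges nested inside Proton scopes.

    Requires triton (with Proton) and torch with CUDA to actually run.
    """
    import torch
    import triton.profiler as proton

    proton.start("nvtx_demo")  # hypothetical profile name
    with proton.scope("forward"):
        torch.cuda.nvtx.range_push("attention")  # appears as a nested scope
        # ... launch GPU kernels here ...
        torch.cuda.nvtx.range_pop()
    proton.finalize()
```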
### Backend and mode
Proton supports three profiling backends: `cupti`, `roctracer`, and `instrumentation`.
@@ -95,7 +101,7 @@ Proton supports three profiling backends: `cupti`, `roctracer`, and `instrumenta
By default, Proton automatically selects either `cupti` or `roctracer` as the backend based on your GPU driver. The `instrumentation` backend offers a wide range of mode options for fine-grained profiling, as detailed in the `mode.py` file.
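As a hedged sketch of overriding the automatic selection (the profile name is invented; assumes `triton` with Proton support; the function is only defined, not run):

```python
def start_with_explicit_backend():
    """Illustrative sketch: override backend auto-detection in proton.start.

    Requires triton with Proton support to actually run.
    """
    import triton.profiler as proton

    # Force CUPTI instead of letting Proton pick cupti/roctracer from the driver.
    proton.start("my_profile", backend="cupti")
    # ... run the workload to be profiled ...
    proton.finalize()
```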
-#### Instruction Sampling
+#### Instruction sampling

Proton supports instruction sampling on NVIDIA GPUs.
You may experience ~20x end-to-end overhead when using instruction sampling, although the overhead for each individual GPU kernel is negligible.
@@ -106,7 +112,7 @@ The following example demonstrates how to use instruction sampling:
+    mode=pmode.Default(granularity="warp_2")  # collect metrics from every 2 warps
)
```

**Kernel-side usage:**
@@ -132,6 +148,7 @@ proton.start(
Instrumenting kernels written in the Triton DSL is disabled by default because Triton's higher-level IR undergoes
aggressive compiler rewrites (loop pipelining, instruction re-ordering, IR duplication, etc.).
These transformations can invalidate naïve instrumentation and lead to misleading results.
+To enable instrumentation for Triton DSL, call `pl.enable_semantic("triton")` before `proton.start`.

```python
from triton.experimental import gluon
@@ -152,21 +169,6 @@ def kernel(...):

Advanced users can instrument either the `ttir` or `ttgir` intermediate representations for even finer-grained measurement. The relevant IR instructions are `proton.record start` and `proton.record end`. This can be combined with the environment variable `TRITON_KERNEL_OVERRIDE=1` for custom kernel overrides. For detailed steps, refer to the Triton [documentation](https://github.com/triton-lang/triton?tab=readme-ov-file#tips-for-hacking) under the **Kernel Override Steps** section. We have also assembled a [tutorial](tutorials/ttgir_override) that demonstrates how to use the IR-based instrumentation.
-#### Merging profiles for postmortem analysis
-
-We could use concurrent sessions to profile the same code region using different backends, and then merge the profiles using hatchet for postmortem analysis. In the following example, the `cupti` backend obtains different metrics than the `instrumentation` backend, and thus it makes sense to merge them using `GraphFrame.add` directly. Otherwise, if there are duplicate metrics, we could customize the `merge` logic or manipulate the dataframes.
-When profiling in the command line mode, the `proton.start` and `proton.finalize` functions are automatically called before and after the script execution. Any `proton.start` and `proton.finalize` functions in the script are ignored. Also, in the command line mode, only a single *session* is supported. Therefore, `proton.deactivate(session_id=1)` is invalid, while `proton.deactivate(session_id=0)` is valid.
+When profiling in the command line mode, the `proton.start` and `proton.finalize` functions are automatically called before and after the script execution. Any `proton.start` and `proton.finalize` functions in the script are ignored. Also, in the command line mode, only a single *session* is supported.
+Therefore, `proton.deactivate(session_id=1)` is invalid, while `proton.deactivate(session_id=0)` is valid.
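A hedged sketch of pausing and resuming that single session (assumes the script is launched through Proton's command-line mode, and that `activate` mirrors `deactivate`; the function is only defined, not run):

```python
def pause_profiling_around_setup():
    """Illustrative sketch: command-line mode exposes only session 0.

    Meant to run under Proton's command-line mode; not runnable standalone.
    """
    import triton.profiler as proton

    proton.deactivate(session_id=0)  # valid: pause the single CLI session
    # ... warm-up or setup work that should not be profiled ...
    proton.activate(session_id=0)    # resume profiling
```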
The dumped trace is in the Chrome trace format and can be visualized with the `chrome://tracing` tool in Chrome or with the [perfetto](https://perfetto.dev) tool.
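For orientation, a Chrome trace is plain JSON in the Trace Event format; a tiny hand-built sketch (kernel names and timestamps invented) shows the shape such viewers accept:

```python
import json

# Minimal Chrome-trace JSON: a list of "complete" (ph="X") events with
# microsecond timestamps and durations, grouped by pid/tid.
events = [
    {"name": "kernel_a", "ph": "X", "ts": 0, "dur": 120, "pid": 0, "tid": 0},
    {"name": "kernel_b", "ph": "X", "ts": 130, "dur": 80, "pid": 0, "tid": 0},
]
with open("toy_trace.json", "w") as f:
    json.dump({"traceEvents": events}, f)
# Loading toy_trace.json in chrome://tracing or Perfetto shows two slices.
```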
### Visualizing sorted profile data

In addition to visualizing the profile data in the terminal through Hatchet, you can print the kernels sorted by the first listed metric using the `--print-sorted` flag of `proton-viewer`. For example, when `time/ns` is the first metric listed, the kernels are sorted by time in nanoseconds.
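For example (the profile file name is invented, and the `-m` metric-selection flag is an assumption to verify against `proton-viewer -h`):

```bash
# Sort kernels by time in nanoseconds: time/ns is the first metric listed.
proton-viewer -m time/ns,bytes --print-sorted my_profile.hatchet
```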
More options can be found by running the following command.

```bash
proton-viewer -h
```
+## Knobs
+
+Triton's runtime has a centralized configuration system called *knobs* that controls various features and behaviors. The following knobs are defined for Proton:
+
+- `triton.knobs.proton.enable_nvtx` or `TRITON_ENABLE_NVTX` (default: `True`): whether to enable NVTX ranges in Proton.
+- `triton.knobs.proton.cupti_lib_dir` or `TRITON_CUPTI_LIB_DIR` (default: `<triton_root>/backends/nvidia/lib/cupti`): the directory of the CUPTI library.
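A small runnable sketch of the two routes for setting a knob; the environment-variable route must happen before `triton` is imported, and treating `"0"` as false follows the usual boolean-env convention (an assumption worth verifying):

```python
import os

# Route 1: environment variable, set before importing triton.
os.environ["TRITON_ENABLE_NVTX"] = "0"  # assumption: "0" disables NVTX ranges

# Route 2: the knobs attribute after import (requires triton installed):
# import triton
# triton.knobs.proton.enable_nvtx = False

print(os.environ["TRITON_ENABLE_NVTX"])
```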
## Advanced features and knowledge

### Thread management
@@ -325,35 +333,25 @@ with proton.scope("test"):

The call path of `foo1` will be `test->test1->state0`.
-## Proton *vs* nsys
-
-- Runtime overhead (up to 1.5x)
-
-Proton has a lower profiling overhead than nsys. Even for workloads with a large number of small GPU kernels, proton incurs less than ~1.5x overhead.
-
-For GPU-bound workloads, both proton and nsys have similar overhead, with little impact on the workload.
-
-The lower overhead of proton is due to its fewer profiling metrics and callbacks compared to nsys.
-
-- Profile size (significantly smaller than nsys)
-
-nsys traces and records every GPU kernel, while proton aggregates the metrics of GPU kernels under the same calling context.
-
-As a result, proton's profile size can be up to thousands of times smaller than nsys's profile size, depending on the running time.
-
-Proton is designed to be portable and can be used on AMD GPUs. nsys only supports NVIDIA GPUs.
-
-- Insights (more insightful than nsys on triton kernels)
-
-Proton can register hooks to analyze the metadata of triton kernels, while nsys cannot. **Note** that the hooks do add additional overhead to proton.
-
-## Proton *vs* ncu
-
-Similar to the comparison between Proton and Nsight Systems (Nsys), Proton has a lower profiling overhead than Nsight Compute (NCU). We also plan to support instruction sampling on AMD GPUs.
-However, Nsight Compute supports the collection of more detailed metrics than Proton, such as memory access patterns, memory transactions, and other instruction-level metrics.
-In contrast, Proton only supports instruction sampling and is designed to be lightweight and portable.
+**Runtime overhead.** Proton typically keeps slowdown below roughly 1.5×, even for workloads with many short-lived kernels, because it collects fewer metrics and registers fewer callbacks. Nsight Systems and Nsight Compute both impose higher overhead, though they behave similarly to Proton on purely GPU-bound workloads.
+
+**Profile size.** Proton aggregates kernels that share a calling context, so profile files stay compact, sometimes thousands of times smaller than Nsight traces. Both Nsight tools record each GPU kernel individually, which grows traces quickly during long runs.
+
+**Portability.** Proton already runs on AMD and NVIDIA GPUs and has a roadmap to extend instruction sampling to AMD hardware. Nsight Systems and Nsight Compute target NVIDIA GPUs exclusively.
+
+**Triton insights.** Proton can register Triton-specific hooks that surface kernel metadata for richer analysis, at the cost of a small extra overhead. Neither Nsight tool offers comparable Triton integration.
+
+**Metric depth.** Proton emphasizes lightweight metrics and instruction sampling for portability and fast iteration. Nsight Systems focuses on timeline-oriented metrics for NVIDIA GPUs, while Nsight Compute dives deeper into instruction-level details such as memory transactions and access patterns.