
Commit 0532264 (1 parent: 2263431)

[PROTON] Update README (#8319)

1. Take out legacy functions (multibackend sessions)
2. Refactor tool comparison into a table
3. Add new features including nvtx and control knobs
4. Other minor fixes

File tree: 1 file changed (+44 −46 lines)


third_party/proton/README.md (44 additions, 46 deletions)
@@ -2,7 +2,7 @@
 ## Introduction

-Proton is a lightweight profiler for Triton, designed to be used for code written in Python and to invoke underlying GPU kernels. Proton provides insightful information about the program context, metadata, and hardware performance metrics of the GPU kernels invoked.
+Proton is a lightweight profiler for Triton that captures rich information about program context, metadata, and GPU kernel performance metrics, while keeping both runtime overhead and profile size minimal.

 ## Installation

@@ -85,6 +85,12 @@ with proton.scope("test2", {"bytes": 3000}):
     foo[1,](x, y)
 ```

+#### NVTX compatibility
+
+Proton scopes coexist with NVTX ranges.
+NVTX pushes and pops (for example, `torch.cuda.nvtx.range_push`) appear as nested scopes in the Proton profile, letting you correlate custom NVTX annotations with Proton's aggregated metrics.
+
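A minimal sketch of how the two kinds of annotations can nest. This is an illustrative example, not part of the commit: it assumes a CUDA build of PyTorch and a GPU, and the scope and range names are made up.

```python
import torch
import triton.profiler as proton

proton.start(name="nvtx_demo", context="shadow")
with proton.scope("outer"):                   # Proton scope
    torch.cuda.nvtx.range_push("inner_nvtx")  # NVTX range, nested under "outer"
    # ... launch Triton/PyTorch GPU work here ...
    torch.cuda.nvtx.range_pop()
proton.finalize()
```

In the resulting profile, metrics for work issued inside the NVTX range would be aggregated under the `outer -> inner_nvtx` call path.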
 ### Backend and mode

 Proton supports three profiling backends: `cupti`, `roctracer`, and `instrumentation`.
@@ -95,7 +101,7 @@ Proton supports three profiling backends: `cupti`, `roctracer`, and `instrumenta
 By default, Proton automatically selects either `cupti` or `roctracer` as the backend based on your GPU driver. The `instrumentation` backend offers a wide range of mode options for fine-grained profiling, as detailed in the `mode.py` file.

-#### Instruction Sampling
+#### Instruction sampling

 Proton supports instruction sampling on NVIDIA GPUs.
 You may experience ~20x end-to-end overhead when using instruction sampling, although the overhead for each individual GPU kernel is negligible.
@@ -106,7 +112,7 @@ The following example demonstrates how to use instruction sampling:
 ```python
 import triton.profiler as proton

-proton.start(name="profile_name", context="shadow", backend="cupti_pcsampling")
+proton.start(name="profile_name", context="shadow", backend="cupti", mode="pcsampling")
 ```

 #### Instrumentation
@@ -124,6 +130,16 @@ proton.start(
     backend="instrumentation",
     mode="<mode0>=<option0>:<mode1>=<option1>:..."
 )
+
+# or
+
+import triton.profiler.mode as pmode
+
+proton.start(
+    name="profile_name",
+    backend="instrumentation",
+    mode=pmode.Default(granularity="warp_2")  # collect metrics from every 2 warps
+)
 ```

 **Kernel-side usage:**
@@ -132,6 +148,7 @@ proton.start(
 Instrumenting kernels written in Triton DSL is disabled because Triton's higher-level IR undergoes
 aggressive compiler rewrites (loop pipelining, instruction re-ordering, IR duplication, etc.).
 These transformations can invalidate naïve instrumentation and lead to misleading results.
+To enable instrumentation for Triton DSL, call `pl.enable_semantic("triton")` before `proton.start`.

 ```python
 from triton.experimental import gluon
@@ -152,21 +169,6 @@ def kernel(...):
 Advanced users can instrument either the `ttir` or `ttgir` intermediate representations for even finer-grained measurement. The relevant IR instructions are `proton.record start` and `proton.record end`. This can be combined with the environment variable `TRITON_KERNEL_OVERRIDE=1` for custom kernel overrides. For detailed steps, refer to the Triton [documentation](https://github.com/triton-lang/triton?tab=readme-ov-file#tips-for-hacking) under the **Kernel Override Steps** section. We have also assembled a [tutorial](tutorials/ttgir_override) that demonstrates how to use the IR-based instrumentation.

-#### Merging profiles for postmortem analysis
-
-We could use concurrent sessions to profile the same code region using different backends, and then merge the profiles using hatchet for postmortem analysis. In the following example, the `cupti` backend obtains different metrics than the `instrumentation` backend, and thus it makes sense to merge them using `GraphFrame.add` directly. Otherwise, if there are duplicate metrics, we could customize the `merge` logic or manipulate the dataframes.
-
-```python
-import triton.profiler as proton
-
-proton.start(name="profile_name0", context="shadow", backend="cupti")
-proton.start(name="profile_name1", context="shadow", backend="instrumentation")
-
-...
-
-proton.finalize()
-```

 ### Hook
@@ -206,6 +208,7 @@ bytes: int # The number of bytes expected to be transferred
 Proton can be used as a command-line tool to profile Python scripts and Pytest tests.
 The following examples demonstrate how to use the Proton command line.
+Detailed options can be found by running `proton -h`.

 ```bash
 proton [options] script.py [script_args] [script_options]
@@ -214,7 +217,8 @@ python -m triton.profiler.proton [options] script.py [script_args] [script_optio
 proton --instrument=[instrumentation pass] script.py
 ```

-When profiling in the command line mode, the `proton.start` and `proton.finalize` functions are automatically called before and after the script execution. Any `proton.start` and `proton.finalize` functions in the script are ignored. Also, in the command line mode, only a single *session* is supported. Therefore, `proton.deactivate(session_id=1)` is invalid, while `proton.deactivate(session_id=0)` is valid.
+When profiling in the command line mode, the `proton.start` and `proton.finalize` functions are automatically called before and after the script execution. Any `proton.start` and `proton.finalize` functions in the script are ignored. Also, in the command line mode, only a single *session* is supported.
+Therefore, `proton.deactivate(session_id=1)` is invalid, while `proton.deactivate(session_id=0)` is valid.
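As an illustrative sketch of that single-session behavior (not part of the commit; `warmup` and `run_workload` are hypothetical placeholders for your own code), the implicit session 0 can be toggled from inside a script run under `proton script.py`:

```python
import triton.profiler as proton

# In command-line mode, profiling starts automatically as session 0;
# any proton.start/proton.finalize calls in the script are ignored.
proton.deactivate(session_id=0)  # stop collecting during warm-up
warmup()
proton.activate(session_id=0)    # resume for the steady-state region
run_workload()
# proton.finalize() is invoked automatically after the script exits.
```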

 ### Visualizing the profile data

@@ -237,22 +241,26 @@ proton.start(name="profile_name", data="trace")
 The dumped trace will be in the chrome trace format and can be visualized using the `chrome://tracing` tool in Chrome or the [perfetto](https://perfetto.dev) tool.

-### Visualizing sorted profile data
-
 In addition to visualizing the profile data in the terminal through Hatchet, you can print a list of the kernels sorted by the first metric using the `--print-sorted` flag with proton-viewer.

 ```bash
 proton-viewer -m time/ns,time/% <profile.hatchet> --print-sorted
 ```

-prints the sorted kernels by the time/ns since it is the first listed.
-
 More options can be found by running the following command.

 ```bash
 proton-viewer -h
 ```

+## Knobs
+
+Triton's runtime has a centralized configuration system called *knobs* that controls various features and behaviors. The following knobs are defined for Proton:
+
+- `triton.knobs.proton.enable_nvtx` or `TRITON_ENABLE_NVTX` (default: `True`): Whether to enable NVTX ranges in Proton.
+- `triton.knobs.proton.cupti_lib_dir` or `TRITON_CUPTI_LIB_DIR` (default: `<triton_root>/backends/nvidia/lib/cupti`): The directory of the CUPTI library.
+
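As a sketch of how these knobs might be set (not part of the commit; the values and the CUPTI path are illustrative, and environment variables are typically read when Triton is imported):

```python
import triton

# Disable NVTX integration for this run (illustrative value).
triton.knobs.proton.enable_nvtx = False

# Point Proton at a custom CUPTI installation (hypothetical path).
triton.knobs.proton.cupti_lib_dir = "/opt/cuda/extras/CUPTI/lib64"
```

The equivalent environment-variable form would be `TRITON_ENABLE_NVTX=0 TRITON_CUPTI_LIB_DIR=/opt/cuda/extras/CUPTI/lib64 python script.py`.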
 ## Advanced features and knowledge

 ### Thread management
@@ -325,35 +333,25 @@ with proton.scope("test"):
 The call path of `foo1` will be `test->test1->state0`.

-## Proton *vs* nsys
-
-- Runtime overhead (up to 1.5x)
-
-Proton has a lower profiling overhead than nsys. Even for workload with a large number of small GPU kernels, proton triggers less than ~1.5x overhead.
-
-For GPU-bound workload, both proton and nsys has similar overhead, with little impact on the workload.
-
-The lower overhead of proton is due to its less profiling metrics and callbacks compared to nsys.
-
-- Profile size (significantly smaller than nsys)
-
-nsys traces and records every GPU kernel, while proton aggregates the metrics of GPU kernels under the same calling context.
-
-As a result, proton's profile size can be up to thousands of times smaller than nsys's profile size, depending on the running time.
-
-- Portability (support different GPUs)
-
-Proton is designed to be portable and can be used on AMD GPUs. nsys only supports NVIDIA GPUs.
-
-- Insights (more insightful than nsys on triton kernels)
-
-Proton can register hooks to analyze the metadata of triton kernels, while nsys cannot. **Note** that the hooks do add additional overhead to proton.
-
-## Proton *vs* ncu
-
-Similar to the comparison between Proton and Nsight Systems (Nsys), Proton has a lower profiling overhead than Nsight Compute (NCU). We also plan to support instruction sampling on AMD GPUs.
-However, Nsight Compute supports the collection of more detailed metrics than Proton, such as memory access patterns, memory transactions, and other instruction-level metrics.
-In contrast, Proton only supports instruction sampling and is designed to be lightweight and portable.
+## Proton *vs* Nsight tools
+
+| Aspect | Proton | Nsight Systems | Nsight Compute |
+| --- | --- | --- | --- |
+| Runtime overhead | Lower overhead | Higher overhead | Higher overhead |
+| Profile size | Compact profiles and traces | Large traces | Large traces |
+| Portability | Multi vendor | Nvidia only | Nvidia only |
+| Triton insights | Metadata hooks | No hooks | No hooks |
+| Metric depth | Lightweight metrics | Timeline metrics | Detailed metrics |
+
+**Runtime overhead.** Proton typically keeps slowdown below roughly 1.5×, even for workloads with many short-lived kernels, because it collects fewer metrics and registers fewer callbacks. Nsight Systems and Nsight Compute both impose higher overhead, though they behave similarly to Proton on purely GPU-bound workloads.
+
+**Profile size.** Proton aggregates kernels that share a calling context, so profile files stay compact, sometimes thousands of times smaller than Nsight traces. Both Nsight tools record each GPU kernel individually, which grows traces quickly during long runs.
+
+**Portability.** Proton already runs on AMD and NVIDIA GPUs and has a roadmap to extend instruction sampling to AMD hardware. Nsight Systems and Nsight Compute target NVIDIA GPUs exclusively.
+
+**Triton insights.** Proton can register Triton-specific hooks that surface kernel metadata for richer analysis, at the cost of a small extra overhead. Neither Nsight tool offers comparable Triton integration.
+
+**Metric depth.** Proton emphasizes lightweight metrics and instruction sampling for portability and fast iteration. Nsight Systems focuses on timeline-oriented metrics for NVIDIA GPUs, while Nsight Compute dives deeper into instruction-level details such as memory transactions and access patterns.

 ## Known issues