Commit 240a5c8

[PROTON] Improve default buffer size description (#8650)
1 parent 4f8712b commit 240a5c8

2 files changed, 10 additions and 5 deletions

third_party/proton/Dialect/lib/ProtonToProtonGPU/ProtonToProtonGPUPass.cpp

Lines changed: 1 addition & 1 deletion
@@ -243,7 +243,7 @@ class ConvertProtonToProtonGPUPass
       if (bufferSize > 0)
         allocBufferSize = bufferSize.getValue();
       else
-        allocBufferSize = 16384 * segmentNum;
+        allocBufferSize = 16384 * segmentNum; // 16KB per profiling unit
     } else {
       mlir::emitError(loc, "buffer-type not supported");
       return failure();
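
The default-size rule in this hunk can be sketched in Python for readers following the README. This is an illustrative mirror of the branch, not Proton code: the function and argument names are hypothetical, and it assumes this branch handles the `global` buffer type, as the README change in this commit suggests.

```python
DEFAULT_BYTES_PER_UNIT = 16384  # 16KB, the constant used in the hunk above


def global_alloc_buffer_size(buffer_size: int, segment_num: int) -> int:
    """Hypothetical mirror of the branch: an explicit size wins,
    otherwise allocate 16KB for each profiling unit (segment)."""
    if buffer_size > 0:
        return buffer_size
    return DEFAULT_BYTES_PER_UNIT * segment_num


if __name__ == "__main__":
    # No explicit size, 4 profiling units: 4 * 16384 bytes.
    print(global_alloc_buffer_size(0, 4))
    # An explicit size is used unchanged.
    print(global_alloc_buffer_size(4096, 4))
```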

third_party/proton/tutorials/intra_kernel/README.md

Lines changed: 9 additions & 4 deletions
@@ -63,16 +63,19 @@ python3 example_dsl.py --increase-accuracy
 ## Understanding Timeline Traces
 
 ### Time Representation
+
 - **Scope Duration**: Displayed in cycles for precise measurement
 - **Threadblock Start Times**: Measured in nanoseconds using global timing
 - **Chrome Trace Format**: Assumes 1GHz GPU frequency for consistent time units (ns)
 
 ### Circular Buffer System
+
 - **Backend Storage**: Uses circular buffer for runtime profiling on each CTA
 - **Buffer Overflow**: When full, earlier events are dropped with warnings in trace generation
 - **Event Window**: Displays sliding window (the latest window) of recorded events in timeline
 
 ### Finalize Time Measurement
+
 - **Definition**: Captures `Finalize Time` when kernel execution completes
 - **Meaning**: Shows overhead of dumping profiling data from buffer to global memory (appears as a field in Chrome trace viewer tab)
 

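The circular-buffer behavior described in the README hunk above (oldest events dropped when full, only the latest window kept) can be illustrated with a small host-side sketch. This is not how Proton implements its on-device buffer; the `EventRing` class and its methods are hypothetical names used only to show the sliding-window semantics.

```python
from collections import deque


class EventRing:
    """Hypothetical fixed-capacity event buffer that keeps the latest events."""

    def __init__(self, capacity: int):
        # deque with maxlen discards the oldest entry on overflow.
        self.events = deque(maxlen=capacity)
        self.dropped = 0  # number of earlier events overwritten

    def record(self, event):
        if len(self.events) == self.events.maxlen:
            self.dropped += 1  # the oldest event is about to be discarded
        self.events.append(event)

    def window(self):
        """Return the surviving (latest) window of events, oldest first."""
        return list(self.events)


if __name__ == "__main__":
    ring = EventRing(capacity=3)
    for i in range(5):
        ring.record(i)
    print(ring.window(), ring.dropped)
```

A nonzero `dropped` count corresponds to the situation in which the README says warnings are emitted during trace generation.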
@@ -89,10 +92,10 @@ python3 example_dsl.py --increase-accuracy
 
 ### Buffer Configuration
 
-| Parameter | Options | Description |
-|-----------|---------|-------------|
-| `buffer_type` | `shared`, `global`| Storage location for profiling buffer |
-| `buffer_size` | `N` | Byte size of the profiling buffer (default: infer a small fraction of shared memory if `shared`. For `global`, 16384 bytes * num_sampled_warp) |
+| Buffer Type | Options | Default | Description |
+|-------------|---------|---------|-------------|
+| `buffer_type` | `shared`, `global` | `shared` | Determines whether profiling data is stored in shared or global memory |
+| `buffer_size` | Integer | `shared`: Maximum size without reducing occupancy; `global`: 16KB × number of profiled units (e.g., warp) | Controls per-block profiling buffer size in bytes |
 
 ### Sampling Configuration
 
@@ -106,11 +109,13 @@ python3 example_dsl.py --increase-accuracy
 ## Output Formats
 
 ### Timeline Traces
+
 - **Format**: Chrome trace format (`.chrome_trace` files)
 - **Viewer**: Chrome browser at `chrome://tracing` or [`Perfetto`](https://ui.perfetto.dev/)
 - **Content**: Detailed timeline with scope durations
 
 ### Operation Measurements
+
 - **Format**: Hatchet format (`.hatchet` files)
 - **Viewer**: `proton-viewer -m normalized_cycles <filename>.hatchet`
 (with `-m cycles` showing sum of all cycles across the GPU, `normalized_cycles` for per-warp averaged cycles)
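
The difference between the two metrics named above can be shown with a small arithmetic sketch. It assumes that "per-warp averaged" means the total cycle count divided by the number of profiled warps; the function names and sample numbers are hypothetical and are not taken from `proton-viewer`.

```python
def total_cycles(per_warp_cycles):
    """Sum of cycles over all profiled warps (the `cycles` reading)."""
    return sum(per_warp_cycles)


def normalized_cycles(per_warp_cycles):
    """Assumed reading of `normalized_cycles`: mean cycles per profiled warp."""
    if not per_warp_cycles:
        return 0.0
    return sum(per_warp_cycles) / len(per_warp_cycles)


if __name__ == "__main__":
    samples = [100, 200, 300]  # made-up cycle counts for three warps
    print(total_cycles(samples), normalized_cycles(samples))
```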
