Commit 80a5cfb

[PROTON][Experimental] Initialize instruction sampling support for NVIDIA GPUs (#4674)
This PR adds the "cupti_pcsampling" backend for collecting and attributing instruction samples to the corresponding GPU code, including the file path, function name, and line number. It currently serializes kernel execution so that kernel runtime and GPU samples can be collected in the same pass.
1 parent a6ecc75 commit 80a5cfb

29 files changed

+1115
-171
lines changed

.github/workflows/integration-tests.yml

Lines changed: 1 addition & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -25,6 +25,7 @@ env:
2525
TRITON_BUILD_WITH_CLANG_LLD: "TRUE"
2626
TRITON_USE_ASSERT_ENABLED_LLVM: "TRUE"
2727
TRITON_DISABLE_LINE_INFO: 1
28+
PROTON_SKIP_PC_SAMPLING_TEST: 1
2829
jobs:
2930
Runner-Preparation:
3031
runs-on: ubuntu-latest

.github/workflows/integration-tests.yml.in

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ env:
2727
TRITON_BUILD_WITH_CLANG_LLD: "TRUE"
2828
TRITON_USE_ASSERT_ENABLED_LLVM: "TRUE"
2929
TRITON_DISABLE_LINE_INFO: 1
30-
30+
PROTON_SKIP_PC_SAMPLING_TEST: 1
3131

3232
jobs:
3333
Runner-Preparation:

third_party/proton/README.md

Lines changed: 32 additions & 3 deletions
@@ -119,7 +119,7 @@ flops64: float # The number of 64-bit floating-point operations
119119
bytes: int # The number of bytes expected to be transferred
120120
```
121121

122-
### Command Line
122+
### Command line
123123

124124
Proton can be used as a command-line tool to profile Python scripts and Pytest tests.
125125
The following examples demonstrate how to use the Proton command-line tool.
@@ -149,6 +149,22 @@ More options can be found by running the following command.
149149
proton-viewer -h
150150
```
151151

152+
### Instruction sampling (experimental)
153+
154+
Proton supports instruction sampling on NVIDIA GPUs.
155+
Please note that this is an experimental feature and may not work on all GPUs.
156+
You may experience ~20x end-to-end overhead when using instruction sampling, although the overhead for each individual GPU kernel is negligible.
157+
The overhead is mostly caused by data transfer and processing on the CPU.
158+
Additionally, the proton-viewer options `-i <regex> -d <depth> -t <threshold>` can be helpful for filtering out GPU kernels that are not of interest.
159+
The following example demonstrates how to use instruction sampling:
160+
161+
```python
162+
import triton.profiler as proton
163+
164+
165+
proton.start(name="profile_name", context="shadow", backend="cupti_pcsampling")
166+
```
167+
152168
## Proton *vs* nsys
153169

154170
- Runtime overhead (up to 1.5x)
@@ -173,11 +189,24 @@ Proton is designed to be portable and can be used on AMD GPUs. nsys only support
173189

174190
Proton can register hooks to analyze the metadata of triton kernels, while nsys cannot. **Note** that the hooks do add additional overhead to proton.
175191

176-
## Known Issues
192+
## Proton *vs* ncu
193+
194+
Similar to the comparison between Proton and Nsight Systems (Nsys), Proton has a lower profiling overhead than Nsight Compute (NCU). We also plan to support instruction sampling on AMD GPUs.
195+
However, Nsight Compute supports the collection of more detailed metrics than Proton, such as memory access patterns, memory transactions, and other instruction-level metrics.
196+
In contrast, Proton only supports instruction sampling and is designed to be lightweight and portable.
197+
198+
## Known issues
177199

178-
- CUDA Graph
200+
- CUDA graph
179201

180202
`hooks` cannot be used to accurately accumulate the number of FLOPs in CUDA graph mode profiling because kernels are captured and launched separately; metrics are not accumulated when kernels are launched in graph mode. This issue can be circumvented by using `scope` to supply FLOPs.
181203

182204
If profiling is initiated after CUDA graph capturing, there may be minor memory leak issues.
183205
This is because the number of kernels in a graph instance (i.e., `cuGraphExec`) is unknown, preventing the deletion of mappings between the kernel ID and the graph ID.
206+
207+
- Instruction sampling
208+
209+
If you encounter permission-related problems when using instruction sampling, see this [page](https://developer.nvidia.com/nvidia-development-tools-solutions-err_nvgpuctrperm-permission-issue-performance-counters) for help.
210+
211+
Instruction sampling with Proton currently incurs about 20x overhead on NVIDIA GPUs because continuous sampling is not enabled yet.
212+
Continuous sampling can allow for more runtime optimizations, but it makes it more challenging to attribute performance data back to the GPU kernels because: (1) it enables profiling of concurrent kernels, (2) it doesn't allow profiling of time and instruction samples simultaneously, and (3) it works best if we have a separate thread dedicated to attributing instruction samples to the GPU kernels.

third_party/proton/csrc/include/Data/Metric.h

Lines changed: 73 additions & 3 deletions
@@ -7,7 +7,7 @@
77

88
namespace proton {
99

10-
enum class MetricKind { Flexible, Kernel, Count };
10+
enum class MetricKind { Flexible, Kernel, PCSampling, Count };
1111

1212
using MetricValueType = std::variant<uint64_t, int64_t, double, std::string>;
1313

@@ -143,8 +143,78 @@ class KernelMetric : public Metric {
143143
const static inline bool AGGREGABLE[kernelMetricKind::Count] = {
144144
false, false, true, true, false, false};
145145
const static inline std::string VALUE_NAMES[kernelMetricKind::Count] = {
146-
"StartTime (ns)", "EndTime (ns)", "Count",
147-
"Time (ns)", "DeviceId", "DeviceType",
146+
"start_time (ns)", "end_time (ns)", "count",
147+
"time (ns)", "device_id", "device_type",
148+
};
149+
};
150+
151+
class PCSamplingMetric : public Metric {
152+
public:
153+
enum PCSamplingMetricKind : int {
154+
NumSamples,
155+
NumStalledSamples,
156+
StalledBranchResolving,
157+
StalledNoInstruction,
158+
StalledShortScoreboard,
159+
StalledWait,
160+
StalledLongScoreboard,
161+
StalledTexThrottle,
162+
StalledBarrier,
163+
StalledMembar,
164+
StalledIMCMiss,
165+
StalledMIOThrottle,
166+
StalledMathPipeThrottle,
167+
StalledDrain,
168+
StalledLGThrottle,
169+
StalledNotSelected,
170+
StalledMisc,
171+
StalledDispatchStall,
172+
StalledSleeping,
173+
StalledSelected,
174+
Count,
175+
};
176+
177+
PCSamplingMetric()
178+
: Metric(MetricKind::PCSampling, PCSamplingMetricKind::Count) {}
179+
180+
PCSamplingMetric(PCSamplingMetricKind kind, uint64_t samples,
181+
uint64_t stalledSamples)
182+
: PCSamplingMetric() {
183+
this->values[kind] = stalledSamples;
184+
this->values[PCSamplingMetricKind::NumSamples] = samples;
185+
this->values[PCSamplingMetricKind::NumStalledSamples] = stalledSamples;
186+
}
187+
188+
virtual const std::string getName() const { return "PCSamplingMetric"; }
189+
190+
virtual const std::string getValueName(int valueId) const {
191+
return VALUE_NAMES[valueId];
192+
}
193+
194+
virtual bool isAggregable(int valueId) const { return true; }
195+
196+
private:
197+
const static inline std::string VALUE_NAMES[PCSamplingMetricKind::Count] = {
198+
"num_samples",
199+
"num_stalled_samples",
200+
"stalled_branch_resolving",
201+
"stalled_no_instruction",
202+
"stalled_short_scoreboard",
203+
"stalled_wait",
204+
"stalled_long_scoreboard",
205+
"stalled_tex_throttle",
206+
"stalled_barrier",
207+
"stalled_membar",
208+
"stalled_imc_miss",
209+
"stalled_mio_throttle",
210+
"stalled_math_pipe_throttle",
211+
"stalled_drain",
212+
"stalled_lg_throttle",
213+
"stalled_not_selected",
214+
"stalled_misc",
215+
"stalled_dispatch_stall",
216+
"stalled_sleeping",
217+
"stalled_selected",
148218
};
149219
};
150220
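
To make the `PCSamplingMetric` constructor above easier to follow, here is a minimal Python sketch of its three assignments. It is a model for illustration only, not Proton code: the enum is a shortened, renumbered subset, and `make_metric` and `aggregate` are hypothetical helper names. Element-wise summation is an assumption about what "aggregable" means here; the diff only shows that `isAggregable` returns `true` for every value.

```python
from enum import IntEnum


class PCSamplingMetricKind(IntEnum):
    # Shortened, renumbered subset of the C++ enum (illustrative only).
    NUM_SAMPLES = 0
    NUM_STALLED_SAMPLES = 1
    STALLED_LONG_SCOREBOARD = 2
    STALLED_BARRIER = 3
    COUNT = 4


def make_metric(kind, samples, stalled_samples):
    """Mirror the three assignments made by the C++ constructor."""
    values = [0] * PCSamplingMetricKind.COUNT
    # The slot for the specific stall reason receives the stalled count.
    values[kind] = stalled_samples
    # The two summary slots are always filled in.
    values[PCSamplingMetricKind.NUM_SAMPLES] = samples
    values[PCSamplingMetricKind.NUM_STALLED_SAMPLES] = stalled_samples
    return values


def aggregate(a, b):
    """Assumed aggregation: add two metrics slot by slot."""
    return [x + y for x, y in zip(a, b)]


long_scoreboard = make_metric(PCSamplingMetricKind.STALLED_LONG_SCOREBOARD, 10, 4)
barrier = make_metric(PCSamplingMetricKind.STALLED_BARRIER, 6, 6)
total = aggregate(long_scoreboard, barrier)
```

Each metric object describes one stall reason, so a per-line breakdown across reasons only appears once several metrics are combined.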

third_party/proton/csrc/include/Driver/Dispatch.h

Lines changed: 5 additions & 5 deletions
@@ -63,17 +63,17 @@ template <typename ExternLib> class Dispatch {
6363
*lib = dlopen(name, RTLD_NOLOAD);
6464
}
6565
if (*lib == nullptr) {
66-
// If not found, try to load it from the default path
66+
// If not found, try to load it from LD_LIBRARY_PATH
67+
*lib = dlopen(name, RTLD_LOCAL | RTLD_LAZY);
68+
}
69+
if (*lib == nullptr) {
70+
// If still not found, try to load it from the default path
6771
auto dir = std::string(ExternLib::defaultDir);
6872
if (dir.length() > 0) {
6973
auto fullPath = dir + "/" + name;
7074
*lib = dlopen(fullPath.c_str(), RTLD_LOCAL | RTLD_LAZY);
7175
}
7276
}
73-
if (*lib == nullptr) {
74-
// If still not found, try to load it from LD_LIBRARY_PATH
75-
*lib = dlopen(name, RTLD_LOCAL | RTLD_LAZY);
76-
}
7777
if (*lib == nullptr) {
7878
throw std::runtime_error("Could not find `" + std::string(name) +
7979
"`. Make sure it is in your "

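The reordering in `Dispatch.h` changes which copy of a library wins when several are available. The sketch below models the lookup order visible in this hunk: an `RTLD_NOLOAD` attempt, then a bare-name `dlopen` that the code comment associates with `LD_LIBRARY_PATH`, then the default directory. It does not call `dlopen`; `try_load` is a hypothetical stand-in that returns a handle or `None`, and `resolve_library` and the step labels are names invented for this example.

```python
def resolve_library(name, default_dir, try_load):
    """Return (step, handle) for the first lookup step that succeeds.

    try_load(path, already_loaded_only) is a hypothetical replacement
    for dlopen: it returns a handle, or None when loading fails.
    """
    # Step 1: reuse a copy already loaded in the process (RTLD_NOLOAD).
    handle = try_load(name, True)
    if handle is not None:
        return "already_loaded", handle

    # Step 2: load by bare name, which lets the dynamic loader search
    # its usual locations, including LD_LIBRARY_PATH.
    handle = try_load(name, False)
    if handle is not None:
        return "search_path", handle

    # Step 3: fall back to the default directory, if one is configured.
    if default_dir:
        handle = try_load(default_dir + "/" + name, False)
        if handle is not None:
            return "default_dir", handle

    raise RuntimeError("Could not find `" + name + "`")


def make_fake_loader(available):
    """Build a try_load that only 'finds' the given paths."""
    def try_load(path, already_loaded_only):
        if already_loaded_only:
            return None  # nothing is preloaded in this simulation
        return path if path in available else None
    return try_load
```

With this order, a library reachable through the loader's search path takes precedence over the one in the default directory, which is the opposite of the behavior before this commit.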
third_party/proton/csrc/include/Driver/GPU/CuptiApi.h

Lines changed: 40 additions & 0 deletions
@@ -2,11 +2,17 @@
22
#define PROTON_DRIVER_GPU_CUPTI_H_
33

44
#include "cupti.h"
5+
#include "cupti_pcsampling.h"
56

67
namespace proton {
78

89
namespace cupti {
910

11+
template <bool CheckSuccess> CUptiResult getVersion(uint32_t *version);
12+
13+
template <bool CheckSuccess>
14+
CUptiResult getContextId(CUcontext context, uint32_t *pCtxId);
15+
1016
template <bool CheckSuccess>
1117
CUptiResult activityRegisterCallbacks(
1218
CUpti_BuffersCallbackRequestFunc funcBufferRequested,
@@ -66,6 +72,40 @@ CUptiResult getGraphExecId(CUgraphExec graph, uint32_t *pId);
6672
template <bool CheckSuccess>
6773
CUptiResult getGraphId(CUgraph graph, uint32_t *pId);
6874

75+
template <bool CheckSuccess>
76+
CUptiResult getCubinCrc(CUpti_GetCubinCrcParams *pParams);
77+
78+
template <bool CheckSuccess>
79+
CUptiResult
80+
getSassToSourceCorrelation(CUpti_GetSassToSourceCorrelationParams *pParams);
81+
82+
template <bool CheckSuccess>
83+
CUptiResult
84+
pcSamplingGetNumStallReasons(CUpti_PCSamplingGetNumStallReasonsParams *pParams);
85+
86+
template <bool CheckSuccess>
87+
CUptiResult
88+
pcSamplingGetStallReasons(CUpti_PCSamplingGetStallReasonsParams *pParams);
89+
90+
template <bool CheckSuccess>
91+
CUptiResult pcSamplingSetConfigurationAttribute(
92+
CUpti_PCSamplingConfigurationInfoParams *pParams);
93+
94+
template <bool CheckSuccess>
95+
CUptiResult pcSamplingEnable(CUpti_PCSamplingEnableParams *pParams);
96+
97+
template <bool CheckSuccess>
98+
CUptiResult pcSamplingDisable(CUpti_PCSamplingDisableParams *pParams);
99+
100+
template <bool CheckSuccess>
101+
CUptiResult pcSamplingGetData(CUpti_PCSamplingGetDataParams *pParams);
102+
103+
template <bool CheckSuccess>
104+
CUptiResult pcSamplingStart(CUpti_PCSamplingStartParams *pParams);
105+
106+
template <bool CheckSuccess>
107+
CUptiResult pcSamplingStop(CUpti_PCSamplingStopParams *pParams);
108+
69109
} // namespace cupti
70110

71111
} // namespace proton
Lines changed: 141 additions & 0 deletions
@@ -0,0 +1,141 @@
1+
#ifndef PROTON_PROFILER_CUPTI_PC_SAMPLING_H_
2+
#define PROTON_PROFILER_CUPTI_PC_SAMPLING_H_
3+
4+
#include "CuptiProfiler.h"
5+
#include "Driver/GPU/CudaApi.h"
6+
#include "Driver/GPU/CuptiApi.h"
7+
#include "Utility/Map.h"
8+
#include "Utility/Singleton.h"
9+
#include <atomic>
10+
#include <mutex>
11+
12+
namespace proton {
13+
14+
struct CubinData {
15+
size_t cubinCrc;
16+
const char *cubin;
17+
size_t cubinSize;
18+
19+
struct LineInfoKey {
20+
uint32_t functionIndex;
21+
uint64_t pcOffset;
22+
23+
bool operator<(const LineInfoKey &other) const {
24+
return functionIndex < other.functionIndex ||
25+
(functionIndex == other.functionIndex &&
26+
pcOffset < other.pcOffset);
27+
}
28+
};
29+
30+
struct LineInfoValue {
31+
uint32_t lineNumber{};
32+
const std::string functionName{};
33+
const std::string dirName{};
34+
const std::string fileName{};
35+
36+
LineInfoValue() = default;
37+
38+
LineInfoValue(uint32_t lineNumber, const std::string &functionName,
39+
const std::string &dirName, const std::string &fileName)
40+
: lineNumber(lineNumber), functionName(functionName), dirName(dirName),
41+
fileName(fileName) {}
42+
};
43+
44+
std::map<LineInfoKey, LineInfoValue> lineInfo;
45+
};
46+
47+
struct ConfigureData {
48+
ConfigureData() = default;
49+
50+
~ConfigureData() {
51+
if (stallReasonNames) {
52+
for (size_t i = 0; i < numStallReasons; i++) {
53+
if (stallReasonNames[i])
54+
std::free(stallReasonNames[i]);
55+
}
56+
std::free(stallReasonNames);
57+
}
58+
if (stallReasonIndices)
59+
std::free(stallReasonIndices);
60+
if (pcSamplingData.pPcData) {
61+
for (size_t i = 0; i < numValidStallReasons; ++i) {
62+
std::free(pcSamplingData.pPcData[i].stallReason);
63+
}
64+
std::free(pcSamplingData.pPcData);
65+
}
66+
}
67+
68+
void initialize(CUcontext context);
69+
70+
CUpti_PCSamplingConfigurationInfo configureStallReasons();
71+
CUpti_PCSamplingConfigurationInfo configureSamplingPeriod();
72+
CUpti_PCSamplingConfigurationInfo configureSamplingBuffer();
73+
CUpti_PCSamplingConfigurationInfo configureScratchBuffer();
74+
CUpti_PCSamplingConfigurationInfo configureHardwareBufferSize();
75+
CUpti_PCSamplingConfigurationInfo configureStartStopControl();
76+
CUpti_PCSamplingConfigurationInfo configureCollectionMode();
77+
78+
// The amount of data reserved on the GPU
79+
static constexpr size_t HardwareBufferSize = 128 * 1024 * 1024;
80+
// The amount of data copied from the hardware buffer each time
81+
static constexpr size_t ScratchBufferSize = 16 * 1024 * 1024;
82+
// The number of PCs copied from the scratch buffer each time
83+
static constexpr size_t DataBufferPCCount = 1024;
84+
// The sampling period in cycles = 2^frequency
85+
static constexpr uint32_t DefaultFrequency = 10;
86+
87+
CUcontext context{};
88+
uint32_t contextId;
89+
uint32_t numStallReasons{};
90+
uint32_t numValidStallReasons{};
91+
char **stallReasonNames{};
92+
uint32_t *stallReasonIndices{};
93+
std::map<size_t, size_t> stallReasonIndexToMetricIndex{};
94+
std::set<size_t> notIssuedStallReasonIndices{};
95+
CUpti_PCSamplingData pcSamplingData{};
96+
// The memory storing configuration information has to be kept alive during
97+
// the profiling session
98+
std::vector<CUpti_PCSamplingConfigurationInfo> configurationInfos;
99+
};
100+
101+
class CuptiPCSampling : public Singleton<CuptiPCSampling> {
102+
103+
public:
104+
CuptiPCSampling() = default;
105+
virtual ~CuptiPCSampling() = default;
106+
107+
void initialize(CUcontext context);
108+
109+
void start(CUcontext context);
110+
111+
void stop(CUcontext context, uint64_t externId, bool isAPI);
112+
113+
void finalize(CUcontext context);
114+
115+
void loadModule(const char *cubin, size_t cubinSize);
116+
117+
void unloadModule(const char *cubin, size_t cubinSize);
118+
119+
private:
120+
ConfigureData *getConfigureData(uint32_t contextId);
121+
122+
CubinData *getCubinData(uint64_t cubinCrc);
123+
124+
void processPCSamplingData(ConfigureData *configureData, uint64_t externId,
125+
bool isAPI);
126+
127+
ThreadSafeMap<uint32_t, ConfigureData> contextIdToConfigureData;
128+
// In case the same cubin is loaded multiple times, we need to keep track of
129+
// all of them
130+
ThreadSafeMap<size_t, std::pair<CubinData, /*count=*/size_t>>
131+
cubinCrcToCubinData;
132+
ThreadSafeSet<uint32_t> contextInitialized;
133+
134+
std::atomic<bool> pcSamplingStarted{false};
135+
std::mutex pcSamplingMutex{};
136+
std::mutex contextMutex{};
137+
};
138+
139+
} // namespace proton
140+
141+
#endif // PROTON_PROFILER_CUPTI_PC_SAMPLING_H_
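
Two details of the header above can be checked with a little arithmetic. The Python sketch below reproduces the constants from `ConfigureData` and transcribes `LineInfoKey::operator<`; it is illustrative only. `sampling_period_cycles` and `key_less` are hypothetical names, and keys are modelled as `(function_index, pc_offset)` tuples rather than the C++ struct.

```python
# Constants copied from ConfigureData in the header.
HARDWARE_BUFFER_SIZE = 128 * 1024 * 1024  # bytes reserved on the GPU
SCRATCH_BUFFER_SIZE = 16 * 1024 * 1024    # bytes copied per transfer
DEFAULT_FREQUENCY = 10


def sampling_period_cycles(frequency):
    """Per the header comment: sampling period in cycles = 2^frequency."""
    return 1 << frequency


def key_less(a, b):
    """Transcription of LineInfoKey::operator< for (function_index, pc_offset)."""
    return a[0] < b[0] or (a[0] == b[0] and a[1] < b[1])


default_period = sampling_period_cycles(DEFAULT_FREQUENCY)
scratch_copies_per_hardware_buffer = HARDWARE_BUFFER_SIZE // SCRATCH_BUFFER_SIZE

keys = [(1, 16), (0, 64), (1, 8), (0, 0)]
ordered = sorted(keys)  # Python tuples compare lexicographically
```

The comparison is lexicographic, so entries in the `lineInfo` map are grouped by function and ordered by PC offset within each function. Python's built-in tuple ordering behaves the same way, which is why `sorted(keys)` needs no custom comparator here.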

third_party/proton/csrc/include/Profiler/CuptiProfiler.h renamed to third_party/proton/csrc/include/Profiler/Cupti/CuptiProfiler.h

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
11
#ifndef PROTON_PROFILER_CUPTI_PROFILER_H_
22
#define PROTON_PROFILER_CUPTI_PROFILER_H_
33

4-
#include "GPUProfiler.h"
4+
#include "Profiler/GPUProfiler.h"
55

66
namespace proton {
77
