=============================
Profiling kernels compilation
=============================

If a kernel takes too much time to compile, the first step in profiling it
is to run it with the following environment variables set:

.. code-block:: bash

    export TRITON_ALWAYS_COMPILE=1
    export MLIR_ENABLE_TIMING=1
    export LLVM_ENABLE_TIMING=1
    python -m pytest python/test/unit/intel/test_mxfp_matmul.py::test_mxfp_matmul[True-True-float4-float4-True-True-1-128-128-128-1024-512-512] --device=xpu -s
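
If it is more convenient to drive the run from a script, the same invocation can be expressed in Python;
a minimal sketch (the test path and flags are taken verbatim from the command above):

.. code-block:: python

    import os
    import subprocess

    # Enable recompilation and the MLIR/LLVM timing reports for this run only.
    env = dict(os.environ,
               TRITON_ALWAYS_COMPILE="1",
               MLIR_ENABLE_TIMING="1",
               LLVM_ENABLE_TIMING="1")

    test = ("python/test/unit/intel/test_mxfp_matmul.py::test_mxfp_matmul"
            "[True-True-float4-float4-True-True-1-128-128-128-1024-512-512]")
    subprocess.run(["python", "-m", "pytest", test, "--device=xpu", "-s"], env=env, check=True)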

For a description of these environment variables, see the README.md file in the root directory of the project.
The output will contain timing statistics for each compilation phase and for the individual MLIR passes.
Here is an example:

.. code-block:: text

    ===-------------------------------------------------------------------------===
                            ... Execution time report ...
    ===-------------------------------------------------------------------------===
      Total Execution Time: 483.1849 seconds

      ----User Time----   ----Wall Time----   ----Name----
         0.0012 (  0.0%)     0.0012 (  0.0%)   SCFToControlFlowPass
         0.0015 (  0.0%)     0.0015 (  0.0%)   ConvertIndexToLLVMPass
         0.0516 (  0.0%)     0.0516 (  0.0%)   IntelAllocateSharedMemory
         0.0011 (  0.0%)     0.0011 (  0.0%)   TritonGPUGlobalScratchAllocationPass
         0.5947 (  0.1%)     0.5947 (  0.1%)   ConvertTritonIntelGPUToLLVM
         0.1232 (  0.0%)     0.1232 (  0.0%)   ConvertTritonGENToLLVM
         0.3430 (  0.1%)     0.3430 (  0.1%)   TritonIntelGPURewriteStackPtr
        15.3707 (  3.1%)    15.3707 (  3.2%)   Canonicalizer
         0.1650 (  0.0%)     0.1650 (  0.0%)   CSE
         0.0000 (  0.0%)     0.0000 (  0.0%)     (A) DominanceInfo
         0.0775 (  0.0%)     0.0775 (  0.0%)   ArithToLLVMConversionPass
         0.1934 (  0.0%)     0.1934 (  0.0%)   Canonicalizer
         0.1046 (  0.0%)     0.1046 (  0.0%)   CSE
         0.0000 (  0.0%)     0.0000 (  0.0%)     (A) DominanceInfo
         0.1470 (  0.0%)     0.1470 (  0.0%)   SymbolDCE
         0.0876 (  0.0%)     0.0876 (  0.0%)   LLVMDIScope
       483.1849 ( 96.6%)   465.9228 ( 96.4%)   Rest
       500.4471 (100.0%)   483.1849 (100.0%)   Total
      49 function calls in 465.865 seconds

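
For long reports it can be convenient to sort the per-pass timings programmatically. The following is a
minimal sketch, assuming the report above has been captured into a `timing.log` file (the file name is
arbitrary):

.. code-block:: python

    import re

    # Matches rows like: "  0.5947 ( 0.1%)  0.5947 ( 0.1%)  ConvertTritonIntelGPUToLLVM"
    ROW = re.compile(r"^\s*([\d.]+)\s*\(\s*[\d.]+%\)\s+([\d.]+)\s*\(\s*[\d.]+%\)\s+(\S.*)$")

    rows = []
    with open("timing.log") as f:
        for line in f:
            match = ROW.match(line)
            if match:
                wall_time = float(match.group(2))
                rows.append((wall_time, match.group(3).strip()))

    # Print the passes sorted by wall time, slowest first.
    for wall_time, name in sorted(rows, reverse=True):
        print(f"{wall_time:10.4f}  {name}")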

What we can see in the report is that the most time-consuming part of the compilation is a phase
labeled "Rest", which the report does not break down any further. For further investigation, we can
use Python's `cProfile` module. To profile individual functions,
we can use the following decorator:

.. code-block:: python

    def profile(func):
        import cProfile
        import pstats
        import functools

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            print(f"Profiling {func.__qualname__}")
            pr = cProfile.Profile()
            pr.enable()
            try:
                return func(*args, **kwargs)
            finally:
                pr.disable()
                # Print the 20 most expensive entries, sorted by cumulative time.
                ps = pstats.Stats(pr).sort_stats('cumulative')
                ps.print_stats(20)
        return wrapper

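If the report is hard to spot in the middle of a long pytest log, a variant of the decorator can dump the
raw statistics to a file for later inspection instead of printing them. A minimal sketch (the file-name
pattern is an arbitrary choice):

.. code-block:: python

    import cProfile
    import functools


    def profile_to_file(func):
        """Like `profile` above, but saves the raw stats to <qualname>.prof instead of printing."""

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            pr = cProfile.Profile()
            pr.enable()
            try:
                return func(*args, **kwargs)
            finally:
                pr.disable()
                pr.dump_stats(f"{func.__qualname__}.prof")

        return wrapper

    # The saved file can be inspected later, e.g.:
    #   import pstats; pstats.Stats("make_llir.prof").sort_stats("cumulative").print_stats(20)
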
After decorating the `make_llir` function:

.. code-block:: python

    @profile
    def make_llir(...):
        ...

and running the program again, we will get the following results:

.. code-block:: text

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1  465.552  465.552  465.552  465.552 {built-in method triton._C.libtriton.intel.optimize_module}
            1    0.231    0.231    0.231    0.231 {built-in method triton._C.libtriton.llvm.to_module}
    ...

The new profile information indicates that most of the time is spent in the `optimize_module` function,
which is a native function written in C++. To profile this function, we will use `gperftools`. On an Ubuntu
system, the following packages need to be installed:

.. code-block:: bash

    sudo apt install google-perftools libgoogle-perftools-dev

We also need to link the Triton library against the `profiler` library by editing the following line in the
`CMakeLists.txt` file:

.. code-block:: diff

    -target_link_libraries(TritonXPU PRIVATE Python3::Module pybind11::headers)
    +target_link_libraries(TritonXPU PRIVATE Python3::Module pybind11::headers profiler)

The `optimize_module` function is a Python binding implemented in `third_party/intel/triton_xpu.cc`.
To profile it, we need to add the following lines to the code:

.. code-block:: diff

    +#include <gperftools/profiler.h>

     ...

         "optimize_module",
         [](llvm::Module *mod, const llvm::OptimizationLevel &opt,
            std::string arch, std::string features, std::vector<std::string> flags,
            bool enable_fp_fusion) {
    +      ProfilerStart("optimize_module.prof");
           ...
    +      ProfilerStop();

In the code above, the `"optimize_module.prof"` string is the path of the file where the profiling results
will be saved. After running the program, we will get the `optimize_module.prof` binary file, whose contents
can be inspected with the `pprof-symbolize` tool:

.. code-block:: bash

    pprof-symbolize --text /usr/bin/python3 optimize_module.prof

    Total: 42300 samples
       13378  31.6%  31.6%    14589  34.5% __default_morecore@GLIBC_2.2.5
        7939  18.8%  50.4%     7939  18.8% llvm::APInt::countTrailingOnesSlowCase
        5810  13.7%  64.1%    11998  28.4% malloc@@GLIBC_2.2.5
        5007  11.8%  76.0%     5007  11.8% llvm::APInt::orAssignSlowCase
        3996   9.4%  85.4%    17814  42.1% llvm::APInt::initSlowCase
        2237   5.3%  90.7%    42125  99.6% findDemandedEltsByAllUsers
        1211   2.9%  93.6%     1211   2.9% timer_settime@GLIBC_2.2.5
        1153   2.7%  96.3%     1153   2.7% __nss_database_lookup@GLIBC_2.2.5
         538   1.3%  97.6%      538   1.3% _fini
         402   1.0%  98.5%     8803  20.8% free@@GLIBC_2.2.5
         160   0.4%  98.9%    12309  29.1% operator new@@GLIBCXX_3.4
         154   0.4%  99.3%      154   0.4% std::__once_callable@@GLIBCXX_3.4.11
         141   0.3%  99.6%      141   0.3% operator delete@@GLIBCXX_3.4
          14   0.0%  99.6%       14   0.0% llvm::BasicBlock::renumberInstructions
          12   0.0%  99.7%    42176  99.7% llvm::InstCombinerImpl::run
          11   0.0%  99.7%       12   0.0% llvm::SymbolTableListTraits::addNodeToList
          11   0.0%  99.7%       22   0.1% passingValueIsAlwaysUndefined
           7   0.0%  99.7%        7   0.0% llvm::AttributeList::hasFnAttr
           5   0.0%  99.7%        8   0.0% llvm::Instruction::mayThrow
           3   0.0%  99.7%        3   0.0% llvm::User::isDroppable
           3   0.0%  99.7%        3   0.0% llvm::Value::getName
           3   0.0%  99.8%        3   0.0% llvm::Value::setValueName
           3   0.0%  99.8%        9   0.0% llvm::isa@64af770
           2   0.0%  99.8%        2   0.0% isBlockInLCSSAForm
           2   0.0%  99.8%        5   0.0% llvm::CallBase::hasFnAttrOnCalledFunction
           2   0.0%  99.8%        7   0.0% llvm::InstCombinerImpl::visitCallBase
           2   0.0%  99.8%    42134  99.6% llvm::InstCombinerImpl::visitExtractElementInst

The fifth column shows the percentage of samples attributed to each function, including its callees.
From the above report, we can see that most of the time is spent in the
`InstCombinerImpl::run`/`visitExtractElementInst` functions. Most probably, the optimizer spends
so much time analysing `extractelement` operations. If we look at the Triton cache in
`~/.triton/cache`, we will find the kernel's `.llir` file, which contains a huge number
of `extractelement` operations, confirming our assumption.
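
This can be checked quickly by scanning the cached LLVM IR for `extractelement` occurrences; a minimal
sketch, assuming the default cache location `~/.triton/cache`:

.. code-block:: python

    from pathlib import Path

    # Count `extractelement` instructions in every cached .llir file.
    cache = Path.home() / ".triton" / "cache"
    for llir in sorted(cache.rglob("*.llir")):
        count = llir.read_text().count("extractelement")
        print(f"{count:8d}  {llir}")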
