============================
Profiling kernel compilation
============================

If a kernel takes too much time to compile, the first step in profiling it
is to run it with the following environment variables set:

.. code-block:: bash

    export TRITON_ALWAYS_COMPILE=1
    export MLIR_ENABLE_TIMING=1
    export LLVM_ENABLE_TIMING=1
    python -m pytest python/test/unit/intel/test_mxfp_matmul.py::test_mxfp_matmul[True-True-float4-float4-True-True-1-128-128-128-1024-512-512] --device=xpu -s
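
If it is more convenient to keep everything in one script, the same variables can
also be set from Python. A minimal sketch; setting them before the first `triton`
import is the safe choice, since some variables may be read at import time:

.. code-block:: python

    import os

    # Must run before the first `import triton`.
    os.environ["TRITON_ALWAYS_COMPILE"] = "1"  # recompile even on a cache hit
    os.environ["MLIR_ENABLE_TIMING"] = "1"     # per-pass MLIR timing report
    os.environ["LLVM_ENABLE_TIMING"] = "1"     # LLVM phase timing report

    import triton  # noqa: E402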

For a description of these environment variables, see the README.md file in the root
directory of the project. The output will contain timing statistics for each phase
of the compilation and for the individual MLIR passes. Here is an example:

.. code-block:: text

    ===-------------------------------------------------------------------------===
      ... Execution time report ...
    ===-------------------------------------------------------------------------===
    Total Execution Time: 483.1849 seconds

    ----User Time----    ----Wall Time----    ----Name----
      0.0012 (  0.0%)      0.0012 (  0.0%)   SCFToControlFlowPass
      0.0015 (  0.0%)      0.0015 (  0.0%)   ConvertIndexToLLVMPass
      0.0516 (  0.0%)      0.0516 (  0.0%)   IntelAllocateSharedMemory
      0.0011 (  0.0%)      0.0011 (  0.0%)   TritonGPUGlobalScratchAllocationPass
      0.5947 (  0.1%)      0.5947 (  0.1%)   ConvertTritonIntelGPUToLLVM
      0.1232 (  0.0%)      0.1232 (  0.0%)   ConvertTritonGENToLLVM
      0.3430 (  0.1%)      0.3430 (  0.1%)   TritonIntelGPURewriteStackPtr
     15.3707 (  3.1%)     15.3707 (  3.2%)   Canonicalizer
      0.1650 (  0.0%)      0.1650 (  0.0%)   CSE
      0.0000 (  0.0%)      0.0000 (  0.0%)   (A) DominanceInfo
      0.0775 (  0.0%)      0.0775 (  0.0%)   ArithToLLVMConversionPass
      0.1934 (  0.0%)      0.1934 (  0.0%)   Canonicalizer
      0.1046 (  0.0%)      0.1046 (  0.0%)   CSE
      0.0000 (  0.0%)      0.0000 (  0.0%)   (A) DominanceInfo
      0.1470 (  0.0%)      0.1470 (  0.0%)   SymbolDCE
      0.0876 (  0.0%)      0.0876 (  0.0%)   LLVMDIScope
    483.1849 ( 96.6%)    465.9228 ( 96.4%)   Rest
    500.4471 (100.0%)    483.1849 (100.0%)   Total
    49 function calls in 465.865 seconds

What we can see here is that the most time-consuming part of the compilation is
reported simply as "Rest", which tells us nothing about what it actually is. For
further investigation, we can use Python's `cProfile` module. To profile individual
functions, we can use the following decorator:

.. code-block:: python

    def profile(func):
        import cProfile
        import functools
        import pstats

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            print(f"Profiling {func.__qualname__}")
            pr = cProfile.Profile()
            pr.enable()
            try:
                return func(*args, **kwargs)
            finally:
                # Print the stats even if the wrapped function raises.
                pr.disable()
                ps = pstats.Stats(pr).sort_stats('cumulative')
                ps.print_stats(20)  # show the 20 most expensive entries
        return wrapper
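
As a quick sanity check, the decorator can be tried on any Python function;
`busy_work` below is a made-up example, not part of Triton:

.. code-block:: python

    @profile
    def busy_work(n):
        # Deliberately slow pure-Python loop, just to produce some stats.
        return sum(i * i for i in range(n))

    busy_work(10_000_000)  # prints the profile when the call returns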

After decorating the `make_llir` function:

.. code-block:: python

    @profile
    def make_llir(...):
        ...

and running the program again, we will get the following results:

.. code-block:: text

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1  465.552  465.552  465.552  465.552 {built-in method triton._C.libtriton.intel.optimize_module}
         1    0.231    0.231    0.231    0.231 {built-in method triton._C.libtriton.llvm.to_module}
    ...

The new profile information indicates that most of the time is spent in the
`optimize_module` function, which is a native function written in C++. To profile
this function, we will use `gperftools`. On an Ubuntu system, the following packages
need to be installed:

.. code-block:: bash

    sudo apt install google-perftools libgoogle-perftools-dev

We also need to link the Triton library with the `profiler` library by editing the
following line in the `CMakeLists.txt` file:

.. code-block:: cmake

    -target_link_libraries(TritonXPU PRIVATE Python3::Module pybind11::headers)
    +target_link_libraries(TritonXPU PRIVATE Python3::Module pybind11::headers profiler)

The `optimize_module` function is a Python binding implemented in
`third_party/intel/triton_xpu.cc`. To profile it, we need to add the following
lines to the code:

.. code-block:: cpp

    +#include <gperftools/profiler.h>

    ...

    "optimize_module",
    [](llvm::Module *mod, const llvm::OptimizationLevel &opt,
       std::string arch, std::string features, std::vector<std::string> flags,
       bool enable_fp_fusion) {
    +  ProfilerStart("optimize_module.prof");
       ...
    +  ProfilerStop();

In the code above, the `"optimize_module.prof"` string is the file path where the
profiling results will be saved. After rebuilding Triton and running the program
again, we will get the `optimize_module.prof` binary file, which can be visualized
using the `pprof-symbolize` tool:

.. code-block:: bash

    pprof-symbolize --text /usr/bin/python3 optimize_module.prof

    Total: 42300 samples
    13378  31.6%  31.6%  14589  34.5% __default_morecore@GLIBC_2.2.5
     7939  18.8%  50.4%   7939  18.8% llvm::APInt::countTrailingOnesSlowCase
     5810  13.7%  64.1%  11998  28.4% malloc@@GLIBC_2.2.5
     5007  11.8%  76.0%   5007  11.8% llvm::APInt::orAssignSlowCase
     3996   9.4%  85.4%  17814  42.1% llvm::APInt::initSlowCase
     2237   5.3%  90.7%  42125  99.6% findDemandedEltsByAllUsers
     1211   2.9%  93.6%   1211   2.9% timer_settime@GLIBC_2.2.5
     1153   2.7%  96.3%   1153   2.7% __nss_database_lookup@GLIBC_2.2.5
      538   1.3%  97.6%    538   1.3% _fini
      402   1.0%  98.5%   8803  20.8% free@@GLIBC_2.2.5
      160   0.4%  98.9%  12309  29.1% operator new@@GLIBCXX_3.4
      154   0.4%  99.3%    154   0.4% std::__once_callable@@GLIBCXX_3.4.11
      141   0.3%  99.6%    141   0.3% operator delete@@GLIBCXX_3.4
       14   0.0%  99.6%     14   0.0% llvm::BasicBlock::renumberInstructions
       12   0.0%  99.7%  42176  99.7% llvm::InstCombinerImpl::run
       11   0.0%  99.7%     12   0.0% llvm::SymbolTableListTraits::addNodeToList
       11   0.0%  99.7%     22   0.1% passingValueIsAlwaysUndefined
        7   0.0%  99.7%      7   0.0% llvm::AttributeList::hasFnAttr
        5   0.0%  99.7%      8   0.0% llvm::Instruction::mayThrow
        3   0.0%  99.7%      3   0.0% llvm::User::isDroppable
        3   0.0%  99.7%      3   0.0% llvm::Value::getName
        3   0.0%  99.8%      3   0.0% llvm::Value::setValueName
        3   0.0%  99.8%      9   0.0% llvm::isa@64af770
        2   0.0%  99.8%      2   0.0% isBlockInLCSSAForm
        2   0.0%  99.8%      5   0.0% llvm::CallBase::hasFnAttrOnCalledFunction
        2   0.0%  99.8%      7   0.0% llvm::InstCombinerImpl::visitCallBase
        2   0.0%  99.8%  42134  99.6% llvm::InstCombinerImpl::visitExtractElementInst

The fifth column shows the percentage of samples in which each function appears,
including its callees. From the above report, we can see that most of the time is
spent in the `InstCombinerImpl::run`/`visitExtractElementInst` functions. Most
probably, the optimizer spends this time analysing `extractelement` operations. If
we look at the Triton cache in `~/.triton/cache`, we will find the `kernel.llir`
file, which contains a huge number of `extractelement` operations, confirming our
assumption.
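
The assumption is easy to verify by counting the occurrences directly; a small
sketch (the cache directory layout may differ between Triton versions):

.. code-block:: python

    from pathlib import Path

    # Count textual occurrences of `extractelement` in every cached LLVM IR file.
    cache = Path.home() / ".triton" / "cache"
    for llir in sorted(cache.rglob("*.llir")):
        count = llir.read_text().count("extractelement")
        print(f"{llir}: {count} occurrences of 'extractelement'")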