|
| 1 | +.. SPDX-License-Identifier: GPL-2.0 |
| 2 | +.. _imc: |
| 3 | + |
| 4 | +=================================== |
| 5 | +IMC (In-Memory Collection Counters) |
| 6 | +=================================== |
| 7 | + |
| 8 | +Anju T Sudhakar, 10 May 2019 |
| 9 | + |
| 10 | +.. contents:: |
| 11 | + :depth: 3 |
| 12 | + |
| 13 | + |
| 14 | +Basic overview |
| 15 | +============== |
| 16 | + |
| 17 | +IMC (In-Memory Collection Counters) is a hardware monitoring facility that |
| 18 | +collects large numbers of hardware performance events at Nest level (these are |
| 19 | +on-chip but off-core), Core level and Thread level. |
| 20 | + |
| 21 | +The Nest PMU counters are handled by Nest IMC microcode that runs in the OCC |
| 22 | +(On-Chip Controller) complex. The microcode collects the nest IMC counter data |
| 23 | +and moves it to memory. |
| 24 | + |
| 25 | +The Core and Thread IMC PMU counters are handled in the core. Core-level PMU |
| 26 | +counters provide the IMC counter data per core, and thread-level PMU counters |
| 27 | +provide the IMC counter data per CPU thread. |
| 28 | + |
| 29 | +OPAL obtains the IMC PMU and supported events information from the IMC Catalog |
| 30 | +and passes it on to the kernel via the device tree. The event information |
| 31 | +contains: |
| 32 | + |
| 33 | +- Event name |
| 34 | +- Event Offset |
| 35 | +- Event description |
| 36 | + |
| 37 | +and possibly also: |
| 38 | + |
| 39 | +- Event scale |
| 40 | +- Event unit |
| 41 | + |
| 42 | +Some PMUs may have common scale and unit values for all their supported |
| 43 | +events. In those cases, the events must inherit the scale and unit properties |
| 44 | +from the PMU. |
| 45 | + |
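As a rough illustration of the inheritance rule, the sketch below resolves an event's scale and unit by falling back to the PMU's values. The dictionary layout, the `resolve_event` helper, and the sample names and values are hypothetical and are not taken from the kernel or the device tree binding. Letting an event's own value take precedence over the PMU-wide one is likewise an assumption.

```python
def resolve_event(pmu, event):
    """Return a copy of event with scale/unit filled in from the PMU if absent."""
    resolved = dict(event)
    for key in ("scale", "unit"):
        # Assumption: a value set on the event itself wins over the PMU-wide one.
        if key not in resolved and key in pmu:
            resolved[key] = pmu[key]
    return resolved


# Hypothetical PMU with a common scale and unit for all of its events.
pmu = {"name": "example_pmu", "scale": "0.5", "unit": "MiB"}
# Hypothetical event that carries only the mandatory properties.
event = {"name": "EXAMPLE_EVENT", "offset": 0x28}

print(resolve_event(pmu, event))
```

The original event dictionary is left untouched, so the same PMU defaults can be applied to every event in turn.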
| 46 | +The event offset is the location in memory where the counter data is accumulated. |
| 47 | + |
| 48 | +The IMC catalog is available at: |
| 49 | + https://github.com/open-power/ima-catalog |
| 50 | + |
| 51 | +The kernel discovers the IMC counter information in the device tree at the |
| 52 | +`imc-counters` device node, which has the compatible field |
| 53 | +`ibm,opal-in-memory-counters`. From the device tree, the kernel parses the PMUs |
| 54 | +and their event information and registers each PMU and its attributes in the |
| 55 | +kernel. |
| 56 | + |
| 57 | +IMC example usage |
| 58 | +================= |
| 59 | + |
| 60 | +.. code-block:: sh |
| 61 | +
|
| 62 | + # perf list |
| 63 | + [...] |
| 64 | + nest_mcs01/PM_MCS01_64B_RD_DISP_PORT01/ [Kernel PMU event] |
| 65 | + nest_mcs01/PM_MCS01_64B_RD_DISP_PORT23/ [Kernel PMU event] |
| 66 | + [...] |
| 67 | + core_imc/CPM_0THRD_NON_IDLE_PCYC/ [Kernel PMU event] |
| 68 | + core_imc/CPM_1THRD_NON_IDLE_INST/ [Kernel PMU event] |
| 69 | + [...] |
| 70 | + thread_imc/CPM_0THRD_NON_IDLE_PCYC/ [Kernel PMU event] |
| 71 | + thread_imc/CPM_1THRD_NON_IDLE_INST/ [Kernel PMU event] |
| 72 | +
|
| 73 | +To see per-chip data for nest_mcs01/PM_MCS01_64B_WR_DISP_PORT01/: |
| 74 | + |
| 75 | +.. code-block:: sh |
| 76 | +
|
| 77 | + # ./perf stat -e "nest_mcs01/PM_MCS01_64B_WR_DISP_PORT01/" -a --per-socket |
| 78 | +
|
| 79 | +To see non-idle instructions for core 0: |
| 80 | + |
| 81 | +.. code-block:: sh |
| 82 | +
|
| 83 | + # ./perf stat -e "core_imc/CPM_NON_IDLE_INST/" -C 0 -I 1000 |
| 84 | +
|
| 85 | +To see non-idle cycles for a "make": |
| 86 | + |
| 87 | +.. code-block:: sh |
| 88 | +
|
| 89 | + # ./perf stat -e "thread_imc/CPM_NON_IDLE_PCYC/" make |
| 90 | +
|
| 91 | +
|
| 92 | +IMC Trace-mode |
| 93 | +=============== |
| 94 | + |
| 95 | +POWER9 supports two modes for IMC: Accumulation mode and Trace mode. In |
| 96 | +Accumulation mode, event counts are accumulated in system memory, and the |
| 97 | +hypervisor reads the posted counts periodically or when requested. In IMC |
| 98 | +Trace mode, the 64-bit trace SCOM value is initialized with the event |
| 99 | +information. The CPMCxSEL and CPMC_LOAD fields in the trace SCOM specify the |
| 100 | +event to be monitored and the sampling duration. On each overflow in the CPMCxSEL, |
| 101 | +hardware snapshots the program counter along with event counts and writes them |
| 102 | +into the memory pointed to by LDBAR. |
| 103 | + |
| 104 | +LDBAR is a 64-bit special-purpose per-thread register. It has bits that indicate |
| 105 | +whether the hardware is configured for accumulation or trace mode. |
| 106 | + |
| 107 | +LDBAR Register Layout |
| 108 | +--------------------- |
| 109 | + |
| 110 | +    +-------+----------------------+ |
| 111 | +    | 0     | Enable/Disable       | |
| 112 | +    +-------+----------------------+ |
| 113 | +    | 1     | 0: Accumulation Mode | |
| 114 | +    |       +----------------------+ |
| 115 | +    |       | 1: Trace Mode        | |
| 116 | +    +-------+----------------------+ |
| 117 | +    | 2:3   | Reserved             | |
| 118 | +    +-------+----------------------+ |
| 119 | +    | 4:6   | PB scope             | |
| 120 | +    +-------+----------------------+ |
| 121 | +    | 7     | Reserved             | |
| 122 | +    +-------+----------------------+ |
| 123 | +    | 8:50  | Counter Address      | |
| 124 | +    +-------+----------------------+ |
| 125 | +    | 51:63 | Reserved             | |
| 126 | +    +-------+----------------------+ |
| 127 | + |
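The LDBAR layout can be read as a set of bit fields. The sketch below decodes a raw 64-bit value into those fields. It assumes the bit positions in the table use the Power convention in which bit 0 is the most significant bit; the table itself does not say so. The helper names `field` and `decode_ldbar` and the sample value are hypothetical. The counter address is returned as the raw field contents, since the table does not specify how that field maps to a physical address.

```python
def field(value, first, last, width=64):
    """Extract bits first..last (inclusive), counting bit 0 as the MSB."""
    nbits = last - first + 1
    shift = width - 1 - last          # distance of the field's last bit from the LSB
    return (value >> shift) & ((1 << nbits) - 1)


def decode_ldbar(value):
    """Split a raw LDBAR value into the non-reserved fields from the table."""
    return {
        "enable": field(value, 0, 0),
        "trace_mode": field(value, 1, 1),      # 0: accumulation, 1: trace
        "pb_scope": field(value, 4, 6),
        "counter_address": field(value, 8, 50),  # raw field, not a full address
    }


# Hypothetical value: enabled, trace mode, all other fields zero.
sample = (1 << 63) | (1 << 62)
print(decode_ldbar(sample))
```

Under the same assumption, clearing bit 1 of the sample value would describe a thread configured for accumulation mode.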
| 128 | +TRACE_IMC_SCOM bit representation |
| 129 | +--------------------------------- |
| 130 | + |
| 131 | +    +-------+------------+ |
| 132 | +    | 0:1   | SAMPSEL    | |
| 133 | +    +-------+------------+ |
| 134 | +    | 2:33  | CPMC_LOAD  | |
| 135 | +    +-------+------------+ |
| 136 | +    | 34:40 | CPMC1SEL   | |
| 137 | +    +-------+------------+ |
| 138 | +    | 41:47 | CPMC2SEL   | |
| 139 | +    +-------+------------+ |
| 140 | +    | 48:50 | BUFFERSIZE | |
| 141 | +    +-------+------------+ |
| 142 | +    | 51:63 | RESERVED   | |
| 143 | +    +-------+------------+ |
| 144 | + |
| 145 | +CPMC_LOAD contains the sampling duration. SAMPSEL and CPMCxSEL determine the |
| 146 | +event to count. BUFFERSIZE indicates the memory range. On each overflow, |
| 147 | +hardware snapshots the program counter along with event counts, updates the |
| 148 | +memory, and reloads the CPMC_LOAD value for the next sampling duration. IMC |
| 149 | +hardware does not support exceptions, so it quietly wraps around when the memory |
| 150 | +buffer reaches its end. |
| 151 | + |
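To make the TRACE_IMC_SCOM layout concrete, the sketch below packs the five named fields into a 64-bit value. As with LDBAR, it assumes bit 0 is the most significant bit, which the table does not state. The helper names `place` and `make_trace_scom` and the field values passed in are hypothetical; the actual encodings for SAMPSEL, CPMCxSEL and BUFFERSIZE are not given in this document.

```python
def place(value, first, last, width=64):
    """Position value in bits first..last (inclusive), counting bit 0 as the MSB."""
    nbits = last - first + 1
    if not 0 <= value < (1 << nbits):
        raise ValueError(f"value {value} does not fit in bits {first}:{last}")
    return value << (width - 1 - last)


def make_trace_scom(sampsel, cpmc_load, cpmc1sel, cpmc2sel, buffersize):
    """Combine the fields from the table; bits 51:63 are reserved and left zero."""
    return (
        place(sampsel, 0, 1)
        | place(cpmc_load, 2, 33)
        | place(cpmc1sel, 34, 40)
        | place(cpmc2sel, 41, 47)
        | place(buffersize, 48, 50)
    )


# Hypothetical field values, chosen only to show where each field lands.
scom = make_trace_scom(sampsel=1, cpmc_load=1, cpmc1sel=1, cpmc2sel=1, buffersize=1)
print(f"{scom:#018x}")
```

The range check in `place` rejects a value that would spill into a neighbouring field, which is the most likely mistake when building such a word by hand.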
| 152 | +*Currently the event monitored for trace-mode is fixed as cycle.* |
| 153 | + |
| 154 | +Trace IMC example usage |
| 155 | +======================= |
| 156 | + |
| 157 | +.. code-block:: sh |
| 158 | +
|
| 159 | + # perf list |
| 160 | + [....] |
| 161 | + trace_imc/trace_cycles/ [Kernel PMU event] |
| 162 | +
|
| 163 | +To record an application/process with the trace-imc event: |
| 164 | + |
| 165 | +.. code-block:: sh |
| 166 | +
|
| 167 | + # perf record -e trace_imc/trace_cycles/ yes > /dev/null |
| 168 | + [ perf record: Woken up 1 times to write data ] |
| 169 | + [ perf record: Captured and wrote 0.012 MB perf.data (21 samples) ] |
| 170 | +
|
| 171 | +The generated `perf.data` file can be read using `perf report`. |
| 172 | + |
| 173 | +Benefits of using IMC trace-mode |
| 174 | +================================ |
| 175 | + |
| 176 | +PMI (Performance Monitoring Interrupt) handling is avoided, since IMC trace |
| 177 | +mode snapshots the program counter and writes it to memory. This also |
| 178 | +provides a way for the operating system to do instruction sampling in real |
| 179 | +time without PMI processing overhead. |
| 180 | + |
| 181 | +Performance data using `perf top` with and without the trace-imc event: |
| 182 | + |
| 183 | +PMI counts when the `perf top` command is executed first without and then with the trace-imc event: |
| 184 | + |
| 185 | +.. code-block:: sh |
| 186 | +
|
| 187 | + # grep PMI /proc/interrupts |
| 188 | + PMI: 0 0 0 0 Performance monitoring interrupts |
| 189 | + # ./perf top |
| 190 | + ... |
| 191 | + # grep PMI /proc/interrupts |
| 192 | + PMI: 39735 8710 17338 17801 Performance monitoring interrupts |
| 193 | + # ./perf top -e trace_imc/trace_cycles/ |
| 194 | + ... |
| 195 | + # grep PMI /proc/interrupts |
| 196 | + PMI: 39735 8710 17338 17801 Performance monitoring interrupts |
| 197 | +
|
| 198 | +
|
| 199 | +That is, the PMI interrupt counts do not increment when using the `trace_imc` event. |