|
| 1 | +======== |
| 2 | +CpuTrace |
| 3 | +======== |
| 4 | + |
| 5 | +CpuTrace is a developer tool that measures the CPU cost of execution. |
| 6 | +It is useful when deciding between algorithms for new code and for |
| 7 | +validating performance enhancements. |
| 8 | +CpuTrace measures CPU instructions, clock cycles, branch mispredictions, |
| 9 | +cache misses and thread reschedules. |
| 10 | + |
| 11 | +Integration into Ceph |
| 12 | +--------------------- |
| 13 | + |
| 14 | +To enable CpuTrace, build with the ``WITH_CPUTRACE`` flag: |
| 15 | + |
| 16 | +.. code-block:: bash |
| 17 | +
|
| 18 | + ./do_cmake.sh -DWITH_CPUTRACE=1 |
| 19 | +
|
| 20 | +Once built with CpuTrace support, you can annotate specific functions |
| 21 | +or code regions using the provided macros and helper classes. |
| 22 | + |
| 23 | +To enable profiling in your code, include the CpuTrace header: |
| 24 | + |
| 25 | +.. code-block:: cpp |
| 26 | +
|
| 27 | + #include "common/cputrace.h" |
| 28 | +
|
| 29 | +Then you can mark functions for profiling using the provided helpers. |
| 30 | + |
| 31 | +Raw counter mode |
| 32 | +---------------- |
| 33 | + |
| 34 | +CpuTrace is using the Linux ``perf_event_open`` syscall. You can use the tool |
| 35 | +as a simple helper to get access to hardware perf counters. |
| 36 | + |
| 37 | +.. code-block:: cpp |
| 38 | +
|
| 39 | + // I am profiling my code and want to know |
| 40 | + // how many clock cycles and how many thread switches it takes |
| 41 | + HW_ctx hw = HW_ctx_empty; |
| 42 | + HW_init(&hw, HW_PROFILE_SWI|HW_PROFILE_CYC); |
| 43 | + sample_t start, end; |
| 44 | + HW_read(&hw, &start); |
| 45 | + // my code starts |
| 46 | + // ..... |
| 47 | + // my code ends |
| 48 | + HW_read(&hw, &end); |
| 49 | + // task_switches = end.swi - start.swi; |
| 50 | + // clock_cycles = end.cyc - start.cyc; |
| 51 | + HW_clean(&hw); |
| 52 | +
|
| 53 | +By inspecting ``task_switches`` and ``clock_cycles`` the developer can learn that |
| 54 | +real clock execution time of 10ms has only 1M clock cycles, but had 2 task switches. |
| 55 | + |
| 56 | +Aggregating samples |
| 57 | +------------------- |
| 58 | + |
| 59 | +A single readout of execution time is usually not enough. We need more samples |
| 60 | +to get a more realistic measurement of actual execution cost. |
| 61 | + |
| 62 | +.. code-block:: cpp |
| 63 | +
|
| 64 | + // a variable to hold my measurement |
| 65 | + static measurement_t my_code_time; |
| 66 | + sample_t start, end, elapsed; |
| 67 | + // hw initialized somewhere else |
| 68 | + HW_read(&hw, &start); |
| 69 | + // my code starts |
| 70 | + // ..... |
| 71 | + // my code ends |
| 72 | + HW_read(&hw, &end); |
| 73 | + elapsed = end - start; |
| 74 | + // add new sample to the whole measurement |
| 75 | + my_code_time.sample(elapsed); |
| 76 | +
|
| 77 | +``measurement_t`` |
| 78 | +----------------- |
| 79 | + |
| 80 | +The ``measurement_t`` type aggregates collected samples and counts the number |
| 81 | +of measurements performed. |
| 82 | + |
| 83 | +It produces summary statistics that include: |
| 84 | + |
| 85 | +- **count** : total number of measurements |
| 86 | +- **average** : mean value across all samples |
| 87 | +- **zero / non-zero split** : how many measurements were exactly zero |
| 88 | + versus greater than zero (only for context switch metrics) |
| 89 | + |
| 90 | +These statistics provide a compact and clear view of performance measurements. |
| 91 | + |
| 92 | +``measurement_t`` can also export results in two formats: |
| 93 | + |
| 94 | +- **Ceph Formatter** (for structured JSON/YAML/XML output): |
| 95 | + |
| 96 | + .. code-block:: cpp |
| 97 | +
|
| 98 | + ceph::Formatter* jf; |
| 99 | + m->dump(jf, HW_PROFILE_CYC|HW_PROFILE_INS); // Select which stats to output |
| 100 | +
|
| 101 | +- **String stream** (for plain-text logging): |
| 102 | + |
| 103 | + .. code-block:: cpp |
| 104 | +
|
| 105 | + std::stringstream ss; |
| 106 | + m->dump_to_stringstream(ss, HW_PROFILE_CYC|HW_PROFILE_INS); // Select which stats to output |
| 107 | + std::cout << ss.str(); |
| 108 | +
|
| 109 | +This makes it easy to either integrate measurements into Ceph’s |
| 110 | +structured output pipeline or dump them as human-readable text for debugging. |
| 111 | + |
| 112 | +RAII samples |
| 113 | +------------ |
| 114 | + |
| 115 | +It is usually most convenient to use RAII to collect samples. |
| 116 | +With RAII, measurement begins automatically when the guard object is created |
| 117 | +and ends when it goes out of scope, so no explicit start/stop calls are required. |
| 118 | + |
| 119 | +The hardware context (``HW_ctx``) must be initialized once before creating |
| 120 | +guards. After initialization, the same context can be reused across multiple |
| 121 | +measurements. |
| 122 | + |
| 123 | +``HW_guard`` takes two arguments: |
| 124 | + |
| 125 | +- ``HW_ctx* ctx`` |
| 126 | + Pointer to the initialized hardware context. |
| 127 | + |
| 128 | +- ``measurement_t* m`` |
| 129 | + Pointer to the measurement object where results will be stored. |
| 130 | + |
| 131 | + |
| 132 | +Example: |
| 133 | + |
| 134 | +.. code-block:: cpp |
| 135 | +
|
| 136 | + // variable to hold measurement results |
| 137 | + static measurement_t my_code_time; |
| 138 | + { |
| 139 | + HW_guard guard(&hw, &my_code_time); |
| 140 | + // code to be measured |
| 141 | + // ... |
| 142 | + } |
| 143 | +
|
| 144 | +Named measurements |
| 145 | +------------------ |
| 146 | + |
| 147 | +Code regions can be measured using a `named guard`. |
| 148 | +Each ``HW_named_guard`` automatically starts measurement at construction and stops when leaving scope. |
| 149 | + |
| 150 | +.. code-block:: cpp |
| 151 | +
|
| 152 | + { |
| 153 | + HW_named_guard("function", &hw); |
| 154 | + // my code starts |
| 155 | + // ... |
| 156 | + // my code ends |
| 157 | + } |
| 158 | +
|
| 159 | +This example records the execution time of ``function``. |
| 160 | + |
| 161 | +The guard requires a pointer to a previously initialized ``HW_ctx``. |
| 162 | +This context must be created and set up (e.g., during program initialization) |
| 163 | +before guards can be used. |
| 164 | + |
| 165 | +Named guards provide a simple and consistent way to track performance metrics. |
| 166 | + |
| 167 | +To later access the collected measurements for a given name, use: |
| 168 | + |
| 169 | +.. code-block:: cpp |
| 170 | +
|
| 171 | + measurement_t* m = get_named_measurement("function"); |
| 172 | + if (m) { |
| 173 | + // inspect m->sum_cyc, m->sum_ins. |
| 174 | + // m->dump_to_stringstream(ss, HW_PROFILE_INS|HW_PROFILE_CYC); |
| 175 | + } |
| 176 | +
|
| 177 | +Admin socket integration |
| 178 | +------------------------ |
| 179 | + |
| 180 | +In addition to direct instrumentation in code, CpuTrace can also be controlled |
| 181 | +at runtime via the admin socket interface. This allows developers to start, |
| 182 | +stop, and inspect profiling in running Ceph daemons without rebuilding or |
| 183 | +restarting them. |
| 184 | + |
| 185 | +To profile a function, annotate it with the provided macros: |
| 186 | + |
| 187 | +.. code-block:: cpp |
| 188 | +
|
| 189 | + HWProfileFunctionF(profile, __func__, |
| 190 | + HW_PROFILE_CYC | HW_PROFILE_CMISS | |
| 191 | + HW_PROFILE_INS | HW_PROFILE_BMISS | |
| 192 | + HW_PROFILE_SWI); |
| 193 | +
|
| 194 | +- ``profile`` is a local variable name for the profiler object and only needs to be unique within the profiling scope. |
| 195 | +- ``__func__`` (or any string you pass as the name) is the unique anchor name for this profiling scope. |
| 196 | + |
| 197 | +Each unique name creates a separate anchor. Reusing the same name in multiple places will trigger an assertion failure. |
| 198 | + |
| 199 | +This macro automatically attaches a profiler to the function scope and |
| 200 | +collects the specified hardware counters each time the function executes. |
| 201 | + |
| 202 | +You can combine any of the available flags: |
| 203 | + |
| 204 | +* ``HW_PROFILE_CYC`` – CPU cycles |
| 205 | +* ``HW_PROFILE_CMISS`` – Cache misses |
| 206 | +* ``HW_PROFILE_BMISS`` – Branch mispredictions |
| 207 | +* ``HW_PROFILE_INS`` – Instructions retired |
| 208 | +* ``HW_PROFILE_SWI`` – Context switches |
| 209 | + |
| 210 | +Available commands: |
| 211 | + |
| 212 | +* ``cputrace start`` – Start profiling with the configured groups/counters |
| 213 | +* ``cputrace stop`` – Stop profiling and freeze results |
| 214 | +* ``cputrace dump`` – Dump all collected metrics (as JSON or plain text) |
| 215 | +* ``cputrace reset`` – Reset all captured data |
| 216 | + |
| 217 | +Profiling counters are cumulative. `cputrace stop` pauses profiling without |
| 218 | +resetting values. `cputrace start` resumes accumulation. Use `cputrace reset` |
| 219 | +to clear all collected metrics. |
| 220 | + |
| 221 | +Example usage from the command line: |
| 222 | + |
| 223 | +.. code-block:: bash |
| 224 | +
|
| 225 | + # Start profiling on OSD.0 |
| 226 | + ceph tell osd.0 cputrace start |
| 227 | +
|
| 228 | + # Stop profiling |
| 229 | + ceph tell osd.0 cputrace stop |
| 230 | +
|
| 231 | + # Dump results |
| 232 | + ceph tell osd.0 cputrace dump |
| 233 | +
|
| 234 | + # Reset counters |
| 235 | + ceph tell osd.0 cputrace reset |
| 236 | +
|
| 237 | +These commands can be repeated multiple times: developers typically |
| 238 | +``start`` before a workload, ``stop`` afterwards, and then ``dump`` the results |
| 239 | +to analyze them. |
| 240 | + |
| 241 | +``cputrace dump`` supports optional arguments to filter by logger or counter, |
| 242 | +so only a subset of metrics can be reported when needed. |
| 243 | + |
| 244 | +``cputrace reset`` clears all data, preparing for a fresh round of profiling. |
| 245 | + |
| 246 | +API Reference |
| 247 | +------------- |
| 248 | + |
| 249 | +Enums |
| 250 | +~~~~~ |
| 251 | + |
| 252 | +.. code-block:: cpp |
| 253 | +
|
| 254 | + enum cputrace_flags { |
| 255 | + HW_PROFILE_SWI = (1ULL << 0), // Context switches |
| 256 | + HW_PROFILE_CYC = (1ULL << 1), // CPU cycles |
| 257 | + HW_PROFILE_CMISS = (1ULL << 2), // Cache misses |
| 258 | + HW_PROFILE_BMISS = (1ULL << 3), // Branch mispredictions |
| 259 | + HW_PROFILE_INS = (1ULL << 4), // Instructions retired |
| 260 | + }; |
| 261 | +
|
| 262 | +The bitwise ``|`` operator may be used to combine these flags. |
| 263 | + |
| 264 | +Data structures |
| 265 | +~~~~~~~~~~~~~~~ |
| 266 | + |
| 267 | +``sample_t`` – holds a single hardware counter snapshot. |
| 268 | + |
| 269 | +.. code-block:: cpp |
| 270 | +
|
| 271 | + struct sample_t { |
| 272 | + uint64_t swi; //context switches |
| 273 | + uint64_t cyc; //clock cycles |
| 274 | + uint64_t cmiss; //cache misses |
| 275 | + uint64_t bmiss; //branch misses |
| 276 | + uint64_t ins; //instructions |
| 277 | + }; |
| 278 | +
|
| 279 | +``measurement_t`` – accumulates multiple samples and computes totals/averages and other |
| 280 | +useful metrics. |
| 281 | + |
| 282 | +.. code-block:: cpp |
| 283 | +
|
| 284 | + struct measurement_t { |
| 285 | + uint64_t call_count = 0; |
| 286 | + uint64_t sample_count = 0; |
| 287 | + uint64_t sum_swi = 0, sum_cyc = 0, sum_cmiss = 0, sum_bmiss = 0, sum_ins = 0; |
| 288 | + uint64_t non_zero_swi_count = 0; |
| 289 | + uint64_t zero_swi_count = 0; |
| 290 | + }; |
| 291 | +
|
| 292 | +
|
| 293 | +``HW_ctx`` – encapsulates perf-event file descriptors for one measurement context. |
| 294 | + |
| 295 | +.. code-block:: cpp |
| 296 | +
|
| 297 | + extern HW_ctx HW_ctx_empty; |
| 298 | +
|
| 299 | +Low-level API |
| 300 | +~~~~~~~~~~~~~ |
| 301 | + |
| 302 | +- ``void HW_init(HW_ctx* ctx, cputrace_flags flags)`` – initialize perf counters. |
| 303 | +- ``void HW_read(HW_ctx* ctx, sample_t* out)`` – read current counter values. |
| 304 | +- ``void HW_clean(HW_ctx* ctx)`` – release perf counters. |
0 commit comments