Skip to content

Commit 4d9ef3f

Browse files
Merge pull request ceph#64331 from Jayaprakash-ibm/wip-cputrace-addition
common: Add cputrace tool for scope-based CPU performance tracing Reviewed-by: Adam Kupczyk <[email protected]>
2 parents 1b27ffe + adde7a5 commit 4d9ef3f

File tree

7 files changed

+1033
-0
lines changed

7 files changed

+1033
-0
lines changed

CMakeLists.txt

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -431,6 +431,8 @@ endif(WITH_LZ4)
431431
CMAKE_DEPENDENT_OPTION(WITH_CEPH_DEBUG_MUTEX "Use debug ceph::mutex with lockdep" ON
432432
"CMAKE_BUILD_TYPE STREQUAL Debug" OFF)
433433

434+
option(WITH_CPUTRACE "Enable Cputrace support via the admin socket" OFF)
435+
434436
#if allocator is set on command line make sure it matches below strings
435437
set(ALLOCATOR "" CACHE STRING
436438
"specify memory allocator to use. currently tcmalloc, tcmalloc_minimal, \

doc/dev/cputrace.rst

Lines changed: 304 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,304 @@
1+
========
2+
CpuTrace
3+
========
4+
5+
CpuTrace is a developer tool that measures the CPU cost of execution.
6+
It is useful when deciding between algorithms for new code and for
7+
validating performance enhancements.
8+
CpuTrace measures CPU instructions, clock cycles, branch mispredictions,
9+
cache misses and thread reschedules.
10+
11+
Integration into Ceph
12+
---------------------
13+
14+
To enable CpuTrace, build with the ``WITH_CPUTRACE`` flag:
15+
16+
.. code-block:: bash
17+
18+
./do_cmake.sh -DWITH_CPUTRACE=1
19+
20+
Once built with CpuTrace support, you can annotate specific functions
21+
or code regions using the provided macros and helper classes.
22+
23+
To enable profiling in your code, include the CpuTrace header:
24+
25+
.. code-block:: cpp
26+
27+
#include "common/cputrace.h"
28+
29+
Then you can mark functions for profiling using the provided helpers.
30+
31+
Raw counter mode
32+
----------------
33+
34+
CpuTrace is using the Linux ``perf_event_open`` syscall. You can use the tool
35+
as a simple helper to get access to hardware perf counters.
36+
37+
.. code-block:: cpp
38+
39+
// I am profiling my code and want to know
40+
// how many clock cycles and how many thread switches it takes
41+
HW_ctx hw = HW_ctx_empty;
42+
HW_init(&hw, HW_PROFILE_SWI|HW_PROFILE_CYC);
43+
sample_t start, end;
44+
HW_read(&hw, &start);
45+
// my code starts
46+
// .....
47+
// my code ends
48+
HW_read(&hw, &end);
49+
// task_switches = end.swi - start.swi;
50+
// clock_cycles = end.cyc - start.cyc;
51+
HW_clean(&hw);
52+
53+
By inspecting ``task_switches`` and ``clock_cycles`` the developer can learn that
54+
real clock execution time of 10ms has only 1M clock cycles, but had 2 task switches.
55+
56+
Aggregating samples
57+
-------------------
58+
59+
A single readout of execution time is usually not enough. We need more samples
60+
to get a more realistic measurement of actual execution cost.
61+
62+
.. code-block:: cpp
63+
64+
// a variable to hold my measurement
65+
static measurement_t my_code_time;
66+
sample_t start, end, elapsed;
67+
// hw initialized somewhere else
68+
HW_read(&hw, &start);
69+
// my code starts
70+
// .....
71+
// my code ends
72+
HW_read(&hw, &end);
73+
elapsed = end - start;
74+
// add new sample to the whole measurement
75+
my_code_time.sample(elapsed);
76+
77+
``measurement_t``
78+
-----------------
79+
80+
The ``measurement_t`` type aggregates collected samples and counts the number
81+
of measurements performed.
82+
83+
It produces summary statistics that include:
84+
85+
- **count** : total number of measurements
86+
- **average** : mean value across all samples
87+
- **zero / non-zero split** : how many measurements were exactly zero
88+
versus greater than zero (only for context switch metrics)
89+
90+
These statistics provide a compact and clear view of performance measurements.
91+
92+
``measurement_t`` can also export results in two formats:
93+
94+
- **Ceph Formatter** (for structured JSON/YAML/XML output):
95+
96+
.. code-block:: cpp
97+
98+
ceph::Formatter* jf;
99+
m->dump(jf, HW_PROFILE_CYC|HW_PROFILE_INS); // Select which stats to output
100+
101+
- **String stream** (for plain-text logging):
102+
103+
.. code-block:: cpp
104+
105+
std::stringstream ss;
106+
m->dump_to_stringstream(ss, HW_PROFILE_CYC|HW_PROFILE_INS); // Select which stats to output
107+
std::cout << ss.str();
108+
109+
This makes it easy to either integrate measurements into Ceph’s
110+
structured output pipeline or dump them as human-readable text for debugging.
111+
112+
RAII samples
113+
------------
114+
115+
It is usually most convenient to use RAII to collect samples.
116+
With RAII, measurement begins automatically when the guard object is created
117+
and ends when it goes out of scope, so no explicit start/stop calls are required.
118+
119+
The hardware context (``HW_ctx``) must be initialized once before creating
120+
guards. After initialization, the same context can be reused across multiple
121+
measurements.
122+
123+
``HW_guard`` takes two arguments:
124+
125+
- ``HW_ctx* ctx``
126+
Pointer to the initialized hardware context.
127+
128+
- ``measurement_t* m``
129+
Pointer to the measurement object where results will be stored.
130+
131+
132+
Example:
133+
134+
.. code-block:: cpp
135+
136+
// variable to hold measurement results
137+
static measurement_t my_code_time;
138+
{
139+
HW_guard guard(&hw, &my_code_time);
140+
// code to be measured
141+
// ...
142+
}
143+
144+
Named measurements
145+
------------------
146+
147+
Code regions can be measured using a `named guard`.
148+
Each ``HW_named_guard`` automatically starts measurement at construction and stops when leaving scope.
149+
150+
.. code-block:: cpp
151+
152+
{
153+
HW_named_guard("function", &hw);
154+
// my code starts
155+
// ...
156+
// my code ends
157+
}
158+
159+
This example records the execution time of ``function``.
160+
161+
The guard requires a pointer to a previously initialized ``HW_ctx``.
162+
This context must be created and set up (e.g., during program initialization)
163+
before guards can be used.
164+
165+
Named guards provide a simple and consistent way to track performance metrics.
166+
167+
To later access the collected measurements for a given name, use:
168+
169+
.. code-block:: cpp
170+
171+
measurement_t* m = get_named_measurement("function");
172+
if (m) {
173+
// inspect m->sum_cyc, m->sum_ins.
174+
// m->dump_to_stringstream(ss, HW_PROFILE_INS|HW_PROFILE_CYC);
175+
}
176+
177+
Admin socket integration
178+
------------------------
179+
180+
In addition to direct instrumentation in code, CpuTrace can also be controlled
181+
at runtime via the admin socket interface. This allows developers to start,
182+
stop, and inspect profiling in running Ceph daemons without rebuilding or
183+
restarting them.
184+
185+
To profile a function, annotate it with the provided macros:
186+
187+
.. code-block:: cpp
188+
189+
HWProfileFunctionF(profile, __func__,
190+
HW_PROFILE_CYC | HW_PROFILE_CMISS |
191+
HW_PROFILE_INS | HW_PROFILE_BMISS |
192+
HW_PROFILE_SWI);
193+
194+
- ``profile`` is a local variable name for the profiler object and only needs to be unique within the profiling scope.
195+
- ``__func__`` (or any string you pass as the name) is the unique anchor name for this profiling scope.
196+
197+
Each unique name creates a separate anchor. Reusing the same name in multiple places will trigger an assertion failure.
198+
199+
This macro automatically attaches a profiler to the function scope and
200+
collects the specified hardware counters each time the function executes.
201+
202+
You can combine any of the available flags:
203+
204+
* ``HW_PROFILE_CYC`` – CPU cycles
205+
* ``HW_PROFILE_CMISS`` – Cache misses
206+
* ``HW_PROFILE_BMISS`` – Branch mispredictions
207+
* ``HW_PROFILE_INS`` – Instructions retired
208+
* ``HW_PROFILE_SWI`` – Context switches
209+
210+
Available commands:
211+
212+
* ``cputrace start`` – Start profiling with the configured groups/counters
213+
* ``cputrace stop`` – Stop profiling and freeze results
214+
* ``cputrace dump`` – Dump all collected metrics (as JSON or plain text)
215+
* ``cputrace reset`` – Reset all captured data
216+
217+
Profiling counters are cumulative. `cputrace stop` pauses profiling without
218+
resetting values. `cputrace start` resumes accumulation. Use `cputrace reset`
219+
to clear all collected metrics.
220+
221+
Example usage from the command line:
222+
223+
.. code-block:: bash
224+
225+
# Start profiling on OSD.0
226+
ceph tell osd.0 cputrace start
227+
228+
# Stop profiling
229+
ceph tell osd.0 cputrace stop
230+
231+
# Dump results
232+
ceph tell osd.0 cputrace dump
233+
234+
# Reset counters
235+
ceph tell osd.0 cputrace reset
236+
237+
These commands can be repeated multiple times: developers typically
238+
``start`` before a workload, ``stop`` afterwards, and then ``dump`` the results
239+
to analyze them.
240+
241+
``cputrace dump`` supports optional arguments to filter by logger or counter,
242+
so only a subset of metrics can be reported when needed.
243+
244+
``cputrace reset`` clears all data, preparing for a fresh round of profiling.
245+
246+
API Reference
247+
-------------
248+
249+
Enums
250+
~~~~~
251+
252+
.. code-block:: cpp
253+
254+
enum cputrace_flags {
255+
HW_PROFILE_SWI = (1ULL << 0), // Context switches
256+
HW_PROFILE_CYC = (1ULL << 1), // CPU cycles
257+
HW_PROFILE_CMISS = (1ULL << 2), // Cache misses
258+
HW_PROFILE_BMISS = (1ULL << 3), // Branch mispredictions
259+
HW_PROFILE_INS = (1ULL << 4), // Instructions retired
260+
};
261+
262+
The bitwise ``|`` operator may be used to combine these flags.
263+
264+
Data structures
265+
~~~~~~~~~~~~~~~
266+
267+
``sample_t`` – holds a single hardware counter snapshot.
268+
269+
.. code-block:: cpp
270+
271+
struct sample_t {
272+
uint64_t swi; //context switches
273+
uint64_t cyc; //clock cycles
274+
uint64_t cmiss; //cache misses
275+
uint64_t bmiss; //branch misses
276+
uint64_t ins; //instructions
277+
};
278+
279+
``measurement_t`` – accumulates multiple samples and computes totals/averages and other
280+
useful metrics.
281+
282+
.. code-block:: cpp
283+
284+
struct measurement_t {
285+
uint64_t call_count = 0;
286+
uint64_t sample_count = 0;
287+
uint64_t sum_swi = 0, sum_cyc = 0, sum_cmiss = 0, sum_bmiss = 0, sum_ins = 0;
288+
uint64_t non_zero_swi_count = 0;
289+
uint64_t zero_swi_count = 0;
290+
};
291+
292+
293+
``HW_ctx`` – encapsulates perf-event file descriptors for one measurement context.
294+
295+
.. code-block:: cpp
296+
297+
extern HW_ctx HW_ctx_empty;
298+
299+
Low-level API
300+
~~~~~~~~~~~~~
301+
302+
- ``void HW_init(HW_ctx* ctx, cputrace_flags flags)`` – initialize perf counters.
303+
- ``void HW_read(HW_ctx* ctx, sample_t* out)`` – read current counter values.
304+
- ``void HW_clean(HW_ctx* ctx)`` – release perf counters.

src/CMakeLists.txt

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -199,6 +199,10 @@ if(WITH_CEPH_DEBUG_MUTEX)
199199
add_compile_options($<$<COMPILE_LANGUAGE:CXX>:-DCEPH_DEBUG_MUTEX>)
200200
endif()
201201

202+
if(WITH_CPUTRACE)
203+
add_compile_options($<$<COMPILE_LANGUAGE:CXX>:-DWITH_CPUTRACE>)
204+
endif()
205+
202206
include(CheckCCompilerFlag)
203207
if(CMAKE_CXX_COMPILER_ID STREQUAL GNU)
204208
CHECK_C_COMPILER_FLAG(-fstack-protector-strong HAS_STACK_PROTECT)

src/common/CMakeLists.txt

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -112,7 +112,13 @@ set(common_srcs
112112
mclock_common.cc
113113
tcp_info.cc)
114114

115+
if(WITH_CPUTRACE)
116+
list(APPEND common_srcs
117+
cputrace.cc)
118+
endif()
119+
115120
include_directories(${CMAKE_SOURCE_DIR}/src/dmclock/support/src)
121+
116122
if(WITH_SYSTEMD)
117123
list(APPEND common_srcs
118124
Journald.cc)

src/common/ceph_context.cc

Lines changed: 28 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -68,6 +68,10 @@
6868
// for CINIT_FLAGS
6969
#include "common/common_init.h"
7070

71+
#ifdef WITH_CPUTRACE
72+
#include "common/cputrace.h"
73+
#endif
74+
7175
#include <iostream>
7276
#include <pthread.h>
7377

@@ -686,6 +690,24 @@ int CephContext::_do_command(
686690
else if (command == "log reopen") {
687691
_log->reopen_log_file();
688692
}
693+
#ifdef WITH_CPUTRACE
694+
else if (command == "cputrace start") {
695+
cputrace_start(f);
696+
}
697+
else if (command == "cputrace stop") {
698+
cputrace_stop(f);
699+
}
700+
else if (command == "cputrace dump") {
701+
std::string logger;
702+
std::string counter;
703+
cmd_getval(cmdmap, "logger", logger);
704+
cmd_getval(cmdmap, "counter", counter);
705+
cputrace_dump(f, logger, counter);
706+
}
707+
else if (command == "cputrace reset") {
708+
cputrace_reset(f);
709+
}
710+
#endif
689711
else {
690712
ceph_abort_msg("registered under wrong command?");
691713
}
@@ -782,6 +804,12 @@ CephContext::CephContext(uint32_t module_type_,
782804
_admin_socket->register_command("log dump", _admin_hook, "dump recent log entries to log file");
783805
_admin_socket->register_command("log reopen", _admin_hook, "reopen log file");
784806

807+
#ifdef WITH_CPUTRACE
808+
_admin_socket->register_command("cputrace start", _admin_hook, "start cpu profiling");
809+
_admin_socket->register_command("cputrace stop", _admin_hook, "stop cpu profiling");
810+
_admin_socket->register_command("cputrace reset", _admin_hook, "reset cpu profiling");
811+
_admin_socket->register_command("cputrace dump name=logger,type=CephString,req=false name=counter,type=CephString,req=false", _admin_hook, "dump cpu profiling results");
812+
#endif
785813
_crypto_none = CryptoHandler::create(CEPH_CRYPTO_NONE);
786814
_crypto_aes = CryptoHandler::create(CEPH_CRYPTO_AES);
787815
_crypto_random.reset(new CryptoRandom());

0 commit comments

Comments
 (0)