Profiling
pprof helps diagnose which functions use the most CPU.
Note that this works better when compiling the lib together with a test file rather than linking against it.
- compile with `-g` and link with `-lprofiler`
- run with `LD_PRELOAD=/usr/lib/libprofiler.so CPUPROFILE=cpu.profile ./bin`
- run `google-pprof --text ./bin cpu.profile` (switch to `--gv`, `--pdf`, `--svg`, ... depending on what output you prefer)
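
Putting those steps together, here is a minimal sketch of the whole pprof workflow, assuming the library sources live in `lib/` and the test file is `test.cpp` (the file and binary names are assumptions, adjust them to the actual layout):

```sh
# Build the library sources together with the test file, with debug symbols,
# and link against gperftools' profiler (file names are assumptions).
g++ -g -O2 -o bin test.cpp lib/*.cpp -lprofiler

# Run with the CPU profiler enabled; cpu.profile is written when the binary exits.
LD_PRELOAD=/usr/lib/libprofiler.so CPUPROFILE=cpu.profile ./bin

# Text report of the functions using the most CPU
# (use --gv, --pdf or --svg for graphical output).
google-pprof --text ./bin cpu.profile
```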
cachegrind helps diagnose data & instruction cache misses.
Note that this works better when compiling the lib together with a test file rather than linking against it.
- compile with `-g`
- run `valgrind --tool=cachegrind ./bin`
- run `cg_annotate cachegrind.out.xxxx` to get a global report by file
- run `cg_annotate cachegrind.out.xxxx /abs/path/to/file.cpp` to get a line-by-line report for a specific file
- run `cg_annotate cachegrind.out.xxxx --auto=yes` to get a full line-by-line report for all files
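
For reference, a minimal sketch of that cachegrind loop under the same assumptions (`xxxx` stands for the pid of the profiled run):

```sh
# Build the library sources together with the test file, with debug symbols
# (file names are assumptions).
g++ -g -O2 -o bin test.cpp lib/*.cpp

# Run under cachegrind; this writes cachegrind.out.<pid> in the current directory.
valgrind --tool=cachegrind ./bin

# Per-file summary, then a line-by-line report for one file,
# then a full line-by-line report for all files (replace xxxx with the pid).
cg_annotate cachegrind.out.xxxx
cg_annotate cachegrind.out.xxxx /abs/path/to/file.cpp
cg_annotate cachegrind.out.xxxx --auto=yes
```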


The abbreviations in the cachegrind reports are:

- `I`: instruction cache
- `I1`: L1 instruction cache
- `LLi`: last-level instruction cache (called L2 by valgrind; it simulates the last cache level, i.e. L3/L4/... for machines with more than 2 cache levels)
- `D`: data cache
- `D1`: L1 data cache
- `LLd`: last-level data cache (same as `LLi`)
- `r`: read
- `w`: write
- `mr`: miss on read
- `mw`: miss on write
- `Dmr`, `D1mr`, `LLdmr`, ...: combinations of cache type and access type (e.g. `D1mr` is a miss on read for the L1 data cache)
Misses are easy to spot, especially in the line-by-line detail.
If one line has a high rate of data misses, it may be worth reorganizing the way the data is accessed to take advantage of cache locality.
Not sure how to address instruction misses.
callgrind helps diagnose which instructions are executed the most.
Note that this works better when compiling the lib together with a test file rather than linking against it.
- compile with `-g`
- run `valgrind --tool=callgrind ./bin`
- run `callgrind_annotate callgrind.out.xxxx` to get a global report by file
- run `callgrind_annotate callgrind.out.xxxx ./relative/path/to/file.cpp` to get a line-by-line report for a specific file
- run `callgrind_annotate callgrind.out.xxxx --auto=yes` to get a full line-by-line report for all files
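
Again, a minimal sketch of the callgrind loop under the same assumptions (file and binary names are placeholders, `xxxx` is the pid of the run):

```sh
# Build the library sources together with the test file, with debug symbols
# (file names are assumptions).
g++ -g -O2 -o bin test.cpp lib/*.cpp

# Run under callgrind; this writes callgrind.out.<pid> in the current directory.
valgrind --tool=callgrind ./bin

# Per-file summary, then a line-by-line report for one file,
# then a full line-by-line report for all files (replace xxxx with the pid).
callgrind_annotate callgrind.out.xxxx
callgrind_annotate callgrind.out.xxxx ./relative/path/to/file.cpp
callgrind_annotate callgrind.out.xxxx --auto=yes
```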


Interpretation is straightforward here: it simply reports how many times a given instruction has been run.
Optimizing highly called areas or reducing the number of times they are executed is the right way to address it (when relevant, of course: some instructions are called very frequently but have a small impact on the overall runtime, so this should be combined with the pprof report).
kcachegrind is an alternative to cg_annotate and callgrind_annotate for navigating through the reports. It provides a graphical interface that can be much more convenient depending on the context.
- run `kcachegrind callgrind.out.xxx` or `kcachegrind cachegrind.out.xxx`
A top-down approach can be useful to focus on the relevant areas.
Typically:
- identify slow extended APIs based on the Google Test runtimes (see the sketch after this list)
- identify slow components with `pprof`
- once identified, find the root cause using `cachegrind` and `callgrind`
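
As a hedged sketch of that top-down loop, assuming the test binary is a Google Test runner called `bin` and that `SlowSuite` is a hypothetical name for the suite that shows up as slow in the test timings:

```sh
# 1. Spot slow extended APIs from the Google Test timings
#    (the runner prints per-test elapsed time; SlowSuite is a hypothetical name).
./bin --gtest_filter='SlowSuite.*'

# 2. Narrow down to the slow components with pprof.
LD_PRELOAD=/usr/lib/libprofiler.so CPUPROFILE=cpu.profile ./bin --gtest_filter='SlowSuite.*'
google-pprof --text ./bin cpu.profile

# 3. Find the root cause of the hot spots with cachegrind and callgrind.
valgrind --tool=cachegrind ./bin --gtest_filter='SlowSuite.*'
valgrind --tool=callgrind ./bin --gtest_filter='SlowSuite.*'
```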
Need more information? Open an issue.

