Commit ec7bfae

subsystem-bench: cache misses profiling (#2893)
## Why we need it

To provide another level of understanding of why Polkadot's subsystems may perform slower than expected. Cache misses occur when processing large amounts of data, such as during availability recovery.

## Why Cachegrind

Cachegrind has many drawbacks: it is slow, and it uses its own cache simulation, which is very basic. But unlike `perf`, which is a great tool, Cachegrind can run in a virtual machine. This means we can easily run it in remote installations and even use it in CI/CD to catch possible regressions.

Why Cachegrind and not Callgrind, another part of Valgrind? Empirically, profiling simply runs faster with Cachegrind.

## First results

The first results were obtained while testing the approach. Here is an example.

```
$ target/testnet/subsystem-bench --n-cores 10 --cache-misses data-availability-read
$ cat cachegrind_report.txt
I refs: 64,622,081,485
I1 misses: 3,018,168
LLi misses: 437,654
I1 miss rate: 0.00%
LLi miss rate: 0.00%

D refs: 12,161,833,115 (9,868,356,364 rd + 2,293,476,751 wr)
D1 misses: 167,940,701 ( 71,060,073 rd + 96,880,628 wr)
LLd misses: 33,550,018 ( 16,685,853 rd + 16,864,165 wr)
D1 miss rate: 1.4% ( 0.7% + 4.2% )
LLd miss rate: 0.3% ( 0.2% + 0.7% )

LL refs: 170,958,869 ( 74,078,241 rd + 96,880,628 wr)
LL misses: 33,987,672 ( 17,123,507 rd + 16,864,165 wr)
LL miss rate: 0.0% ( 0.0% + 0.7% )
```

The output shows that 1.4% of L1 data cache accesses missed, which is not so bad given that the last-level cache held that data most of the time and missed on only 0.3% of data accesses. The L1 instruction cache miss rate is 0.00%. Annotating the output file with `cg_annotate` shows that most of the misses occur during Reed-Solomon processing, which is expected.
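The percentages in the example report can be reproduced from the raw counters, which makes it easier to see what each rate is relative to. The sketch below is only an illustration: the function name `miss_rate_percent` is hypothetical, and it assumes that the D1 and LLd rates are taken relative to `D refs`, while the overall LL rate is taken relative to all references (`I refs` plus `D refs`). Under those assumptions the rounded values agree with the report.

```rust
/// Miss rate as a percentage of references. Hypothetical helper, not part of the commit.
fn miss_rate_percent(misses: u64, refs: u64) -> f64 {
    if refs == 0 {
        return 0.0;
    }
    misses as f64 / refs as f64 * 100.0
}

fn main() {
    // Counters copied from the example report.
    let i_refs: u64 = 64_622_081_485;
    let i1_misses: u64 = 3_018_168;
    let d_refs: u64 = 12_161_833_115;
    let d1_misses: u64 = 167_940_701;
    let lld_misses: u64 = 33_550_018;
    let ll_misses: u64 = 33_987_672;

    println!("I1 miss rate:  {:.2}%", miss_rate_percent(i1_misses, i_refs));
    println!("D1 miss rate:  {:.1}%", miss_rate_percent(d1_misses, d_refs));
    println!("LLd miss rate: {:.1}%", miss_rate_percent(lld_misses, d_refs));
    // Assumption: the overall last-level rate is relative to instruction
    // plus data references, which is why it rounds to 0.0%.
    println!("LL miss rate:  {:.1}%", miss_rate_percent(ll_misses, i_refs + d_refs));
}
```

Note that the 0.3% LLd figure is a share of all data references, not a share of the accesses that already missed in L1.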
1 parent 82c057e commit ec7bfae

3 files changed

+121
-18
lines changed

polkadot/node/subsystem-bench/README.md

Lines changed: 60 additions & 17 deletions
@@ -117,23 +117,24 @@ used to run a suite of tests defined in a `yaml` file like in this [example](exa
 
 ```
 Options:
-      --network <NETWORK>                              The type of network to be emulated [default: ideal] [possible values:
-                                                       ideal, healthy, degraded]
-      --n-cores <N_CORES>                              Number of cores to fetch availability for [default: 100]
-      --n-validators <N_VALIDATORS>                    Number of validators to fetch chunks from [default: 500]
-      --min-pov-size <MIN_POV_SIZE>                    The minimum pov size in KiB [default: 5120]
-      --max-pov-size <MAX_POV_SIZE>                    The maximum pov size bytes [default: 5120]
-  -n, --num-blocks <NUM_BLOCKS>                        The number of blocks the test is going to run [default: 1]
-  -p, --peer-bandwidth <PEER_BANDWIDTH>                The bandwidth of simulated remote peers in KiB
-  -b, --bandwidth <BANDWIDTH>                          The bandwidth of our simulated node in KiB
-      --peer-error <PEER_ERROR>                        Simulated conection error ratio [0-100]
-      --peer-min-latency <PEER_MIN_LATENCY>            Minimum remote peer latency in milliseconds [0-5000]
-      --peer-max-latency <PEER_MAX_LATENCY>            Maximum remote peer latency in milliseconds [0-5000]
-      --profile                                        Enable CPU Profiling with Pyroscope
-      --pyroscope-url <PYROSCOPE_URL>                  Pyroscope Server URL [default: http://localhost:4040]
-      --pyroscope-sample-rate <PYROSCOPE_SAMPLE_RATE>  Pyroscope Sample Rate [default: 113]
-  -h, --help                                           Print help
-  -V, --version                                        Print version
+      --network <NETWORK>                              The type of network to be emulated [default: ideal] [possible
+                                                       values: ideal, healthy, degraded]
+      --n-cores <N_CORES>                              Number of cores to fetch availability for [default: 100]
+      --n-validators <N_VALIDATORS>                    Number of validators to fetch chunks from [default: 500]
+      --min-pov-size <MIN_POV_SIZE>                    The minimum pov size in KiB [default: 5120]
+      --max-pov-size <MAX_POV_SIZE>                    The maximum pov size bytes [default: 5120]
+  -n, --num-blocks <NUM_BLOCKS>                        The number of blocks the test is going to run [default: 1]
+  -p, --peer-bandwidth <PEER_BANDWIDTH>                The bandwidth of simulated remote peers in KiB
+  -b, --bandwidth <BANDWIDTH>                          The bandwidth of our simulated node in KiB
+      --peer-error <PEER_ERROR>                        Simulated conection error ratio [0-100]
+      --peer-min-latency <PEER_MIN_LATENCY>            Minimum remote peer latency in milliseconds [0-5000]
+      --peer-max-latency <PEER_MAX_LATENCY>            Maximum remote peer latency in milliseconds [0-5000]
+      --profile                                        Enable CPU Profiling with Pyroscope
+      --pyroscope-url <PYROSCOPE_URL>                  Pyroscope Server URL [default: http://localhost:4040]
+      --pyroscope-sample-rate <PYROSCOPE_SAMPLE_RATE>  Pyroscope Sample Rate [default: 113]
+      --cache-misses                                   Enable Cache Misses Profiling with Valgrind. Linux only, Valgrind
+                                                       must be in the PATH
+  -h, --help                                           Print help
 ```
 
 These apply to all test objectives, except `test-sequence` which relies on the values being specified in a file.
@@ -221,6 +222,48 @@ view the test progress in real time by accessing [this link](http://localhost:30
 Now run
 `target/testnet/subsystem-bench test-sequence --path polkadot/node/subsystem-bench/examples/availability_read.yaml`
 and view the metrics in real time and spot differences between different `n_validators` values.
+
+### Profiling cache misses
+
+Cache misses are profiled using Cachegrind, part of Valgrind. Cachegrind runs slowly, and its cache simulation is basic
+and unlikely to reflect the behavior of a modern machine. However, it still represents the general situation with cache
+usage, and more importantly it doesn't require a bare-metal machine to run on, which means it could be run in CI or in
+a remote virtual installation.
+
+To profile cache misses use the `--cache-misses` flag. Cache simulation of current runs tuned for Intel Ice Lake CPU.
+Since the execution will be very slow, it's recommended not to run it together with other profiling and not to take
+benchmark results into account. A report is saved in a file `cachegrind_report.txt`.
+
+Example run results:
+```
+$ target/testnet/subsystem-bench --n-cores 10 --cache-misses data-availability-read
+$ cat cachegrind_report.txt
+I refs: 64,622,081,485
+I1 misses: 3,018,168
+LLi misses: 437,654
+I1 miss rate: 0.00%
+LLi miss rate: 0.00%
+
+D refs: 12,161,833,115 (9,868,356,364 rd + 2,293,476,751 wr)
+D1 misses: 167,940,701 ( 71,060,073 rd + 96,880,628 wr)
+LLd misses: 33,550,018 ( 16,685,853 rd + 16,864,165 wr)
+D1 miss rate: 1.4% ( 0.7% + 4.2% )
+LLd miss rate: 0.3% ( 0.2% + 0.7% )
+
+LL refs: 170,958,869 ( 74,078,241 rd + 96,880,628 wr)
+LL misses: 33,987,672 ( 17,123,507 rd + 16,864,165 wr)
+LL miss rate: 0.0% ( 0.0% + 0.7% )
+```
+
+The results show that 1.4% of the L1 data cache missed, but the last level cache only missed 0.3% of the time.
+Instruction data of the L1 has 0.00%.
+
+Cachegrind writes line-by-line cache profiling information to a file named `cachegrind.out.<pid>`.
+This file is best interpreted with `cg_annotate --auto=yes cachegrind.out.<pid>`. For more information see the
+[cachegrind manual](https://www.cs.cmu.edu/afs/cs.cmu.edu/project/cmt-40/Nice/RuleRefinement/bin/valgrind-3.2.0/docs/html/cg-manual.html).
+
+For finer profiling of cache misses, better use `perf` on a bare-metal machine.
+
 ## Create new test objectives
 
 This tool is intended to make it easy to write new test objectives that focus individual subsystems,
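
Because the README change notes that Cachegrind can run in CI, the summary in `cachegrind_report.txt` could be checked mechanically. The following is a minimal sketch, not part of the commit: the helpers `parse_counter` and `d1_miss_rate_exceeds`, the 2.0% threshold, and the `==12345==` process-id prefix are all assumptions made for illustration. It relies only on the `label: value` layout visible in the example report and tolerates the `==<pid>==` prefix that Valgrind puts on log lines.

```rust
/// Returns the first integer counter on the line labelled `label`,
/// for example `parse_counter(report, "D1 misses")`. Hypothetical helper.
fn parse_counter(report: &str, label: &str) -> Option<u64> {
    report.lines().find_map(|line| {
        let line = line.trim();
        // Drop a leading `==<pid>==` marker if the line has one.
        let line = match line.strip_prefix("==") {
            Some(rest) => rest.split_once("==").map_or(line, |(_, tail)| tail),
            None => line,
        };
        let (name, value) = line.split_once(':')?;
        // Collapse runs of spaces so "D1  misses" matches "D1 misses".
        let name = name.split_whitespace().collect::<Vec<_>>().join(" ");
        if name != label {
            return None;
        }
        // The total comes first; the "(rd + wr)" breakdown is ignored.
        let first = value.split_whitespace().next()?;
        first.replace(',', "").parse().ok()
    })
}

/// `Some(true)` if the D1 miss rate is above `max_percent`,
/// `None` if the report lacks the needed counters.
fn d1_miss_rate_exceeds(report: &str, max_percent: f64) -> Option<bool> {
    let refs = parse_counter(report, "D refs")?;
    let misses = parse_counter(report, "D1 misses")?;
    if refs == 0 {
        return Some(false);
    }
    Some(misses as f64 / refs as f64 * 100.0 > max_percent)
}

fn main() {
    // In practice the text would come from
    // std::fs::read_to_string("cachegrind_report.txt"); a fragment of the
    // example report is embedded here to keep the sketch self-contained.
    let report = "\
==12345== D   refs:      12,161,833,115  (9,868,356,364 rd + 2,293,476,751 wr)
==12345== D1  misses:       167,940,701  (   71,060,073 rd +    96,880,628 wr)
";
    match d1_miss_rate_exceeds(report, 2.0) {
        Some(true) => println!("D1 miss rate is above the threshold"),
        Some(false) => println!("D1 miss rate is within the threshold"),
        None => println!("report does not contain the expected counters"),
    }
}
```

Given how coarse the cache simulation is, a check like this is probably more useful for spotting large regressions than for tracking small changes.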

polkadot/node/subsystem-bench/src/subsystem-bench.rs

Lines changed: 12 additions & 1 deletion
@@ -16,6 +16,7 @@
 
 //! A tool for running subsystem benchmark tests designed for development and
 //! CI regression testing.
+
 use clap::Parser;
 use color_eyre::eyre;
 use pyroscope::PyroscopeAgent;
@@ -27,6 +28,7 @@ use std::{path::Path, time::Duration};
 pub(crate) mod availability;
 pub(crate) mod cli;
 pub(crate) mod core;
+mod valgrind;
 
 use availability::{prepare_test, NetworkEmulation, TestState};
 use cli::TestObjective;
@@ -90,12 +92,21 @@ struct BenchCli {
     /// Pyroscope Sample Rate
     pub pyroscope_sample_rate: u32,
 
+    #[clap(long, default_value_t = false)]
+    /// Enable Cache Misses Profiling with Valgrind. Linux only, Valgrind must be in the PATH
+    pub cache_misses: bool,
+
     #[command(subcommand)]
     pub objective: cli::TestObjective,
 }
 
 impl BenchCli {
     fn launch(self) -> eyre::Result<()> {
+        let is_valgrind_running = valgrind::is_valgrind_running();
+        if !is_valgrind_running && self.cache_misses {
+            return valgrind::relaunch_in_valgrind_mode()
+        }
+
         let agent_running = if self.profile {
             let agent = PyroscopeAgent::builder(self.pyroscope_url.as_str(), "subsystem-bench")
                 .backend(pprof_backend(PprofConfig::new().sample_rate(self.pyroscope_sample_rate)))
@@ -185,7 +196,7 @@ impl BenchCli {
 
         let mut state = TestState::new(&test_config);
         let (mut env, _protocol_config) = prepare_test(test_config, &mut state);
-        // test_config.write_to_disk();
+
         env.runtime()
             .block_on(availability::benchmark_availability_read(&mut env, state));
 

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+// Copyright (C) Parity Technologies (UK) Ltd.
+// This file is part of Polkadot.
+
+// Polkadot is free software: you can redistribute it and/or modify
+// it under the terms of the GNU General Public License as published by
+// the Free Software Foundation, either version 3 of the License, or
+// (at your option) any later version.
+
+// Polkadot is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+// GNU General Public License for more details.
+
+// You should have received a copy of the GNU General Public License
+// along with Polkadot. If not, see <http://www.gnu.org/licenses/>.
+
+use color_eyre::eyre;
+
+/// Show if the app is running under Valgrind
+pub(crate) fn is_valgrind_running() -> bool {
+    match std::env::var("LD_PRELOAD") {
+        Ok(v) => v.contains("valgrind"),
+        Err(_) => false,
+    }
+}
+
+/// Stop execution and relaunch the app under valgrind
+/// Cache configuration used to emulate Intel Ice Lake (size, associativity, line size):
+/// L1 instruction: 32,768 B, 8-way, 64 B lines
+/// L1 data: 49,152 B, 12-way, 64 B lines
+/// Last-level: 2,097,152 B, 16-way, 64 B lines
+pub(crate) fn relaunch_in_valgrind_mode() -> eyre::Result<()> {
+    use std::os::unix::process::CommandExt;
+    let err = std::process::Command::new("valgrind")
+        .arg("--tool=cachegrind")
+        .arg("--cache-sim=yes")
+        .arg("--log-file=cachegrind_report.txt")
+        .arg("--I1=32768,8,64")
+        .arg("--D1=49152,12,64")
+        .arg("--LL=2097152,16,64")
+        .arg("--verbose")
+        .args(std::env::args())
+        .exec();
+
+    Err(eyre::eyre!(
+        "Сannot run Valgrind, check that it is installed and available in the PATH\n{}",
+        err
+    ))
+}
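
The relaunch logic in this file can be exercised without Valgrind installed if the pure parts are separated from the `exec` call. The sketch below is a hedged illustration rather than the commit's code: `preload_mentions_valgrind`, `cache_spec`, and `cachegrind_command` are hypothetical names, and the command is only built and printed, never run. It shows how the `size,associativity,line size` triples correspond to 32 KiB, 48 KiB, and 2 MiB caches, and how the `LD_PRELOAD` check behaves for a given value. The check presumably works because Valgrind preloads its own shared objects into the profiled process.

```rust
use std::process::Command;

/// Same test as `is_valgrind_running`, but on an explicit value so it
/// can be checked without touching the process environment.
fn preload_mentions_valgrind(ld_preload: Option<&str>) -> bool {
    ld_preload.map_or(false, |v| v.contains("valgrind"))
}

/// Formats a Cachegrind cache description: "<size>,<associativity>,<line size>".
fn cache_spec(size_bytes: u32, associativity: u32, line_bytes: u32) -> String {
    format!("{},{},{}", size_bytes, associativity, line_bytes)
}

/// Builds (but does not run) the Valgrind invocation used for the relaunch.
fn cachegrind_command(program_args: &[String]) -> Command {
    let mut cmd = Command::new("valgrind");
    cmd.arg("--tool=cachegrind")
        .arg("--cache-sim=yes")
        .arg("--log-file=cachegrind_report.txt")
        .arg(format!("--I1={}", cache_spec(32 * 1024, 8, 64)))
        .arg(format!("--D1={}", cache_spec(48 * 1024, 12, 64)))
        .arg(format!("--LL={}", cache_spec(2 * 1024 * 1024, 16, 64)))
        .arg("--verbose")
        // The original program path and its arguments follow Valgrind's own flags.
        .args(program_args);
    cmd
}

fn main() {
    // Placeholder argument list; the commit passes std::env::args() instead.
    let program_args = vec![
        "target/testnet/subsystem-bench".to_string(),
        "--cache-misses".to_string(),
        "data-availability-read".to_string(),
    ];
    let cmd = cachegrind_command(&program_args);
    let args: Vec<String> = cmd
        .get_args()
        .map(|a| a.to_string_lossy().into_owned())
        .collect();
    println!("valgrind {}", args.join(" "));
    println!(
        "already under Valgrind: {}",
        preload_mentions_valgrind(std::env::var("LD_PRELOAD").ok().as_deref())
    );
}
```

Because the relaunched process receives the same arguments, including `--cache-misses`, the `is_valgrind_running` check is what prevents it from relaunching itself again.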
