Commit 48c1800

Merge pull request ceph#60822 from perezjosibm/wip-perezjos-balance-cpu
[vstart]: add --crimson-balance-cpu option to set CPU distribution policy
2 parents 0f73780 + 8b264e6 commit 48c1800

11 files changed, +1633 -19 lines changed
Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
# Balance CPU Crimson

----------

We introduced the following utilities to help analyse the performance impact of two strategies for
allocating CPU cores to Seastar reactor threads. This is currently limited to single-host deployments.

- OSD-based: CPU cores from the same NUMA socket are allocated to the same OSD.
  For simplicity, if the OSD id is even, all its reactor threads are allocated to NUMA socket 0;
  if the OSD id is odd, all its reactor threads are allocated to NUMA socket 1.

- NUMA socket-based: CPU cores are allocated evenly from each NUMA socket to the reactors, so
  every OSD ends up with reactors on both NUMA sockets.

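The two strategies can be sketched roughly as follows. This is only an illustration of the idea described above, not the logic of `balance-cpu.py`: the function names, the `sockets` input (one list of CPU ids per NUMA socket) and the round-robin dealing used for the socket-based case are assumptions made for this example.

```python
from typing import Dict, List


def allocate_osd_based(sockets: List[List[int]], num_osd: int, reactors: int) -> Dict[int, List[int]]:
    """Hypothetical sketch: every reactor of an OSD is taken from a single socket."""
    next_free = [0] * len(sockets)  # index of the next unused core on each socket
    allocation: Dict[int, List[int]] = {}
    for osd in range(num_osd):
        s = osd % len(sockets)  # two sockets: even ids -> socket 0, odd ids -> socket 1
        start = next_free[s]
        if start + reactors > len(sockets[s]):
            raise ValueError(f"socket {s} has too few free cores for osd.{osd}")
        allocation[osd] = sockets[s][start:start + reactors]
        next_free[s] = start + reactors
    return allocation


def allocate_socket_based(sockets: List[List[int]], num_osd: int, reactors: int) -> Dict[int, List[int]]:
    """Hypothetical sketch: reactors are dealt round-robin over the sockets."""
    next_free = [0] * len(sockets)
    allocation: Dict[int, List[int]] = {}
    turn = 0  # global counter, so the sockets stay evenly loaded across OSDs
    for osd in range(num_osd):
        cpus: List[int] = []
        for _ in range(reactors):
            s = turn % len(sockets)
            if next_free[s] >= len(sockets[s]):
                raise ValueError(f"socket {s} has no free cores left for osd.{osd}")
            cpus.append(sockets[s][next_free[s]])
            next_free[s] += 1
            turn += 1
        allocation[osd] = cpus
    return allocation
```

On an assumed host with cores 0-5 on socket 0 and 6-11 on socket 1, three OSDs with three reactors each give `{0: [0, 1, 2], 1: [6, 7, 8], 2: [3, 4, 5]}` for the OSD-based sketch and `{0: [0, 6, 1], 1: [7, 2, 8], 2: [3, 9, 4]}` for the socket-based one.
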
A new option `--crimson-balance-cpu <osd|socket>` has been implemented in `vstart.sh` to support these strategies.

Note that, counting the default, there are *three* CPU allocation strategies:

- default (the new flag is not specified): Seastar reactors use CPUs in ascending contiguous order (unbalanced across sockets),
- osd: distribute across sockets uniformly, don't split within an OSD,
- socket: distribute across sockets uniformly, split within an OSD.

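The three strategies differ in two observable properties: how many reactors land on each socket, and whether any OSD has reactors on more than one socket. A minimal sketch for checking both on a given allocation follows; `summarise_allocation` and its inputs (a map of OSD id to CPU ids, and a map of CPU id to socket) are hypothetical names introduced for this example and are not part of the utilities in this change.

```python
from typing import Dict, List, Tuple


def summarise_allocation(
    allocation: Dict[int, List[int]], cpu_to_socket: Dict[int, int]
) -> Tuple[Dict[int, int], List[int]]:
    """Return (reactor count per socket, ids of OSDs split across sockets)."""
    per_socket = {s: 0 for s in sorted(set(cpu_to_socket.values()))}
    split_osds: List[int] = []
    for osd, cpus in sorted(allocation.items()):
        sockets_used = {cpu_to_socket[c] for c in cpus}
        if len(sockets_used) > 1:
            split_osds.append(osd)  # this OSD has reactors on several sockets
        for c in cpus:
            per_socket[cpu_to_socket[c]] += 1
    return per_socket, split_osds
```

For an assumed host with cores 0-9 on socket 0 and 10-19 on socket 1, a contiguous allocation of 3 OSDs with 3 reactors each (`{0: [0, 1, 2], 1: [3, 4, 5], 2: [6, 7, 8]}`) yields `({0: 9, 1: 0}, [])`: everything sits on socket 0 and no OSD is split.
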
The utilities introduced are:

- `balance-cpu.py`: a stand-alone script that produces the list of CPU core ids used by `vstart.sh` when allocating
  Seastar reactor threads. Its input is the .json file produced by `lscpu --json`, which it reads via `lscpu.py`.
- `lscpu.py`: a Python module that parses the .json file created by `lscpu --json` and produces a Python dictionary
  with the NUMA details, that is, the number of sockets and the ranges of CPU core ids (physical and HT siblings).
- `tasksetcpu.py`: a stand-alone script that prints a grid showing the current CPU allocation, useful to quickly
  visualise the allocation strategy.

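To give an idea of what parsing the `lscpu --json` output involves, here is a minimal sketch. It is not the code of `lscpu.py`: the function names are hypothetical, and it assumes the usual layout of a top-level `"lscpu"` list of `{"field", "data"}` entries (optionally nested under `"children"` in newer util-linux releases) with `NUMA nodeN CPU(s):` fields. It treats each NUMA node as one of the "NUMA sockets" referred to in this document.

```python
from typing import Dict, List


def expand_cpu_list(spec: str) -> List[int]:
    """Expand an lscpu-style CPU list such as "0-3,8-11" into individual ids."""
    cpus: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_nodes_from_lscpu(doc: dict) -> Dict[int, List[int]]:
    """Collect the "NUMA nodeN CPU(s):" entries, walking nested "children" if present."""
    nodes: Dict[int, List[int]] = {}

    def walk(entries: list) -> None:
        for entry in entries:
            field = entry.get("field", "")
            if field.startswith("NUMA node") and field.rstrip(":").endswith("CPU(s)"):
                node_id = int(field[len("NUMA node"):].split()[0])
                nodes[node_id] = expand_cpu_list(entry.get("data") or "")
            walk(entry.get("children", []))

    walk(doc.get("lscpu", []))
    return nodes
```

A typical call would be `numa_nodes_from_lscpu(json.load(open("/tmp/numa_nodes.json")))` after `import json`, using the file written by `lscpu --json`.
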
## Usage:

The following is a typical example of creating a cluster with three OSDs and three reactors per OSD, and
the desired CPU allocation policy:

```
# MDS=0 MON=1 OSD=3 MGR=1 /ceph/src/vstart.sh --new -x --localhost --without-dashboard --cyanstore --redirect-output --crimson --crimson-smp 3 --no-restart --crimson-balance-cpu osd
```

The following is the corresponding CPU distribution:

![cyan_3osd_3react_bal_osd](./cyan_3osd_3react_bal_osd.png)

The following snippet shows the typical usage of the `balance-cpu.py` script:

```
lscpu --json > /tmp/numa_nodes.json
python3 ${CEPH_DIR}/../src/tools/contrib/balance-cpu.py -o $CEPH_NUM_OSD -r $crimson_smp \
   -b $balance_strategy -u /tmp/numa_nodes.json > /tmp/numa_args.out
```
* The accepted balance strategies are "osd" and "socket".
* The output file `/tmp/numa_args.out` contains the list of CPU ids that `vstart.sh` consumes to issue the corresponding Ceph configuration commands.

The grid can be printed as follows:

```
[ ! -f "${NUMA_NODES_OUT}" ] && lscpu --json > "${NUMA_NODES_OUT}"
python3 /ceph/src/tools/contrib/tasksetcpu.py -c "$TEST_NAME" -u "${NUMA_NODES_OUT}" -d "${RUN_DIR}"
```

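The output format of `tasksetcpu.py` is not reproduced here, so the following is only a guess at what such a grid could look like. It is a sketch under stated assumptions: `render_grid` is a hypothetical helper that takes an already known allocation (OSD id to CPU ids) rather than inspecting running threads, and prints one row per socket with the owning OSD id in each CPU slot.

```python
from typing import Dict, List


def render_grid(sockets: List[List[int]], allocation: Dict[int, List[int]]) -> str:
    """Return one text row per socket; each cell is the OSD id using that CPU, or '.' if unused."""
    owner = {cpu: osd for osd, cpus in allocation.items() for cpu in cpus}
    rows = []
    for idx, cpus in enumerate(sockets):
        cells = " ".join(str(owner[c]) if c in owner else "." for c in cpus)
        rows.append(f"socket {idx}: {cells}")
    return "\n".join(rows)
```

For an assumed host with cores 0-5 on socket 0 and 6-11 on socket 1, the allocation `{0: [0, 1, 2], 1: [6, 7, 8], 2: [3, 4, 5]}` renders as `socket 0: 0 0 0 2 2 2` followed by `socket 1: 1 1 1 . . .`, which makes it easy to see that no OSD is split across sockets.
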
## Performance

The following charts compare the IOPS of the three CPU allocation policies: default
(contiguous allocation, no balancing), OSD-based, and NUMA socket-based. There does not seem to be
any significant throughput degradation for this small configuration
(3 OSDs, 3 reactors per OSD). However, the OSD-based allocation shows higher memory utilisation than the other
two configurations, which is an interesting finding and requires further investigation.


![cyan_3osd_3react_bal_vs_unbal_4krandread_iops_vs_lat](./cyan_3osd_3react_bal_vs_unbal_4krandread_iops_vs_lat.png)

![cyan_3osd_3react_bal_vs_unbal_4krandread_osd_cpu](./cyan_3osd_3react_bal_vs_unbal_4krandread_osd_cpu.png)

![cyan_3osd_3react_bal_vs_unbal_4krandread_osd_mem](./cyan_3osd_3react_bal_vs_unbal_4krandread_osd_mem.png)

src/stop.sh

Lines changed: 5 additions & 0 deletions
@@ -234,3 +234,8 @@ else
    [ $stop_rgw -eq 1 ] && do_killall radosgw lt-radosgw apache2
    [ $stop_cephadm -eq 1 ] && do_killcephadm
fi

# If --crimson-balance-cpu was used, remove any auxiliary files left (rm -f tolerates no match, so no -f test on the glob is needed):
if [ "$ceph_osd" == "crimson-osd" ]; then
    rm -f /tmp/numa_args_*.out
fi

src/tools/contrib/README.rst

Lines changed: 18 additions & 0 deletions
@@ -8,3 +8,21 @@ Please do not assume any level of support. Your mileage may vary.

Each file's header must include a tracker number and an author signed-off-by
line.


- balance-cpu.py. A utility to distribute the Seastar reactor threads over the
  (physical) CPU cores, according to two strategies:

  - OSD-based (default): allocates all the reactors of the same OSD in the same
    NUMA socket,
  - NUMA socket: distributes the reactors of each OSD evenly across the NUMA sockets
    (normally two), so every OSD ends up with reactors running on both NUMA sockets.

- lscpu.py. A Python module to parse the output of ``lscpu --json`` into a dictionary
  which is used by balance-cpu.py and tasksetcpu.py.

- tasksetcpu.py. A utility to print a grid showing the current CPU core allocation
  of Seastar reactors. Useful to validate that the allocation strategy is correct.

For further details, please see *BalanceCPUCrimson.md* in doc/dev/crimson.

