Skip to content

Commit e9151f5

Browse files
committed
Add benchmark documentation
1 parent 8efffe1 commit e9151f5

File tree

1 file changed

+126
-0
lines changed

1 file changed

+126
-0
lines changed

benches/BENCHMARK.md

Lines changed: 126 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,126 @@
1+
# Benchmark
2+
3+
Benchmarking is hard, especially when it comes to concurrency. This benchmark focuses on **observation latency** under varying levels of contention, comparing three implementations:
4+
5+
- `go_observe`: Reference Go client algorithm (4 atomic RMW)
6+
- `go_observe_no_count`: Go algorithm without `_count` increment (3 atomic RMW)
7+
- `observe`: This algorithm (3 atomic RMW + cache locality)
8+
9+
The benchmark is parameterized with an optional `spin` value. When not `None`, a background thread performs continuous observations interleaved with `std::hint::spin_loop`[^1] called `spin` times.
10+
11+
## Results
12+
13+
### `ubuntu-24.04` (x86-64)[^2]
14+
```
15+
Timer precision: 15 ns
16+
comparison fastest │ slowest │ median │ mean │ samples │ iters
17+
├─ go_observe │ │ │ │ │
18+
│ ├─ None 27.2 ns │ 634.6 ns │ 27.32 ns │ 27.49 ns │ 486604 │ 31142656
19+
│ ├─ Some(0) 27.26 ns │ 725.2 ns │ 218 ns │ 218.5 ns │ 70100 │ 4486400
20+
│ ├─ Some(1) 27.26 ns │ 762.4 ns │ 212.5 ns │ 211.9 ns │ 72258 │ 4624512
21+
│ ├─ Some(2) 27.28 ns │ 887.8 ns │ 202.7 ns │ 202.3 ns │ 75596 │ 4838144
22+
│ ├─ Some(4) 27.26 ns │ 2.184 µs │ 203 ns │ 192.8 ns │ 79169 │ 5066816
23+
│ ├─ Some(8) 27.28 ns │ 13.66 µs │ 124.1 ns │ 125.9 ns │ 119512 │ 7648768
24+
│ ├─ Some(64) 27.26 ns │ 530.5 ns │ 37.65 ns │ 38.16 ns │ 367199 │ 23500736
25+
│ ╰─ Some(1024) 27.2 ns │ 338.2 ns │ 27.34 ns │ 28.24 ns │ 478825 │ 30644800
26+
├─ go_observe_no_count │ │ │ │ │
27+
│ ├─ None 21.46 ns │ 527.5 ns │ 21.78 ns │ 21.93 ns │ 308587 │ 39499136
28+
│ ├─ Some(0) 21.79 ns │ 613.8 ns │ 151 ns │ 149 ns │ 101782 │ 6514048
29+
│ ├─ Some(1) 21.79 ns │ 1.95 µs │ 162.2 ns │ 163.6 ns │ 93007 │ 5952448
30+
│ ├─ Some(2) 21.78 ns │ 630.7 ns │ 169.4 ns │ 170.9 ns │ 89120 │ 5703680
31+
│ ├─ Some(4) 21.81 ns │ 446.2 ns │ 138.9 ns │ 139.4 ns │ 108660 │ 6954240
32+
│ ├─ Some(8) 21.78 ns │ 610.8 ns │ 84.57 ns │ 85.84 ns │ 173218 │ 11085952
33+
│ ├─ Some(64) 21.84 ns │ 1.063 µs │ 28.13 ns │ 28.66 ns │ 474728 │ 30382592
34+
│ ╰─ Some(1024) 21.65 ns │ 553.4 ns │ 21.96 ns │ 22.58 ns │ 578841 │ 37045824
35+
╰─ observe │ │ │ │ │
36+
├─ None 19.01 ns │ 219.3 ns │ 19.26 ns │ 19.33 ns │ 349330 │ 44714240
37+
├─ Some(0) 19.18 ns │ 723.2 ns │ 101.2 ns │ 101.4 ns │ 147809 │ 9459776
38+
├─ Some(1) 19.2 ns │ 528.4 ns │ 103.9 ns │ 103.7 ns │ 144776 │ 9265664
39+
├─ Some(2) 19.08 ns │ 286.9 ns │ 97.25 ns │ 96.95 ns │ 78034 │ 9988352
40+
├─ Some(4) 19.04 ns │ 443.6 ns │ 62.7 ns │ 63.14 ns │ 117845 │ 15084160
41+
├─ Some(8) 19.03 ns │ 189.3 ns │ 40.78 ns │ 41.16 ns │ 176351 │ 22572928
42+
├─ Some(64) 19.18 ns │ 558.7 ns │ 22.06 ns │ 22.34 ns │ 589850 │ 37750400
43+
╰─ Some(1024) 19.03 ns │ 204.7 ns │ 19.26 ns │ 19.5 ns │ 346030 │ 44291840
44+
```
45+
46+
### `ubuntu-24.04-arm` (aarch64)[^2]
47+
```
48+
Timer precision: 24 ns
49+
comparison fastest │ slowest │ median │ mean │ samples │ iters
50+
├─ go_observe │ │ │ │ │
51+
│ ├─ None 20.82 ns │ 380.8 ns │ 21.07 ns │ 21.13 ns │ 332182 │ 42519296
52+
│ ├─ Some(0) 21.01 ns │ 644.7 ns │ 368.4 ns │ 356.5 ns │ 21776 │ 2787328
53+
│ ├─ Some(1) 20.95 ns │ 527.3 ns │ 364.7 ns │ 362.1 ns │ 21439 │ 2744192
54+
│ ├─ Some(2) 21.01 ns │ 699.4 ns │ 298.5 ns │ 295.9 ns │ 26203 │ 3353984
55+
│ ├─ Some(4) 21.01 ns │ 1.606 µs │ 371.9 ns │ 349.5 ns │ 22210 │ 2842880
56+
│ ├─ Some(8) 21.01 ns │ 2.848 µs │ 158.8 ns │ 166.7 ns │ 46233 │ 5917824
57+
│ ├─ Some(64) 20.95 ns │ 1.001 µs │ 40.82 ns │ 40.68 ns │ 181952 │ 23289856
58+
│ ╰─ Some(1024) 20.7 ns │ 388 ns │ 21.14 ns │ 22.12 ns │ 320441 │ 41016448
59+
├─ go_observe_no_count │ │ │ │ │
60+
│ ├─ None 16.29 ns │ 177.3 ns │ 16.86 ns │ 16.88 ns │ 208365 │ 53341440
61+
│ ├─ Some(0) 16.73 ns │ 655.8 ns │ 357.4 ns │ 319.6 ns │ 12152 │ 3110912
62+
│ ├─ Some(1) 16.82 ns │ 432.7 ns │ 309.4 ns │ 304.8 ns │ 12739 │ 3261184
63+
│ ├─ Some(2) 16.39 ns │ 691.3 ns │ 278.6 ns │ 277.1 ns │ 13999 │ 3583744
64+
│ ├─ Some(4) 16.67 ns │ 439.4 ns │ 314.9 ns │ 302.8 ns │ 12820 │ 3281920
65+
│ ├─ Some(8) 16.79 ns │ 530.8 ns │ 99.17 ns │ 100.7 ns │ 38074 │ 9746944
66+
│ ├─ Some(64) 16.82 ns │ 523 ns │ 28.61 ns │ 28.66 ns │ 127945 │ 32753920
67+
│ ╰─ Some(1024) 16.29 ns │ 449.8 ns │ 17.26 ns │ 17.49 ns │ 201792 │ 51658752
68+
╰─ observe │ │ │ │ │
69+
├─ None 15.67 ns │ 467.2 ns │ 16.01 ns │ 16.01 ns │ 218270 │ 55877120
70+
├─ Some(0) 15.76 ns │ 242.8 ns │ 80.79 ns │ 80.73 ns │ 47297 │ 12108032
71+
├─ Some(1) 15.73 ns │ 458.1 ns │ 120 ns │ 116.6 ns │ 32978 │ 8442368
72+
├─ Some(2) 16.01 ns │ 379.6 ns │ 60.82 ns │ 61.01 ns │ 62141 │ 15908096
73+
├─ Some(4) 15.82 ns │ 239.8 ns │ 75.36 ns │ 71.33 ns │ 53379 │ 13665024
74+
├─ Some(8) 15.89 ns │ 383.8 ns │ 51.54 ns │ 51.65 ns │ 72971 │ 18680576
75+
├─ Some(64) 15.86 ns │ 199.2 ns │ 21.14 ns │ 21.33 ns │ 168309 │ 43087104
76+
╰─ Some(1024) 15.67 ns │ 239.2 ns │ 16.07 ns │ 16.34 ns │ 214428 │ 54893568
77+
```
78+
79+
### MacBook Air M3 (aarch64)
80+
```
81+
Timer precision: 41 ns
82+
comparison fastest │ slowest │ median │ mean │ samples │ iters
83+
├─ go_observe │ │ │ │ │
84+
│ ├─ None 4.2 ns │ 39.07 ns │ 4.282 ns │ 4.299 ns │ 167696 │ 171720704
85+
│ ├─ Some(0) 4.363 ns │ 206.5 ns │ 134.8 ns │ 130.3 ns │ 7412 │ 7589888
86+
│ ├─ Some(1) 4.811 ns │ 159.6 ns │ 110.6 ns │ 119.7 ns │ 8061 │ 8254464
87+
│ ├─ Some(2) 4.81 ns │ 272.6 ns │ 224.3 ns │ 215.9 ns │ 4493 │ 4600832
88+
│ ├─ Some(4) 9.856 ns │ 161.2 ns │ 85.66 ns │ 80.67 ns │ 11901 │ 12186624
89+
│ ├─ Some(8) 4.363 ns │ 85.74 ns │ 20.43 ns │ 20.82 ns │ 43984 │ 45039616
90+
│ ├─ Some(64) 4.769 ns │ 25.48 ns │ 5.787 ns │ 5.781 ns │ 137053 │ 140342272
91+
│ ╰─ Some(1024) 4.322 ns │ 18.23 ns │ 4.689 ns │ 4.682 ns │ 161029 │ 164893696
92+
├─ go_observe_no_count │ │ │ │ │
93+
│ ├─ None 4.119 ns │ 13.51 ns │ 4.241 ns │ 4.348 ns │ 171530 │ 175646720
94+
│ ├─ Some(0) 4.729 ns │ 3.035 µs │ 94.28 ns │ 97.19 ns │ 9901 │ 10138624
95+
│ ├─ Some(1) 4.322 ns │ 124.8 ns │ 80.77 ns │ 83.91 ns │ 11439 │ 11713536
96+
│ ├─ Some(2) 4.322 ns │ 111.2 ns │ 75.85 ns │ 73.39 ns │ 13056 │ 13369344
97+
│ ├─ Some(4) 4.282 ns │ 101.8 ns │ 30.32 ns │ 41.17 ns │ 22927 │ 23477248
98+
│ ├─ Some(8) 4.322 ns │ 34.84 ns │ 11.85 ns │ 11.74 ns │ 74222 │ 76003328
99+
│ ├─ Some(64) 4.363 ns │ 39.96 ns │ 5.381 ns │ 5.393 ns │ 143923 │ 147377152
100+
│ ╰─ Some(1024) 4.241 ns │ 37.85 ns │ 4.688 ns │ 4.661 ns │ 161031 │ 164895744
101+
╰─ observe │ │ │ │ │
102+
├─ None 4.241 ns │ 18.84 ns │ 4.648 ns │ 4.624 ns │ 162579 │ 166480896
103+
├─ Some(0) 4.403 ns │ 97.68 µs │ 10.54 ns │ 410.4 ns │ 2405 │ 2462720
104+
├─ Some(1) 4.403 ns │ 29.29 µs │ 22.99 ns │ 990.1 ns │ 993 │ 1016832
105+
├─ Some(2) 4.403 ns │ 12.14 µs │ 27.19 ns │ 43.83 ns │ 21608 │ 22126592
106+
├─ Some(4) 4.363 ns │ 37.97 ns │ 16.65 ns │ 17.82 ns │ 50943 │ 52165632
107+
├─ Some(8) 4.363 ns │ 70.85 ns │ 11.11 ns │ 11.26 ns │ 77296 │ 79151104
108+
├─ Some(64) 4.404 ns │ 39.07 ns │ 5.624 ns │ 5.577 ns │ 139968 │ 143327232
109+
╰─ Some(1024) 4.363 ns │ 44.07 ns │ 4.77 ns │ 4.764 ns │ 158368 │ 162168832
110+
```
111+
112+
## Analysis
113+
114+
Results are **platform-dependent**:
115+
116+
- The **additional atomic RMW** in `go_observe` has a **significant cost** on Ubuntu runners (x86-64 and aarch64), but is **negligible on Apple M3**.
117+
- **Cache locality** provides **consistent gains across all platforms**, reducing the impact of cache line invalidation from the contending thread.
118+
119+
[^1]: On a MacBook Air M3, one `std::hint::spin_loop` call takes ~8 ns.
120+
[^2]: GitHub Actions workflow run: https://github.com/wyfo/split-histogram/actions/runs/18954432694
121+
122+
## Analysis
123+
124+
The **additional atomic RMW** in `go_observe` (the `_count` increment) has a **measurable cost** across **all platforms**, with the sole exception of Apple M3 in uncontended scenario.
125+
126+
**Cache locality**, enabled by grouping all shard counters in a single cache line, delivers **consistent performance improvements across all platforms**, significantly reducing the impact of cache line invalidation triggered by the contending thread.

0 commit comments

Comments
 (0)