
Commit c068149

futex: add benchmark test
This patch adds a new benchmark - misc-futex-perf.cc. The goal is to indirectly
measure the performance of the futex syscall implemented in OSv, compare it to
Linux, and thus guide us in implementing the improvements described in issue #853.
The benchmark program does it by implementing a special mutex - fmutex - based
on the futex syscall, according to the algorithm specified in Ulrich Drepper's
paper "Futexes Are Tricky".

The test is similar to misc-mutex2.cc written by Nadav Har'El. It takes three
parameters: a mandatory number of threads (nthreads), a computation length
(worklen), and an optional number of mutexes (nmutexes), which is equal to 1 by
default. The test groups all threads (nthreads * nmutexes) into nmutexes sets of
nthreads threads; each thread tries in a loop to take its group mutex (one out
of nmutexes), increment the group counter, and then do some short computation of
the specified length outside the lock. The test runs for 30 seconds, and shows
the average total number of lock-protected counter increments per second. The
number of cpus is set by using the '-c' option passed to run.py in the case of
OSv, and by using 'taskset -c 0..n' when running the same program on the host.

The results of the test, showing the number of total increments (across counters
of all groups of threads) per second for both OSv and the Linux host, are below.
The table also shows the total number of futex syscall calls (wake), captured by
adding an atomic counter to the futex implementation in OSv. The run parameters
are nthreads, worklen and nmutexes.

+------------------+-----------------------------+----------------------+
| Run parameters   | On OSv guest                | On Linux host (op/s) |
|                  | (op/s)       (futex called) |                      |
+------------------+-----------------------------+----------------------+
| 1 0 1 (1 cpu)    | 5.1353e+07   0              | 5.21169e+07          |
| 2 0 1 (2 cpus)   | 2.26067e+07  345,745        | 1.78575e+07          |
| 4 0 1 (4 cpus)   | 4.93204e+07  2,342          | 1.41494e+07          |
| 1 500 1 (1 cpu)  | 5.67558e+06  0              | 5.75555e+06          |
| 2 500 1 (2 cpus) | 9.19294e+06  3,618          | 9.78263e+06          |
| 4 500 1 (4 cpus) | 5.65933e+06  38,243         | 6.87465e+06          |
| 4 500 2 (4 cpus) | 8.30834e+06  266            | 1.15537e+07          |
| 4 500 4 (4 cpus) | 1.06216e+07  111            | 1.16908e+07          |
| 4 500 8 (4 cpus) | 1.39291e+07  101            | 1.31845e+07          |
+------------------+-----------------------------+----------------------+

The results are surprising and somewhat confusing. For example, rows 2 and 3
show OSv outperforming Linux by a lot. Also, row 7 (4 500 2) shows OSv
performance ~30% worse even though the number of futex calls is quite low.
Possibly there is a flaw in this test, or there is some other explanation.

Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
Message-Id: <20220907032208.20291-1-jwkozaczuk@gmail.com>
1 parent c8cdbde commit c068149
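
For context on the primitive being measured: the fmutex in the new test
ultimately reduces to two futex(2) operations, one to sleep and one to wake.
Below is a minimal sketch of that pairing, matching the calls made in the patch;
the futex_wait/futex_wake_one helper names are illustrative only and are not
part of the patch.

#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

// Sleep while *addr still equals expected_value. The kernel re-checks the
// value atomically before sleeping, so a concurrent change makes the call
// return immediately instead of blocking.
static void futex_wait(uint32_t *addr, uint32_t expected_value)
{
    syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected_value, 0, 0, 0);
}

// Wake at most one thread currently sleeping in futex_wait() on addr.
static void futex_wake_one(uint32_t *addr)
{
    syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, 1, 0, 0, 0);
}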

2 files changed: +192 -1 lines changed

modules/tests/Makefile

Lines changed: 2 additions & 1 deletion
@@ -134,7 +134,8 @@ tests := tst-pthread.so misc-ramdisk.so tst-vblk.so tst-bsd-evh.so \
 	tst-elf-init.so tst-realloc.so tst-setjmp.so \
 	libtls.so libtls_gold.so tst-tls.so tst-tls-gold.so tst-tls-pie.so \
 	tst-sigaction.so tst-syscall.so tst-ifaddrs.so tst-getdents.so \
-	tst-netlink.so misc-zfs-io.so misc-zfs-arc.so tst-pthread-create.so
+	tst-netlink.so misc-zfs-io.so misc-zfs-arc.so tst-pthread-create.so \
+	misc-futex-perf.so
 # libstatic-thread-variable.so tst-static-thread-variable.so \
 
 ifeq ($(arch),x64)

tests/misc-futex-perf.cc

Lines changed: 190 additions & 0 deletions
@@ -0,0 +1,190 @@
/*
 * Copyright (C) 2022 Waldemar Kozaczuk
 *
 * This work is open source software, licensed under the terms of the
 * BSD license as described in the LICENSE file in the top-level directory.
 */
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>
#include <sys/syscall.h>
#include <sys/sysinfo.h>
#include <linux/futex.h>
#include <atomic>
#include <thread>
#include <chrono>
#include <iostream>
#include <vector>

// This test is based on misc-mutex2.cc written by Nadav Har'El. But unlike
// the other one, it focuses on measuring the performance of the futex()
// syscall implementation. It does so indirectly by implementing a mutex based
// on the futex syscall according to the algorithm specified in Ulrich Drepper's
// paper "Futexes Are Tricky".
// It takes three parameters: a mandatory number of threads (nthreads),
// a computation length (worklen), and an optional number of mutexes (nmutexes).
// The test groups all threads (nthreads * nmutexes) into nmutexes sets
// where nthreads threads loop trying to take the group mutex (one out of nmutexes),
// increment the group counter, and then do some short computation of the
// specified length outside the lock. The test runs for 30 seconds, and
// shows the average number of lock-protected counter increments per second.
// The reason for doing some computation outside the lock is that it makes the
// benchmark more realistic, reduces the level of contention and makes it
// beneficial for the OS to run the different threads on different CPUs:
// Without any computation outside the lock, the best performance will be
// achieved by running all the threads on the same CPU.

// Turn off optimization, as otherwise the compiler will optimize
// out calls to fmutex lock() and unlock() as they seem to do nothing
#pragma GCC optimize("O0")

// Wrapper function that performs the same functionality as described
// in Drepper's paper (see below).
// It atomically compares the value pointed to by addr to the value expected,
// and only if they are equal replaces *addr with desired. In either case it
// returns the value at *addr before the operation.
inline uint32_t cmpxchg(uint32_t *addr, uint32_t expected, uint32_t desired)
{
    uint32_t *expected_addr = &expected;
    __atomic_compare_exchange_n(addr, expected_addr, desired, false, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return *expected_addr;
}
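
// For illustration: if *addr currently holds 'expected' - say UNLOCKED
// (defined below) - then cmpxchg(addr, UNLOCKED, LOCKED_NO_WAITERS) stores
// LOCKED_NO_WAITERS and returns UNLOCKED, so success can be detected by
// comparing the return value with 'expected'. Otherwise *addr is left
// unchanged and its current value is returned.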

enum {
    UNLOCKED = 0,
    LOCKED_NO_WAITERS = 1,
    LOCKED_MAYBE_WAITERS = 2,
};

// This futex-based mutex implementation is based on the example "Mutex, Take 2"
// from Ulrich Drepper's paper "Futexes Are Tricky" (https://dept-info.labri.fr/~denis/Enseignement/2008-IR/Articles/01-futex.pdf)
class fmutex {
public:
    fmutex() : _state(UNLOCKED) {}
    void lock()
    {
        uint32_t c;
        // If the state was UNLOCKED before cmpxchg, we do not have to do anything;
        // just return after having set the state to LOCKED_NO_WAITERS
        if ((c = cmpxchg(&_state, UNLOCKED, LOCKED_NO_WAITERS)) != UNLOCKED) {
            do {
                // It was locked, so let us set the state to LOCKED_MAYBE_WAITERS.
                // It might be already in this state (1st part of the if below), or
                // the state was LOCKED_NO_WAITERS, so let us change it to LOCKED_MAYBE_WAITERS
                if (c == LOCKED_MAYBE_WAITERS ||
                    cmpxchg(&_state, LOCKED_NO_WAITERS, LOCKED_MAYBE_WAITERS) != UNLOCKED) {
                    // Wait until the kernel tells us the state is different from LOCKED_MAYBE_WAITERS
                    syscall(SYS_futex, &_state, FUTEX_WAIT_PRIVATE, LOCKED_MAYBE_WAITERS, 0, 0, 0);
                }
                // At this point we got here either because:
                // 1. the mutex was indeed UNLOCKED (the if condition above was false), or
                // 2. we were woken up while sleeping in the FUTEX_WAIT_PRIVATE syscall.
                // So let us try to lock again. Because we do not know if there are any
                // waiters, we set the state to LOCKED_MAYBE_WAITERS and err on the safe side.
            } while ((c = cmpxchg(&_state, UNLOCKED, LOCKED_MAYBE_WAITERS)) != UNLOCKED);
        }
    }

    void unlock()
    {
        // Let us wake one waiter, but only if the state was LOCKED_MAYBE_WAITERS;
        // otherwise (uncontended case) do nothing
        if (__atomic_fetch_sub(&_state, 1, __ATOMIC_SEQ_CST) != LOCKED_NO_WAITERS) {
            _state = UNLOCKED;
            syscall(SYS_futex, &_state, FUTEX_WAKE_PRIVATE, 1, 0, 0, 0);
        }
    }
private:
    uint32_t _state;
};
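
// Summary of the state transitions implemented above (Drepper's "Mutex, Take 2"):
//   lock, fast path:    UNLOCKED -> LOCKED_NO_WAITERS (no syscall)
//   lock, contended:    state -> LOCKED_MAYBE_WAITERS, FUTEX_WAIT until woken,
//                       then retry UNLOCKED -> LOCKED_MAYBE_WAITERS
//   unlock, fast path:  LOCKED_NO_WAITERS -> UNLOCKED (no syscall)
//   unlock, contended:  LOCKED_MAYBE_WAITERS -> UNLOCKED, then FUTEX_WAKE of
//                       one waiter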

void loop(int iterations)
{
    for (register int i=0; i<iterations; i++) {
        // To force gcc to not optimize this loop away
        asm volatile("" : : : "memory");
    }
}

int main(int argc, char** argv) {
    if (argc <= 2) {
        std::cerr << "Usage: " << argv[0] << " nthreads worklen <nmutexes>\n";
        return 1;
    }
    int nthreads = atoi(argv[1]);
    if (nthreads <= 0) {
        std::cerr << "Usage: " << argv[0] << " nthreads worklen <nmutexes>\n";
        return 2;
    }
    // "worklen" is the amount of work to do in each loop iteration, outside
    // the mutex. This reduces contention, makes the benchmark more
    // realistic, and gives it the theoretic possibility of achieving better
    // benchmark numbers on multiple CPUs (because this "work" is done in
    // parallel).
    int worklen = atoi(argv[2]);
    if (worklen < 0) {
        std::cerr << "Usage: " << argv[0] << " nthreads worklen <nmutexes>\n";
        return 3;
    }

    // "nmutexes" is the number of mutexes the threads will be contending for.
    // We group the threads into sets of nthreads, each set contending on its
    // own mutex to increment the corresponding group counter
    int nmutexes = 1;
    if (argc >= 4) {
        nmutexes = atoi(argv[3]);
        if (nmutexes <= 0)
            nmutexes = 1;
    }

    int concurrency = 0;
    cpu_set_t cs;
    sched_getaffinity(0, sizeof(cs), &cs);
    for (int i = 0; i < get_nprocs(); i++) {
        if (CPU_ISSET(i, &cs)) {
            concurrency++;
        }
    }
    std::cerr << "Running " << (nthreads * nmutexes) << " threads on " <<
        concurrency << " cores with " << nmutexes <<
        " mutexes. Worklen = " << worklen << "\n";

    // Set secs to the desired number of seconds the measurement should take.
    double secs = 30.0;

    // Our mutex-protected operation will be a silly increment of a counter,
    // taking a tiny amount of time, but increments can still contend when
    // attempted very frequently from many cores in parallel.
    // (std::vector rather than a variable-length array, as a VLA may not
    // have an initializer)
    std::vector<long> counters(nmutexes, 0);
    // atomic, since the flag is written and read concurrently by different threads
    std::atomic<bool> done {false};

    std::vector<fmutex> mut(nmutexes);
    std::vector<std::thread> threads;
    for (int m = 0; m < nmutexes; m++) {
        for (int i = 0; i < nthreads; i++) {
            threads.push_back(std::thread([&, m]() {
                while (!done) {
                    mut[m].lock();
                    counters[m]++;
                    mut[m].unlock();
                    loop(worklen);
                }
            }));
        }
    }
    threads.push_back(std::thread([&]() {
        std::this_thread::sleep_for(std::chrono::duration<double>(secs));
        done = true;
    }));
    for (auto &t : threads) {
        t.join();
    }
    long total = 0;
    for (int m = 0; m < nmutexes; m++) {
        total += counters[m];
    }
    std::cout << total << " counted in " << secs << " seconds (" << (total/secs) << " per sec)\n";

    return 0;
}
