Commit 34235a3

Author: Alexei Starovoitov (committed)

Merge branch 'limited-queueing-in-nmi-for-rqspinlock'
Kumar Kartikeya Dwivedi says:

====================
Limited queueing in NMI for rqspinlock

Ritesh reported that he was frequently seeing timeouts in cases which
should have been covered by the AA heuristics. This led to the discovery
of multiple gaps in the current code that could lead to timeouts when AA
heuristics could work to prevent them. More details and investigation
are available in the original threads. [0][1]

This set restores the ability for NMI waiters to queue in the slow path,
and reduces the cases where they would attempt to trylock. However, such
queueing must not happen when interrupting waiters which the NMI itself
depends upon for forward progress; in those cases the trylock fallback
remains, but with a single attempt to avoid aimless attempts to acquire
the lock.

It also closes a possible window in the lock fast path and the unlock
path where NMIs landing between cmpxchg and entry creation, or entry
deletion and unlock, would miss the detection of an AA scenario and end
up timing out. This virtually eliminates all the cases where existing
heuristics can prevent timeouts and quickly recover from a deadlock.

More details are available in the commit logs for each patch.

[0]: https://lore.kernel.org/bpf/CAH6OuBTjG+N=+GGwcpOUbeDN563oz4iVcU3rbse68egp9wj9_A@mail.gmail.com
[1]: https://lore.kernel.org/bpf/[email protected]
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
2 parents bd5bdd2 + 3448375 commit 34235a3

File tree

3 files changed

+108
-76
lines changed


include/asm-generic/rqspinlock.h

Lines changed: 32 additions & 28 deletions
@@ -129,8 +129,8 @@ static __always_inline void release_held_lock_entry(void)
 	 * <error> for lock B
 	 * release_held_lock_entry
 	 *
-	 * try_cmpxchg_acquire for lock A
 	 * grab_held_lock_entry
+	 * try_cmpxchg_acquire for lock A
 	 *
 	 * Lack of any ordering means reordering may occur such that dec, inc
 	 * are done before entry is overwritten. This permits a remote lock
@@ -139,13 +139,8 @@ static __always_inline void release_held_lock_entry(void)
 	 * CPU holds a lock it is attempting to acquire, leading to false ABBA
 	 * diagnosis).
 	 *
-	 * In case of unlock, we will always do a release on the lock word after
-	 * releasing the entry, ensuring that other CPUs cannot hold the lock
-	 * (and make conclusions about deadlocks) until the entry has been
-	 * cleared on the local CPU, preventing any anomalies. Reordering is
-	 * still possible there, but a remote CPU cannot observe a lock in our
-	 * table which it is already holding, since visibility entails our
-	 * release store for the said lock has not retired.
+	 * The case of unlock is treated differently due to NMI reentrancy, see
+	 * comments in res_spin_unlock.
 	 *
 	 * In theory we don't have a problem if the dec and WRITE_ONCE above get
 	 * reordered with each other, we either notice an empty NULL entry on
@@ -175,10 +170,22 @@ static __always_inline int res_spin_lock(rqspinlock_t *lock)
 {
 	int val = 0;
 
-	if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL))) {
-		grab_held_lock_entry(lock);
+	/*
+	 * Grab the deadlock detection entry before doing the cmpxchg, so that
+	 * reentrancy due to NMIs between the succeeding cmpxchg and creation of
+	 * held lock entry can correctly detect an acquisition attempt in the
+	 * interrupted context.
+	 *
+	 * cmpxchg lock A
+	 * <NMI>
+	 *   res_spin_lock(A) --> missed AA, leads to timeout
+	 * </NMI>
+	 * grab_held_lock_entry(A)
+	 */
+	grab_held_lock_entry(lock);
+
+	if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
 		return 0;
-	}
 	return resilient_queued_spin_lock_slowpath(lock, val);
 }
 
@@ -192,28 +199,25 @@ static __always_inline void res_spin_unlock(rqspinlock_t *lock)
 {
 	struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);
 
-	if (unlikely(rqh->cnt > RES_NR_HELD))
-		goto unlock;
-	WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL);
-unlock:
 	/*
-	 * Release barrier, ensures correct ordering. See release_held_lock_entry
-	 * for details. Perform release store instead of queued_spin_unlock,
-	 * since we use this function for test-and-set fallback as well. When we
-	 * have CONFIG_QUEUED_SPINLOCKS=n, we clear the full 4-byte lockword.
+	 * Release barrier, ensures correct ordering. Perform release store
+	 * instead of queued_spin_unlock, since we use this function for the TAS
+	 * fallback as well. When we have CONFIG_QUEUED_SPINLOCKS=n, we clear
+	 * the full 4-byte lockword.
 	 *
-	 * Like release_held_lock_entry, we can do the release before the dec.
-	 * We simply care about not seeing the 'lock' in our table from a remote
-	 * CPU once the lock has been released, which doesn't rely on the dec.
+	 * Perform the smp_store_release before clearing the lock entry so that
+	 * NMIs landing in the unlock path can correctly detect AA issues. The
+	 * opposite order shown below may lead to missed AA checks:
 	 *
-	 * Unlike smp_wmb(), release is not a two way fence, hence it is
-	 * possible for a inc to move up and reorder with our clearing of the
-	 * entry. This isn't a problem however, as for a misdiagnosis of ABBA,
-	 * the remote CPU needs to hold this lock, which won't be released until
-	 * the store below is done, which would ensure the entry is overwritten
-	 * to NULL, etc.
+	 * WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL)
+	 * <NMI>
+	 *   res_spin_lock(A) --> missed AA, leads to timeout
+	 * </NMI>
+	 * smp_store_release(A->locked, 0)
 	 */
 	smp_store_release(&lock->locked, 0);
+	if (likely(rqh->cnt <= RES_NR_HELD))
+		WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL);
 	this_cpu_dec(rqspinlock_held_locks.cnt);
 }

kernel/bpf/rqspinlock.c

Lines changed: 33 additions & 36 deletions
@@ -196,32 +196,21 @@ static noinline int check_deadlock_ABBA(rqspinlock_t *lock, u32 mask)
 	return 0;
 }
 
-static noinline int check_deadlock(rqspinlock_t *lock, u32 mask)
-{
-	int ret;
-
-	ret = check_deadlock_AA(lock);
-	if (ret)
-		return ret;
-	ret = check_deadlock_ABBA(lock, mask);
-	if (ret)
-		return ret;
-
-	return 0;
-}
-
 static noinline int check_timeout(rqspinlock_t *lock, u32 mask,
 				  struct rqspinlock_timeout *ts)
 {
-	u64 time = ktime_get_mono_fast_ns();
 	u64 prev = ts->cur;
+	u64 time;
 
 	if (!ts->timeout_end) {
-		ts->cur = time;
-		ts->timeout_end = time + ts->duration;
+		if (check_deadlock_AA(lock))
+			return -EDEADLK;
+		ts->cur = ktime_get_mono_fast_ns();
+		ts->timeout_end = ts->cur + ts->duration;
 		return 0;
 	}
 
+	time = ktime_get_mono_fast_ns();
 	if (time > ts->timeout_end)
 		return -ETIMEDOUT;
 
@@ -231,7 +220,7 @@ static noinline int check_timeout(rqspinlock_t *lock, u32 mask,
 	 */
 	if (prev + NSEC_PER_MSEC < time) {
 		ts->cur = time;
-		return check_deadlock(lock, mask);
+		return check_deadlock_ABBA(lock, mask);
 	}
 
 	return 0;
@@ -275,6 +264,10 @@ int __lockfunc resilient_tas_spin_lock(rqspinlock_t *lock)
 	int val, ret = 0;
 
 	RES_INIT_TIMEOUT(ts);
+	/*
+	 * The fast path is not invoked for the TAS fallback, so we must grab
+	 * the deadlock detection entry here.
+	 */
 	grab_held_lock_entry(lock);
 
 	/*
@@ -397,10 +390,7 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
 		goto queue;
 	}
 
-	/*
-	 * Grab an entry in the held locks array, to enable deadlock detection.
-	 */
-	grab_held_lock_entry(lock);
+	/* Deadlock detection entry already held after failing fast path. */
 
 	/*
 	 * We're pending, wait for the owner to go away.
@@ -447,12 +437,21 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
 	 * queuing.
 	 */
 queue:
-	lockevent_inc(lock_slowpath);
 	/*
-	 * Grab deadlock detection entry for the queue path.
+	 * Do not queue if we're a waiter and someone is attempting this lock on
+	 * the same CPU. In case of NMIs, this prevents long timeouts where we
+	 * interrupt the pending waiter, and the owner, that will eventually
+	 * signal the head of our queue, both of which are logically but not
+	 * physically part of the queue, hence outside the scope of the idx > 0
+	 * check above for the trylock fallback.
 	 */
-	grab_held_lock_entry(lock);
+	if (check_deadlock_AA(lock)) {
+		ret = -EDEADLK;
+		goto err_release_entry;
+	}
 
+	lockevent_inc(lock_slowpath);
+	/* Deadlock detection entry already held after failing fast path. */
 	node = this_cpu_ptr(&rqnodes[0].mcs);
 	idx = node->count++;
 	tail = encode_tail(smp_processor_id(), idx);
@@ -464,19 +463,17 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
 	 * not be nested NMIs taking spinlocks. That may not be true in
 	 * some architectures even though the chance of needing more than
 	 * 4 nodes will still be extremely unlikely. When that happens,
-	 * we fall back to spinning on the lock directly without using
-	 * any MCS node. This is not the most elegant solution, but is
-	 * simple enough.
+	 * we fall back to attempting a trylock operation without using
+	 * any MCS node. Unlike qspinlock which cannot fail, we have the
+	 * option of failing the slow path, and under contention, such a
+	 * trylock spinning will likely be treated unfairly due to lack of
+	 * queueing, hence do not spin.
 	 */
-	if (unlikely(idx >= _Q_MAX_NODES || in_nmi())) {
+	if (unlikely(idx >= _Q_MAX_NODES || (in_nmi() && idx > 0))) {
 		lockevent_inc(lock_no_node);
-		RES_RESET_TIMEOUT(ts, RES_DEF_TIMEOUT);
-		while (!queued_spin_trylock(lock)) {
-			if (RES_CHECK_TIMEOUT(ts, ret, ~0u)) {
-				lockevent_inc(rqspinlock_lock_timeout);
-				goto err_release_node;
-			}
-			cpu_relax();
+		if (!queued_spin_trylock(lock)) {
+			ret = -EDEADLK;
+			goto err_release_node;
 		}
 		goto release;
 	}

tools/testing/selftests/bpf/test_kmods/bpf_test_rqspinlock.c

Lines changed: 43 additions & 12 deletions
@@ -33,9 +33,16 @@ static const unsigned int rqsl_hist_ms[] = {
 };
 #define RQSL_NR_HIST_BUCKETS ARRAY_SIZE(rqsl_hist_ms)
 
+enum rqsl_context {
+	RQSL_CTX_NORMAL = 0,
+	RQSL_CTX_NMI,
+	RQSL_CTX_MAX,
+};
+
 struct rqsl_cpu_hist {
-	atomic64_t normal[RQSL_NR_HIST_BUCKETS];
-	atomic64_t nmi[RQSL_NR_HIST_BUCKETS];
+	atomic64_t hist[RQSL_CTX_MAX][RQSL_NR_HIST_BUCKETS];
+	atomic64_t success[RQSL_CTX_MAX];
+	atomic64_t failure[RQSL_CTX_MAX];
 };
 
 static DEFINE_PER_CPU(struct rqsl_cpu_hist, rqsl_cpu_hists);
@@ -117,14 +124,18 @@ static u32 rqsl_hist_bucket_idx(u32 delta_ms)
 	return RQSL_NR_HIST_BUCKETS - 1;
 }
 
-static void rqsl_record_lock_time(u64 delta_ns, bool is_nmi)
+static void rqsl_record_lock_result(u64 delta_ns, enum rqsl_context ctx, int ret)
 {
 	struct rqsl_cpu_hist *hist = this_cpu_ptr(&rqsl_cpu_hists);
 	u32 delta_ms = DIV_ROUND_UP_ULL(delta_ns, NSEC_PER_MSEC);
 	u32 bucket = rqsl_hist_bucket_idx(delta_ms);
-	atomic64_t *buckets = is_nmi ? hist->nmi : hist->normal;
+	atomic64_t *buckets = hist->hist[ctx];
 
 	atomic64_inc(&buckets[bucket]);
+	if (!ret)
+		atomic64_inc(&hist->success[ctx]);
+	else
+		atomic64_inc(&hist->failure[ctx]);
 }
 
 static int rqspinlock_worker_fn(void *arg)
@@ -147,7 +158,8 @@ static int rqspinlock_worker_fn(void *arg)
 		}
 		start_ns = ktime_get_mono_fast_ns();
 		ret = raw_res_spin_lock_irqsave(worker_lock, flags);
-		rqsl_record_lock_time(ktime_get_mono_fast_ns() - start_ns, false);
+		rqsl_record_lock_result(ktime_get_mono_fast_ns() - start_ns,
+					RQSL_CTX_NORMAL, ret);
 		mdelay(normal_delay);
 		if (!ret)
 			raw_res_spin_unlock_irqrestore(worker_lock, flags);
@@ -190,7 +202,8 @@ static void nmi_cb(struct perf_event *event, struct perf_sample_data *data,
 	locks = rqsl_get_lock_pair(cpu);
 	start_ns = ktime_get_mono_fast_ns();
 	ret = raw_res_spin_lock_irqsave(locks.nmi_lock, flags);
-	rqsl_record_lock_time(ktime_get_mono_fast_ns() - start_ns, true);
+	rqsl_record_lock_result(ktime_get_mono_fast_ns() - start_ns,
+				RQSL_CTX_NMI, ret);
 
 	mdelay(nmi_delay);
 
@@ -300,12 +313,14 @@ static void rqsl_print_histograms(void)
 		u64 norm_counts[RQSL_NR_HIST_BUCKETS];
 		u64 nmi_counts[RQSL_NR_HIST_BUCKETS];
 		u64 total_counts[RQSL_NR_HIST_BUCKETS];
+		u64 norm_success, nmi_success, success_total;
+		u64 norm_failure, nmi_failure, failure_total;
 		u64 norm_total = 0, nmi_total = 0, total = 0;
 		bool has_slow = false;
 
 		for (i = 0; i < RQSL_NR_HIST_BUCKETS; i++) {
-			norm_counts[i] = atomic64_read(&hist->normal[i]);
-			nmi_counts[i] = atomic64_read(&hist->nmi[i]);
+			norm_counts[i] = atomic64_read(&hist->hist[RQSL_CTX_NORMAL][i]);
+			nmi_counts[i] = atomic64_read(&hist->hist[RQSL_CTX_NMI][i]);
 			total_counts[i] = norm_counts[i] + nmi_counts[i];
 			norm_total += norm_counts[i];
 			nmi_total += nmi_counts[i];
@@ -315,17 +330,33 @@ static void rqsl_print_histograms(void)
 			has_slow = true;
 		}
 
+		norm_success = atomic64_read(&hist->success[RQSL_CTX_NORMAL]);
+		nmi_success = atomic64_read(&hist->success[RQSL_CTX_NMI]);
+		norm_failure = atomic64_read(&hist->failure[RQSL_CTX_NORMAL]);
+		nmi_failure = atomic64_read(&hist->failure[RQSL_CTX_NMI]);
+		success_total = norm_success + nmi_success;
+		failure_total = norm_failure + nmi_failure;
+
 		if (!total)
 			continue;
 
 		if (!has_slow) {
-			pr_err("  cpu%d: total %llu (normal %llu, nmi %llu), all within 0-%ums\n",
-			       cpu, total, norm_total, nmi_total, RQSL_SLOW_THRESHOLD_MS);
+			pr_err("  cpu%d: total %llu (normal %llu, nmi %llu) | "
+			       "success %llu (normal %llu, nmi %llu) | "
+			       "failure %llu (normal %llu, nmi %llu), all within 0-%ums\n",
+			       cpu, total, norm_total, nmi_total,
+			       success_total, norm_success, nmi_success,
+			       failure_total, norm_failure, nmi_failure,
+			       RQSL_SLOW_THRESHOLD_MS);
 			continue;
 		}
 
-		pr_err("  cpu%d: total %llu (normal %llu, nmi %llu)\n",
-		       cpu, total, norm_total, nmi_total);
+		pr_err("  cpu%d: total %llu (normal %llu, nmi %llu) | "
+		       "success %llu (normal %llu, nmi %llu) | "
+		       "failure %llu (normal %llu, nmi %llu)\n",
+		       cpu, total, norm_total, nmi_total,
+		       success_total, norm_success, nmi_success,
+		       failure_total, norm_failure, nmi_failure);
 		for (i = 0; i < RQSL_NR_HIST_BUCKETS; i++) {
 			unsigned int start_ms;
