Commit 412895f

Roman Penyaev authored and torvalds committed
epoll: atomically remove wait entry on wake up
This patch does two things:

- fixes a lost wakeup introduced by commit 339ddb5 ("fs/epoll: remove unnecessary wakeups of nested epoll")
- improves performance of event delivery

The problem is the following: if N (>1) threads are waiting on ep->wq for new events and M (>1) events arrive, it is quite likely that more than one wakeup hits the same wait queue entry, because there is a large window between the __add_wait_queue_exclusive() and the following __remove_wait_queue() calls in the ep_poll() function. This can lead to lost wakeups, because the thread that was woken up may not handle all of the events in ->rdllist. (The problem is described in more detail here: https://lkml.org/lkml/2019/10/7/905)

The idea of this patch is to use init_wait() instead of init_waitqueue_entry(). Internally, init_wait() sets autoremove_wake_function() as the callback, which removes the wait entry atomically (under the wq lock) from the list, so the next incoming wakeup hits the next wait entry in the queue, preventing lost wakeups. The problem is reliably reproduced by the epoll60 test case [1].

Removing the wait entry on wakeup also has performance benefits, because there is no need to take ep->lock and remove the wait entry from the queue after a successful wakeup.
Here is the timing output of the epoll60 test case.

With explicit wakeup from ep_scan_ready_list() (the state of the code prior to 339ddb5):

    real 0m6.970s
    user 0m49.786s
    sys  0m0.113s

After this patch:

    real 0m5.220s
    user 0m36.879s
    sys  0m0.019s

The other test case is stress-epoll [2], where one thread consumes all the events and other threads produce many events.

With explicit wakeup from ep_scan_ready_list() (the state of the code prior to 339ddb5):

    threads  events/ms  run-time ms
          8       5427         1474
         16       6163         2596
         32       6824         4689
         64       7060         9064
        128       6991        18309

After this patch:

    threads  events/ms  run-time ms
          8       5598         1429
         16       7073         2262
         32       7502         4265
         64       7640         8376
        128       7634        16767

("events/ms" represents event bandwidth, so higher is better; "run-time ms" represents overall time spent running the benchmark, so lower is better.)

[1] tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
[2] https://github.com/rouming/test-tools/blob/master/stress-epoll.c

Signed-off-by: Roman Penyaev <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Jason Baron <[email protected]>
Cc: Khazhismel Kumykov <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Heiher <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
1 parent 474328c commit 412895f

File tree

1 file changed (+24, -19 lines)

fs/eventpoll.c

Lines changed: 24 additions & 19 deletions
@@ -1822,7 +1822,6 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 {
 	int res = 0, eavail, timed_out = 0;
 	u64 slack = 0;
-	bool waiter = false;
 	wait_queue_entry_t wait;
 	ktime_t expires, *to = NULL;
@@ -1867,21 +1866,23 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 	 */
 	ep_reset_busy_poll_napi_id(ep);
 
-	/*
-	 * We don't have any available event to return to the caller. We need
-	 * to sleep here, and we will be woken by ep_poll_callback() when events
-	 * become available.
-	 */
-	if (!waiter) {
-		waiter = true;
-		init_waitqueue_entry(&wait, current);
-
+	do {
+		/*
+		 * Internally init_wait() uses autoremove_wake_function(),
+		 * thus wait entry is removed from the wait queue on each
+		 * wakeup. Why it is important? In case of several waiters
+		 * each new wakeup will hit the next waiter, giving it the
+		 * chance to harvest new event. Otherwise wakeup can be
+		 * lost. This is also good performance-wise, because on
+		 * normal wakeup path no need to call __remove_wait_queue()
+		 * explicitly, thus ep->lock is not taken, which halts the
+		 * event delivery.
+		 */
+		init_wait(&wait);
 		write_lock_irq(&ep->lock);
 		__add_wait_queue_exclusive(&ep->wq, &wait);
 		write_unlock_irq(&ep->lock);
-	}
 
-	for (;;) {
 		/*
 		 * We don't want to sleep if the ep_poll_callback() sends us
 		 * a wakeup in between. That's why we set the task state
@@ -1911,10 +1912,20 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 			timed_out = 1;
 			break;
 		}
-	}
+
+		/* We were woken up, thus go and try to harvest some events */
+		eavail = 1;
+
+	} while (0);
 
 	__set_current_state(TASK_RUNNING);
 
+	if (!list_empty_careful(&wait.entry)) {
+		write_lock_irq(&ep->lock);
+		__remove_wait_queue(&ep->wq, &wait);
+		write_unlock_irq(&ep->lock);
+	}
+
 send_events:
 	/*
 	 * Try to transfer events to user space. In case we get 0 events and
@@ -1925,12 +1936,6 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 	    !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
 		goto fetch_events;
 
-	if (waiter) {
-		write_lock_irq(&ep->lock);
-		__remove_wait_queue(&ep->wq, &wait);
-		write_unlock_irq(&ep->lock);
-	}
-
 	return res;
 }
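For context (not part of this diff), the wakeup callback that init_wait() installs is autoremove_wake_function() from kernel/sched/wait.c. Sketched from memory, with the caveat that the exact signature varies across kernel versions, it is essentially:

```
/* Sketch of autoremove_wake_function() (kernel/sched/wait.c); not part of
 * this commit, and details may differ between kernel versions. */
int autoremove_wake_function(struct wait_queue_entry *wq_entry, unsigned mode,
			     int sync, void *key)
{
	int ret = default_wake_function(wq_entry, mode, sync, key);

	/*
	 * On a successful wakeup, unlink the entry under the wq lock the
	 * waker already holds. list_del_init() leaves the entry self-linked,
	 * which is what the list_empty_careful(&wait.entry) check in the
	 * patched ep_poll() relies on.
	 */
	if (ret)
		list_del_init(&wq_entry->entry);
	return ret;
}
```

This is why no separate ep->lock round trip is needed on the normal wakeup path: removal happens atomically inside the wakeup itself.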