Investigate EPOLLEXCLUSIVE #33

@sgerbino

Summary

Investigate using EPOLLEXCLUSIVE to reduce thundering herd overhead when multiple threads call epoll_wait() on the same epoll file descriptor.

Background

Current Implementation

Corosio's epoll scheduler allows multiple threads to call run() on the same io_context, so multiple threads block in epoll_wait() on a shared epoll fd. This provides natural load balancing for I/O events: socket fds are registered edge-triggered, so the kernel typically wakes only one waiting thread per event.

However, the wakeup() mechanism writes to an eventfd to signal waiting threads when:

  • Work is posted via post() or dispatch()
  • The scheduler is stopped
  • A timer deadline changes

The eventfd is level-triggered, so it stays readable until some thread reads it. As a result, every thread blocked in epoll_wait() can be woken by a single wakeup() call, but only one thread actually has work to do. The others acquire the mutex, find no work, and return to epoll_wait(). This is the classic thundering herd problem.

Current Mitigation

The existing implementation accepts this overhead because:

  1. Thundering herd only occurs on explicit wakeup() calls, not on every I/O event
  2. The mutex ensures correct behavior (only one thread processes work)
  3. Modern kernels handle spurious wakeups efficiently

However, in high-throughput scenarios with frequent post() calls and many worker threads, this can cause measurable overhead from:

  • Context switches for all threads
  • Cache line contention on the mutex
  • Increased CPU utilization from spurious wakeups

Proposed Solution: EPOLLEXCLUSIVE

Linux 4.5 introduced EPOLLEXCLUSIVE, a flag that changes wakeup behavior:

struct epoll_event ev;
ev.events = EPOLLIN | EPOLLEXCLUSIVE;
ev.data.fd = event_fd;
epoll_ctl(epoll_fd, EPOLL_CTL_ADD, event_fd, &ev);

When EPOLLEXCLUSIVE is set:

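  • The kernel wakes one or more waiters for that fd instead of all of them; epoll_ctl(2) guarantees at least one, not exactly one
  • If multiple fds become ready, different threads may be woken for different fds
  • Which waiter is chosen is not specified (round-robin or LIFO, implementation-defined)
  • epoll_ctl(2) describes the flag in terms of multiple epoll fds attached to the same target fd; whether it changes anything for several threads sharing one epoll fd is part of what this issue needs to verify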

Application to Corosio

The eventfd used for wakeup signaling is registered at scheduler.cpp:94:

ev.events = EPOLLIN;
ev.data.ptr = nullptr;
if (::epoll_ctl(epoll_fd_, EPOLL_CTL_ADD, event_fd_, &ev) == -1) {
    // error handling
}

The proposed change is to add EPOLLEXCLUSIVE here, with the goal that only one thread wakes on each wakeup() call:

ev.events = EPOLLIN | EPOLLEXCLUSIVE;

Technical Considerations

Kernel Version Detection

EPOLLEXCLUSIVE requires Linux 4.5+. Options for detection:

  1. Compile-time: Check for the EPOLLEXCLUSIVE macro definition (this reflects the build headers, not the running kernel)
  2. Runtime: Attempt registration and fall back on EINVAL

#ifdef EPOLLEXCLUSIVE
    ev.events = EPOLLIN | EPOLLEXCLUSIVE;
#else
    ev.events = EPOLLIN;
#endif

Or with runtime fallback:

ev.events = EPOLLIN | EPOLLEXCLUSIVE;
if (::epoll_ctl(epoll_fd_, EPOLL_CTL_ADD, event_fd_, &ev) == -1) {
    if (errno != EINVAL) {
        // error handling: not a missing-feature failure
    } else {
        // Fallback for kernels that reject EPOLLEXCLUSIVE
        ev.events = EPOLLIN;
        if (::epoll_ctl(epoll_fd_, EPOLL_CTL_ADD, event_fd_, &ev) == -1) {
            // error handling
        }
    }
}
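
The runtime fallback can be wrapped in a small helper that also reports which mode ended up active, which is useful for logging and for the benchmark. This is a minimal sketch, not Corosio code: register_wakeup_fd and wake_mode are hypothetical names, and the #ifndef value is the kernel UAPI constant for builds against older libc headers. One caveat: a kernel that silently ignores the unknown bit instead of returning EINVAL would be reported as exclusive here, and this sketch does not detect that case.

```cpp
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>
#include <cerrno>

// Older libc headers may lack the flag; the value matches the kernel UAPI.
#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1u << 28)
#endif

// Hypothetical result type: which registration mode succeeded.
enum class wake_mode { exclusive, shared, failed };

// Hypothetical helper: registers `fd` for EPOLLIN on `epoll_fd`, preferring
// EPOLLEXCLUSIVE and retrying without it only if the kernel answers EINVAL.
wake_mode register_wakeup_fd(int epoll_fd, int fd) {
    epoll_event ev{};
    ev.data.ptr = nullptr;

    ev.events = EPOLLIN | EPOLLEXCLUSIVE;
    if (::epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fd, &ev) == 0)
        return wake_mode::exclusive;

    // Any error other than EINVAL (EEXIST, EBADF, ...) is a real failure.
    if (errno != EINVAL)
        return wake_mode::failed;

    ev.events = EPOLLIN;
    if (::epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fd, &ev) == 0)
        return wake_mode::shared;

    return wake_mode::failed;
}
```

A caller would treat wake_mode::failed like the existing error path and could log the other two values once at startup.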

Socket Accept Operations

EPOLLEXCLUSIVE is also relevant for accept operations on listening sockets (sockets.hpp:462). When multiple threads wait to accept on the same socket, EPOLLEXCLUSIVE prevents all threads from waking on each incoming connection.

However, this requires careful consideration:

  • Socket registration currently uses edge-triggered mode (EPOLLIN | EPOLLET)
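  • EPOLLEXCLUSIVE may be combined with EPOLLET, but not with EPOLLONESHOT, and it can only be set with EPOLL_CTL_ADD (a later EPOLL_CTL_MOD on that fd fails with EINVAL)
  • Need to verify correct behavior with the one-shot unregister pattern, which would have to rely on EPOLL_CTL_DEL rather than EPOLLONESHOT or EPOLL_CTL_MOD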
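
The flag combinations above can be checked directly against the running kernel before touching the socket code. The sketch below is a hypothetical probe (probe_epoll_ctl is not a Corosio function) that applies one epoll_ctl() operation to a throwaway eventfd and returns the resulting errno. According to epoll_ctl(2), adding EPOLLONESHOT to EPOLLEXCLUSIVE, or issuing EPOLL_CTL_MOD on an fd that was added with EPOLLEXCLUSIVE, should yield EINVAL.

```cpp
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>
#include <cerrno>
#include <cstdint>

// Older libc headers may lack the flag; the value matches the kernel UAPI.
#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1u << 28)
#endif

// Hypothetical probe: returns 0 if epoll_ctl(op) with `events` succeeds on a
// fresh eventfd, otherwise the errno value. For EPOLL_CTL_MOD the fd is first
// added with EPOLLIN | EPOLLEXCLUSIVE so the MOD has something to modify.
int probe_epoll_ctl(int op, std::uint32_t events) {
    int ep = ::epoll_create1(0);
    int fd = ::eventfd(0, EFD_NONBLOCK);
    int result = 0;
    epoll_event ev{};

    if (ep < 0 || fd < 0) {
        result = errno;
    } else {
        if (op == EPOLL_CTL_MOD) {
            ev.events = EPOLLIN | EPOLLEXCLUSIVE;
            if (::epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) == -1)
                result = errno;
        }
        if (result == 0) {
            ev.events = events;
            if (::epoll_ctl(ep, op, fd, &ev) == -1)
                result = errno;
        }
    }

    if (fd >= 0) ::close(fd);
    if (ep >= 0) ::close(ep);
    return result;
}
```

For example, probe_epoll_ctl(EPOLL_CTL_ADD, EPOLLIN | EPOLLET | EPOLLEXCLUSIVE) exercises the combination the accept path would use, and probe_epoll_ctl(EPOLL_CTL_MOD, EPOLLIN) exercises the re-arm case.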

Interaction with Edge-Triggered Mode

The current implementation uses EPOLLET (edge-triggered) for all socket operations. When combining EPOLLEXCLUSIVE with EPOLLET:

  • Wakeup occurs on edge (transition to ready state)
  • Only one thread receives the notification
  • If that thread doesn't fully drain the fd, the data left behind will not trigger another wakeup on its own; a new notification arrives only with the next readiness edge (for example, more data or another connection arriving)

This should be compatible with Corosio's one-shot pattern where fds are unregistered immediately after epoll_wait() returns.
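
Since only one thread is notified per edge, that thread has to consume everything that is available before it waits again. A minimal sketch of such a drain loop follows; it uses read() on a generic non-blocking fd for illustration, whereas the socket code would use its own recv or accept calls. set_nonblocking and drain_fd are hypothetical names.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cerrno>

// Edge-triggered fds must be non-blocking, otherwise the final read() of the
// drain loop would block instead of returning EAGAIN.
bool set_nonblocking(int fd) {
    int flags = ::fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return false;
    return ::fcntl(fd, F_SETFL, flags | O_NONBLOCK) != -1;
}

// Reads until the kernel reports EAGAIN/EWOULDBLOCK (fully drained) or
// end-of-file. Returns the number of bytes consumed, or -1 on a real error.
long drain_fd(int fd) {
    char buf[4096];
    long total = 0;
    for (;;) {
        ssize_t n = ::read(fd, buf, sizeof buf);
        if (n > 0) {
            total += n;
            continue;
        }
        if (n == 0)
            return total;  // peer closed / end-of-file
        if (errno == EINTR)
            continue;      // interrupted by a signal, retry
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;  // nothing left, safe to wait for the next edge
        return -1;
    }
}
```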

Level-Triggered Eventfd

The eventfd used for wakeup is currently level-triggered (no EPOLLET). With EPOLLEXCLUSIVE:

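  • The intent is that one thread wakes per wakeup() signal; because the fd is level-triggered, it stays readable until that thread reads it
  • If multiple wakeup() calls occur before the read, the eventfd counter accumulates and a single read() returns the total and resets it to zero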
  • This matches desired behavior (wake one thread to process queue)
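
The accumulate-then-read-once behavior of the eventfd counter can be sketched in isolation. The function names below are hypothetical stand-ins for the scheduler's wakeup() and its drain step, and the sketch assumes the eventfd was created with EFD_NONBLOCK and without EFD_SEMAPHORE.

```cpp
#include <sys/eventfd.h>
#include <unistd.h>
#include <cstdint>

// Hypothetical stand-in for wakeup(): adds 1 to the eventfd counter.
bool signal_wakeup(int event_fd) {
    std::uint64_t one = 1;
    return ::write(event_fd, &one, sizeof one) ==
           static_cast<ssize_t>(sizeof one);
}

// Reads and resets the counter. Returns how many signals had accumulated,
// or 0 if the eventfd was not readable (read() failed with EAGAIN).
std::uint64_t consume_wakeups(int event_fd) {
    std::uint64_t count = 0;
    ssize_t n = ::read(event_fd, &count, sizeof count);
    return n == static_cast<ssize_t>(sizeof count) ? count : 0;
}
```

With this shape, three signal_wakeup() calls followed by one consume_wakeups() leave the eventfd non-readable again, so a level-triggered registration stops reporting it until the next wakeup().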

Benchmarking Strategy

To measure the impact, create a benchmark that:

  1. Spawns N worker threads calling io_context::run()
  2. Has a producer thread calling post() at high frequency
  3. Measures:
    • Total throughput (posts/second)
    • CPU utilization
    • Context switch rate
    • Latency distribution

Compare results with and without EPOLLEXCLUSIVE.
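
Before building the full throughput benchmark, a much smaller harness can answer the basic question of how many threads actually return from epoll_wait() for a single signal. The sketch below is an assumption-laden starting point, not the proposed benchmark: count_wakeups is a hypothetical name, the 100 ms sleep is a crude way to let waiters block, and the 500 ms timeout only exists so threads that are never woken still exit. Results will vary between runs and kernels.

```cpp
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Older libc headers may lack the flag; the value matches the kernel UAPI.
#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1u << 28)
#endif

// Starts `threads` waiters on one shared epoll fd, signals the eventfd once,
// and returns how many waiters came back from epoll_wait() with an event.
// Returns -1 if setup or the signal itself fails.
int count_wakeups(int threads, bool exclusive) {
    int ep = ::epoll_create1(0);
    int efd = ::eventfd(0, EFD_NONBLOCK);
    epoll_event ev{};
    ev.events = EPOLLIN;
    if (exclusive)
        ev.events |= EPOLLEXCLUSIVE;

    if (ep < 0 || efd < 0 ||
        ::epoll_ctl(ep, EPOLL_CTL_ADD, efd, &ev) == -1) {
        if (efd >= 0) ::close(efd);
        if (ep >= 0) ::close(ep);
        return -1;
    }

    std::atomic<int> woken{0};
    std::vector<std::thread> pool;
    for (int i = 0; i < threads; ++i) {
        pool.emplace_back([&] {
            epoll_event out{};
            // Bounded wait so waiters that are never woken still terminate.
            if (::epoll_wait(ep, &out, 1, 500) > 0) {
                ++woken;
                // Drain as a worker would; later readers just get EAGAIN.
                std::uint64_t value = 0;
                ssize_t ignored = ::read(efd, &value, sizeof value);
                (void)ignored;
            }
        });
    }

    // Crude synchronization: give the waiters time to block in epoll_wait().
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    std::uint64_t one = 1;
    bool sent = ::write(efd, &one, sizeof one) ==
                static_cast<ssize_t>(sizeof one);

    for (auto& t : pool)
        t.join();
    ::close(efd);
    ::close(ep);
    return sent ? woken.load() : -1;
}
```

Calling count_wakeups(N, false) and count_wakeups(N, true) repeatedly and comparing the distributions gives a first indication of whether the flag has any effect when all threads share one epoll fd.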

Compatibility

Requirement     Version
Linux kernel    4.5+
glibc           2.24+
musl            1.1.18+

For older systems, the library should gracefully fall back to standard behavior.

Alternatives Considered

1. Single-Threaded Wakeup Consumer

Designate one thread as the "wakeup handler" that distributes work to others. Rejected because:

  • Adds complexity
  • Creates a bottleneck
  • Doesn't leverage kernel-level load balancing

2. Per-Thread Eventfds

Give each thread its own eventfd and wake threads round-robin. Rejected because:

  • Requires tracking which threads are blocked
  • Adds memory overhead (one eventfd per thread)
  • Complicates the scheduler implementation

3. Condition Variable Signaling

Replace eventfd with pthread condition variables for wakeup. Rejected because:

  • Requires restructuring the event loop
  • Loses the unified epoll-based wait
  • May not integrate well with timer handling
