iree/async/: Proactor-based async I/O and causal frontier scheduling (#23527)
For the humans, from the human: this is a few weeks of deep feature
branch work with a team of @claude's. It's gone through dozens of review
cycles with mixed model teams and had quite a bit of stress testing. The
design is convergent with subsequent work on the iree/net/ layer (which
is built on top of this) as well as the remote HAL (which uses both).
The future AMDGPU backend (and eventually all HAL drivers) will be
natively built on iree/async/ for all their internal scheduling,
enabling us to do distributed async execution of heterogeneous
CPU/NVMe/NIC/NPU/GPU workloads. The existing iree/task/ system will be
rebased on this soonish to replace its current polling infrastructure,
and iree_loop_t will be upgraded to integrate better for wait
operations. For now, this is a complete foundation across our platforms
of interest and enough to unblock the AMDGPU and remote HAL efforts.
---
This PR introduces `iree/async/`, a completion-based (proactor pattern)
async I/O layer that serves as the foundation for IREE's networking,
storage, and distributed scheduling. It depends only on `iree/base/` and
provides the substrate that HAL drivers, networking, task executors, and
the VM runtime will build on.
### Why
ML inference at scale moves large tensors between GPUs, across networks,
and through storage with latency budgets measured in microseconds. The
data that matters — model weights, activations, KV caches — lives in GPU
VRAM. Moving it between machines for distributed inference or to NVMe
for checkpointing should not require the CPU to touch every byte. Modern
hardware can already do this: NICs read directly from GPU VRAM
(GPUDirect RDMA), NVMe controllers write to GPU memory (GPU Direct
Storage), and GPUs access each other's memory across PCIe or NVLink. The
software layer's job is to orchestrate these transfers, not participate
in them.
The existing approach of layering vendor-specific libraries (NCCL, RCCL)
with traditional reactor I/O (select/poll/epoll + read/write) cannot
express the pipelines we need. A reactor tells you "this fd is ready"
and then you make a separate syscall to do the I/O — every transition
costs two syscalls and a copy through kernel buffers. You cannot express
"wait for GPU completion, then send the result over TCP, then write to
NVMe" as a single atomic submission. Layering multiple runtime systems
means multiple threading models, multiple synchronization primitives,
and multiple memory management systems — each adding latency at every
boundary.
A completion-based proactor that handles all I/O through one
submission/completion interface eliminates these boundaries. On
io_uring, an entire pipeline — GPU fence wait, network send from
registered GPU memory, disk write — can execute as linked SQEs in kernel
space with zero userspace transitions between steps. The proactor is not
in the data path; hardware and kernel handle the transfers directly.
### Causal frontier scheduling
Beyond the I/O layer, this PR introduces a causal dependency tracking
system based on vector clock frontiers. Timeline semaphores (the bridge
between GPU queues and async I/O) carry frontier metadata: sparse
vectors of `(axis, epoch)` pairs where each axis identifies a causal
source (a GPU queue, a collective operation, a host thread) and each
epoch marks a position on that timeline.
When a GPU queue signals a semaphore, the signal carries the queue's
current frontier — a compact summary of everything that happened before
the signal. When another queue waits on that semaphore, it inherits the
frontier through a merge (component-wise maximum). Causal knowledge
propagates transitively: if queue C waits on a semaphore signaled by
queue B, which previously waited on a semaphore signaled by queue A,
queue C's frontier reflects A's work without any direct interaction.
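The merge described above can be sketched as plain C. This is an illustrative model only: the type and function names (`frontier_t`, `frontier_merge`, the fixed-capacity layout) are assumptions for the sketch, not the actual `iree/async/frontier.h` API.

```c
#include <stdint.h>
#include <stddef.h>

// Hypothetical sketch of a sparse causal frontier: (axis, epoch) pairs.
#define FRONTIER_MAX_ENTRIES 16

typedef struct {
  uint32_t axis;   // causal source id (a GPU queue, collective channel, ...)
  uint64_t epoch;  // position on that source's timeline
} frontier_entry_t;

typedef struct {
  size_t count;
  frontier_entry_t entries[FRONTIER_MAX_ENTRIES];
} frontier_t;

// Returns the epoch recorded for |axis|, or 0 if the axis is absent
// (epoch 0 meaning "nothing observed from this source yet").
static uint64_t frontier_epoch(const frontier_t* f, uint32_t axis) {
  for (size_t i = 0; i < f->count; ++i) {
    if (f->entries[i].axis == axis) return f->entries[i].epoch;
  }
  return 0;
}

// Inserts or raises an (axis, epoch) entry; epochs never move backwards.
static void frontier_observe(frontier_t* f, uint32_t axis, uint64_t epoch) {
  for (size_t i = 0; i < f->count; ++i) {
    if (f->entries[i].axis == axis) {
      if (epoch > f->entries[i].epoch) f->entries[i].epoch = epoch;
      return;
    }
  }
  if (f->count < FRONTIER_MAX_ENTRIES) {
    f->entries[f->count].axis = axis;
    f->entries[f->count].epoch = epoch;
    ++f->count;
  }
}

// Merge = component-wise maximum: |dst| absorbs everything |src| has seen.
// This is how a waiting queue inherits causal knowledge from a signal.
static void frontier_merge(frontier_t* dst, const frontier_t* src) {
  for (size_t i = 0; i < src->count; ++i) {
    frontier_observe(dst, src->entries[i].axis, src->entries[i].epoch);
  }
}
```

Because merge is a max, it is idempotent and order-independent, which is what makes transitive propagation through chains of semaphores safe.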
This enables three capabilities that binary events and standalone
timeline semaphores cannot provide:
**Wait elision**: When a queue's local frontier already dominates an
operation's dependency frontier, the device wait is skipped entirely.
Sequential single-queue workloads pay zero synchronization cost — every
wait is elided because the queue's own epoch already implies all
prerequisites.
**O(1) buffer reuse**: When a buffer is freed, the deallocating queue's
current frontier becomes the buffer's death frontier. Another queue can
safely reuse the buffer by checking frontier dominance — one comparison
instead of tracking every operation that touched the buffer. A weight
tensor read by hundreds of operations has one death frontier, not
hundreds of per-operation reference counts.
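Both wait elision and buffer reuse reduce to the same single dominance test. A minimal sketch under assumed names (this is not the actual `iree/async/` API): frontier A dominates frontier B when A has observed at least as much of every timeline B mentions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

typedef struct {
  uint32_t axis;
  uint64_t epoch;
} frontier_entry_t;

// Non-owning view over a sorted-or-unsorted list of entries.
typedef struct {
  size_t count;
  const frontier_entry_t* entries;
} frontier_view_t;

static uint64_t frontier_view_epoch(frontier_view_t f, uint32_t axis) {
  for (size_t i = 0; i < f.count; ++i) {
    if (f.entries[i].axis == axis) return f.entries[i].epoch;
  }
  return 0;  // axis unseen: nothing observed from that source
}

// One O(|b|) comparison decides elision/reuse: no per-operation tracking.
// true  -> every prerequisite in |b| is already implied by |a|.
// false -> |a| is missing work that |b| requires; a real wait is needed.
static bool frontier_dominates(frontier_view_t a, frontier_view_t b) {
  for (size_t i = 0; i < b.count; ++i) {
    if (frontier_view_epoch(a, b.entries[i].axis) < b.entries[i].epoch) {
      return false;
    }
  }
  return true;
}
```

For buffer reuse, `b` is the buffer's death frontier and `a` is the reusing queue's current frontier; for wait elision, `b` is the operation's dependency frontier and `a` is the local queue's frontier.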
**Remote pipeline scheduling**: A remote machine receiving a frontier
can locally determine whether prerequisites are satisfied across all
contributing queues — including queues on other machines it has never
communicated with directly — without round-trips to the originating
devices. Entire multi-stage, multi-device pipelines can be submitted
atomically before any work begins, and hardware FIFO ordering ensures
correct execution.
Collective operations (all-reduce, all-gather) compress N device axes
into a single collective channel axis, so tensor parallelism across 8
GPUs costs one frontier entry regardless of device count.
The [async scheduling design
docs](docs/website/docs/developers/design-docs/async-scheduling/)
include an interactive visualizer that renders DAGs, frontier
propagation, and semaphore state across configurable scenarios — from
laptop (3 concurrent models) to datacenter (multi-node MI300X cluster
with RDMA) — with step-through execution showing exactly how frontiers
flow through pipelines.
<img width="1192" height="1986" alt="image"
src="https://github.com/user-attachments/assets/88366b2e-aca1-4c2a-8f90-7e4afea1f4c9"
/>
### What's here
**Core API** (`proactor.h`, `operation.h`, `semaphore.h`, `frontier.h`):
The proactor manages async operation submission and completion dispatch
through a vtable-dispatched interface. Operations are caller-owned,
intrusive structs — no proactor allocation on submit. Semaphores provide
cross-layer timeline synchronization with frontier-carrying signals. All
operations carry status with rich annotations and stack traces; there
are no silent failures.
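The caller-owned intrusive-operation pattern can be illustrated with a toy completion queue. Everything here (names, fields, the queue itself) is an assumption made for the sketch, not the `operation.h` API; the point it demonstrates is that submission links caller storage into an intrusive list with no allocation on the hot path.

```c
#include <stddef.h>

typedef struct my_op my_op_t;
typedef void (*my_op_callback_t)(my_op_t* op, int status);

// Operation storage lives in the caller; the runtime only links it.
struct my_op {
  my_op_t* next;             // intrusive link used by the queue
  my_op_callback_t callback; // fired on the poll thread at completion
  void* user_data;
  int status;
};

typedef struct {
  my_op_t* head;
  my_op_t* tail;
} my_queue_t;

// "Submit": enqueue caller-owned storage; no malloc, no copy.
static void my_queue_push(my_queue_t* q, my_op_t* op) {
  op->next = NULL;
  if (q->tail) q->tail->next = op;
  else q->head = op;
  q->tail = op;
}

// "Poll": drain completions, invoking callbacks in submission order.
// Returns the number of operations completed.
static size_t my_queue_complete_all(my_queue_t* q, int status) {
  size_t n = 0;
  while (q->head) {
    my_op_t* op = q->head;
    q->head = op->next;
    if (!q->head) q->tail = NULL;
    op->status = status;
    if (op->callback) op->callback(op, status);
    ++n;
  }
  return n;
}

// Example callback counting completions for demonstration.
static int g_completed = 0;
static void count_completion(my_op_t* op, int status) {
  (void)op;
  (void)status;
  ++g_completed;
}
```

The cost of this pattern is an ownership rule: the caller must keep the operation struct alive until its completion callback fires, which is why the README's ownership rules matter.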
**Operation types**: Sockets (TCP/UDP/Unix, with accept, connect, recv,
send, sendto, recvfrom, close), files (positioned pread/pwrite with
open, read, write, close), events (cross-thread signaling),
notifications (level-triggered epoch-based wakeup), timers, semaphore
wait/signal, futex wait/wake, sequences (linked operation chains), and
cross-proactor messages. Operations support multishot delivery
(persistent accept/recv) and linked chaining (kernel-side sequences on
io_uring, callback-emulated elsewhere).
**Sockets** (`socket.h`): Immutable configuration at creation
(REUSE_ADDR, REUSE_PORT, NO_DELAY, KEEPALIVE, ZERO_COPY), then
bind/listen synchronously, then all I/O is async. Imported sockets from
existing file descriptors. Sticky failure state — once a socket
encounters an error, subsequent operations complete immediately with the
recorded failure.
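The sticky-failure behavior can be sketched in a few lines (names and the integer-status encoding are illustrative assumptions, not the socket.h API): the first error is latched, and every later submission short-circuits with it.

```c
// Sketch of a sticky failure latch: 0 = OK; the first nonzero error
// is recorded and returned for all subsequent operations.
typedef struct {
  int sticky_status;
} sticky_socket_t;

// Returns the status an operation on |s| completes with. If a failure
// was previously recorded, the operation completes immediately with it
// and never reaches the kernel.
static int sticky_socket_submit(sticky_socket_t* s, int io_result) {
  if (s->sticky_status != 0) return s->sticky_status;  // fail fast
  if (io_result != 0) s->sticky_status = io_result;    // latch first error
  return io_result;
}
```

This keeps error handling in one place: callers observe the original failure cause on every subsequent operation rather than a cascade of secondary errors.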
**Memory registration** (`region.h`, `span.h`, `slab.h`): Registered
memory regions for zero-copy I/O. Buffer registration pins memory and
pre-computes backend handles so I/O operations reference memory by
handle rather than re-mapping on every operation. Scatter-gather spans
are non-owning value types; the proactor retains regions for in-flight
operations automatically. Slab registration for fixed-size slot
allocation with io_uring provided buffer ring integration.
**Relays** (`relay.h`): Declarative source-to-sink event dataflow.
Connect a readable fd or notification epoch advance to an eventfd write
or notification signal. On io_uring, certain source/sink combinations
execute entirely in kernel space via linked SQEs.
**Device fence bridging**: Import sync_file fds from GPU drivers to
advance async semaphores when GPU work completes. Export semaphore
values as sync_file fds for GPU command buffers to wait on. The proactor
bridges between kernel device synchronization and the async scheduling
system, enabling ahead-of-time pipeline construction across GPU and I/O
boundaries.
**Signal handling**: Process-wide signal subscription through the
proactor — signalfd on Linux, self-pipe on other POSIX platforms.
SIGINT, SIGTERM, SIGHUP, SIGQUIT, SIGUSR1, SIGUSR2 dispatched as
callbacks from within poll().
### Platform backends
**io_uring** (Linux 5.1+): The primary production backend. Direct
syscalls, no liburing dependency. Exploits fixed files and registered
buffers (avoid per-op fd lookup and page pinning), provided buffer rings
for kernel-selected multishot recv buffers, linked SQEs for
zero-round-trip operation sequences, zero-copy send (SEND_ZC), MSG_RING
for cross-proactor messaging, futex ops (6.7+) for kernel-side semaphore
waits in link chains, and sync_file fd polling for device fence import.
Submit fills SQEs under a spinlock from any thread; io_uring_enter is
called only from the poll thread (SINGLE_ISSUER).
**POSIX** (Linux epoll, macOS/BSD kqueue, fallback poll()):
Broad-coverage backend with pluggable event notification. Emulates
linked operations, multishot, and other io_uring features with per-step
poll round-trips — functionally equivalent API, same behavioral
contract, higher per-step latency. The proactive scheduling API costs
nothing extra on POSIX while enabling zero-round-trip execution on
io_uring. Platform-default selection: epoll on Linux, kqueue on
macOS/BSD, poll() elsewhere.
**IOCP** (Windows): I/O Completion Ports backend. Closer in behavior to
io_uring than the POSIX backend — completion-based rather than
readiness-based. Socket operations, timer queue, and the full operation
type set.
All backends report capabilities at runtime (`query_capabilities()`).
Callers discover what's available — multishot, fixed files, registered
buffers, linked operations, zero-copy send, dmabuf, device fences,
absolute timeouts, futex operations, cross-proactor messaging — and
adapt their code paths. "Emulated" in the capability matrix means the
API works but uses a software fallback rather than a kernel-optimized
path.
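The discover-then-branch pattern looks roughly like the following. The flag names and helper here are invented for the sketch (the real capability bits live behind `query_capabilities()`, whose exact shape is not shown in this PR text).

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

// Hypothetical capability bits for illustration only.
enum {
  MY_CAP_MULTISHOT = 1u << 0,
  MY_CAP_FIXED_FILES = 1u << 1,
  MY_CAP_ZERO_COPY_SEND = 1u << 2,
};

// True when every bit in |required| is present in |caps|.
static bool my_caps_have(uint32_t caps, uint32_t required) {
  return (caps & required) == required;
}

// Example: pick a receive strategy once at setup time. With native
// multishot a single persistent recv stays armed; otherwise the caller
// re-submits a single-shot recv after each completion.
static const char* choose_recv_path(uint32_t caps) {
  return my_caps_have(caps, MY_CAP_MULTISHOT) ? "multishot" : "rearm";
}
```

Querying once and branching at setup time keeps the per-operation hot path free of capability checks.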
### Testing
A conformance test suite (CTS) validates all backends against shared
test suites. Tests are written once and run against every registered
backend configuration — 5 io_uring configurations with different
capability masks, plus per-platform and per-feature POSIX
configurations, plus IOCP. Tag-based filtering ensures tests only run
against backends that support the features they exercise.
Test suites cover core operations, socket I/O (TCP, UDP, Unix,
multishot, zero-copy), file I/O, events, notifications, semaphores
(async/sync/linked), relays, fences, cancellation, error propagation,
and resource exhaustion. Benchmarks measure dispatch scalability,
sequence overhead, relay fan-out, socket throughput, and event pool
performance.
### Thread safety model
The proactor's event loop is caller-driven: `poll()` has single-thread
ownership, callbacks fire on the poll thread. `submit()`, `cancel()`,
`wake()`, and `send_message()` are thread-safe from any thread.
Semaphore signal/query and event set are thread-safe. Notification
signal is both thread-safe and async-signal-safe. A utility wrapper
(`proactor_thread.h`) provides optional dedicated-thread operation for
applications that want it.
### Design docs
- [`runtime/src/iree/async/README.md`](runtime/src/iree/async/README.md)
— full API documentation with architecture diagrams, ownership rules,
code examples, and the capability matrix
- [`docs/.../async-scheduling/`](docs/website/docs/developers/design-docs/async-scheduling/)
— causal frontier design document with interactive visualizer,
multi-device scheduling scenarios (laptop through datacenter), and
comparison with binary events and standalone timeline semaphores
---------
Co-authored-by: Claude <noreply@anthropic.com>