Use io_uring for sockets on Linux #124374

Open

benaadams wants to merge 258 commits into dotnet:main from benaadams:io_uring

Conversation

@benaadams (Member) commented Feb 13, 2026

Contributes to #753

1. Summary

This document describes the complete, production-grade io_uring socket I/O engine in .NET's System.Net.Sockets layer.

When enabled via DOTNET_SYSTEM_NET_SOCKETS_IO_URING=1 on Linux kernel 6.1+, the engine replaces epoll with a managed io_uring completion-mode backend that:

  • Directly writes SQEs to mmap'd kernel ring buffers from C#
  • Processes CQEs inline on the event loop thread
  • Supports multishot accept, multishot recv with provided buffer rings, zero-copy send (SEND_ZC/SENDMSG_ZC), registered files, registered buffers, adaptive buffer sizing, and SQPOLL kernel-side submission polling
  • Recovers safely from CQ overflow across three discriminated branches
  • Sweeps stale tracked operations after CQ overflow recovery via a delayed-deadline mechanism

The native shim is intentionally minimal - 433 lines of C wrapping the three io_uring syscalls (setup, enter, register) plus eventfd and mmap helpers. All ring management, SQE construction, CQE dispatch, operation lifecycle, feature negotiation, overflow recovery, and SQPOLL wakeup detection live in managed code.

The engine proper is organized as eight partial class files extending SocketAsyncEngine: the main file (SocketAsyncEngine.Linux.cs, 3848 lines) holds ring setup, flag negotiation, CQE drain, SQE prep orchestration, completion slot layout, and the event loop; the remaining seven partials handle ring mmap lifecycle (IoUringRings, 343 lines), completion slot pool management (IoUringSlots, 437 lines), SQE writing (IoUringSqeWriters, 327 lines), completion dispatch (IoUringCompletionDispatch, 668 lines), diagnostics logging (IoUringDiagnostics, 324 lines), configuration resolution (IoUringConfiguration, 128 lines), and debug test hooks (IoUringTestHooks, 214 lines). A separate IoUringTestAccessors.Linux.cs file (938 lines) exposes all test-observable state through strongly-typed accessors. Tests access this surface through InternalTestShims.Linux.cs (644 lines), a centralized reflection shim with [DynamicDependency] annotations for trimmer/AOT safety.

Key metrics:

| Metric | Value |
| --- | --- |
| Partial class files (SocketAsyncEngine) | 9 (main + 8 partials) |
| New managed source lines (socket layer) | ~11,400 |
| Native shim lines | ~433 (C) + 27 (header) |
| New tests | ~132 (ConditionalFact/ConditionalTheory in IoUring.Unix.cs) |
| Test lines | ~6,665 (IoUring.Unix.cs) + 644 (InternalTestShims) |
| Breaking API changes | 0 - purely additive, behind opt-in env var |

2. Architecture

Ring Ownership and Event Loop

The architecture follows the SINGLE_ISSUER contract: exactly one thread - the event loop thread - owns the io_uring instance. All ring mutations (SQE writes, CQ head advances, io_uring_enter calls) happen on this thread. Other threads communicate via two MPSC queues.

```mermaid
graph TD
    WT[Worker Threads] -->|"MpscQueue<IoUringPrepareWorkItem>"| EL[Event Loop Thread]
    WT -->|"MpscQueue<ulong> (cancel)"| EL
    WT -->|"eventfd write (wake)"| EL
    EL -->|"Writes SQEs / Drains CQEs / io_uring_enter"| K[Kernel - io_uring]
    K -->|"CQE completions"| EL
    EL -->|"ThreadPool.QueueUserWorkItem"| TP[ThreadPool]
```

The Thin Native Shim Approach

The native shim (pal_io_uring_shim.c, 433 lines) wraps exactly:

  • io_uring_setup (via syscall(__NR_io_uring_setup, ...) with SYS_io_uring_setup fallback)
  • io_uring_enter (with and without EXT_ARG)
  • io_uring_register
  • mmap / munmap (for ring mapping)
  • eventfd / read / write (for cross-thread wakeup; EINTR-looped)
  • uname (for kernel version detection)

All ring pointer arithmetic, SQE field population, CQE parsing, SQPOLL wakeup detection (via Volatile.Read on the mmap'd SQ flags word), overflow recovery, and operation lifecycle management happen in managed C#. This is deliberate:

  • Managed code is easier to debug, profile, and modify. The JIT can inline hot paths. No P/Invoke on the SQE write path.
  • The shim compiles on any Linux with <linux/io_uring.h> - no liburing dependency.
  • Feature negotiation (flag peeling, opcode probing) is entirely managed and testable.
  • The one cost is exact ABI-level knowledge of kernel structs, mitigated by _Static_assert(IORING_SETUP_CLOEXEC == (1U << 19), ...) in the shim and layout contract tests in C#.

Threading Model

The event loop thread owns:

  • The io_uring ring fd and all mmap'd ring pointers
  • All SQE writes and CQ drains
  • The _completionSlots[] / _completionSlotStorage[] arrays
  • Eventfd registered-file entry management
  • Adaptive buffer sizing evaluation
  • SQPOLL idle detection via SQ_NEED_WAKEUP on the mmap'd SQ flags pointer
  • CQ overflow recovery state machine

Worker threads interact solely through the following (a sketch of the publish pattern follows the list):

  • TryEnqueueIoUringPreparation() -> MPSC prepare queue -> eventfd write
  • TryRequestIoUringCancellation() -> MPSC cancel queue -> eventfd write
  • Volatile.Read on _ioUringTeardownInitiated to avoid publishing work after shutdown
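
The enqueue-and-wake pattern amounts to roughly the following sketch. It is illustrative only: the real engine uses its own MpscQueue<T> and a registered eventfd, while this stand-in uses ConcurrentQueue and a wake delegate.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

sealed class PrepareQueueSketch
{
    private readonly ConcurrentQueue<object> _prepareQueue = new(); // stand-in for MpscQueue<IoUringPrepareWorkItem>
    private readonly Action _writeWakeupEventfd;                    // stand-in for the eventfd write P/Invoke
    private int _teardownInitiated;                                 // mirrors _ioUringTeardownInitiated

    public PrepareQueueSketch(Action writeWakeupEventfd) => _writeWakeupEventfd = writeWakeupEventfd;

    public bool TryEnqueuePreparation(object workItem)
    {
        // Never publish work after teardown has begun.
        if (Volatile.Read(ref _teardownInitiated) != 0)
            return false;

        _prepareQueue.Enqueue(workItem);

        // Wake the event loop; only that thread touches the ring (SINGLE_ISSUER),
        // so the worker never writes SQEs itself.
        _writeWakeupEventfd();
        return true;
    }
}
```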

Partial Class File Organization

| File | Lines | Responsibility |
| --- | --- | --- |
| SocketAsyncEngine.Linux.cs | 3,848 | Core: ring setup, flag negotiation, CQE drain loop, SQE prep orchestration, event loop hooks, completion slot lifetime, tracked operation management, overflow recovery, SQPOLL wakeup, queue management, feature resolution |
| SocketAsyncEngine.IoUringSlots.Linux.cs | 437 | SoA completion slot allocation, free-list management, native per-slot slab layout, message header inline copy/writeback, zero-copy pin hold transfer, slot encode/decode |
| SocketAsyncEngine.IoUringRings.Linux.cs | 343 | TryMmapRings: maps SQ/CQ/SQE regions, validates mmap offset bounds, derives all ring pointers. CleanupManagedRings: multi-step teardown. LinuxFreeIoUringResources: full teardown orchestration |
| SocketAsyncEngine.IoUringSqeWriters.Linux.cs | 327 | All Write*Sqe methods: send, sendZc, recv, readFixed, providedBufferRecv, multishotRecv, accept, multishotAccept, sendMsg, sendMsgZc, recvMsg, connect, asyncCancel. Deduplicated via WriteSendLikeSqe and WriteSendMsgLikeSqe |
| SocketAsyncEngine.IoUringCompletionDispatch.Linux.cs | 668 | SocketEventHandler partial: DispatchSingleIoUringCompletion, DispatchMultishotIoUringCompletion, DispatchZeroCopyIoUringNotification, multishot accept/recv dispatch, buffer materialization, completion result routing |
| SocketAsyncEngine.IoUringDiagnostics.Linux.cs | 324 | Structured NetEventSource.Info/Error log helpers for all io_uring events: async-cancel failures, queue overflows, CQ overflow entry/completion with branch discriminator, deferred rearm nudge, teardown summary, advanced feature state |
| SocketAsyncEngine.IoUringConfiguration.Linux.cs | 128 | IsIoUringEnabled, IsSqPollRequested, IsZeroCopySendOptedIn, IsIoUringDirectSqeDisabled with [FeatureSwitchDefinition] annotations for JIT-eliminable code paths |
| SocketAsyncEngine.IoUringTestHooks.Linux.cs | 214 | #if DEBUG-gated EAGAIN/ECANCELED forced result injection, per-opcode mask parsing from environment, result application/resolution/restoration |
| SocketAsyncEngine.IoUringTestAccessors.Linux.cs | 938 | Strongly-typed snapshot structs and accessor methods for all testable engine state |

Submission Path: Standard vs. SQPOLL

In standard mode, io_uring_enter submits pending SQEs and optionally waits for CQEs. In SQPOLL mode, a kernel thread continuously polls the SQ ring. Managed code detects idle via Volatile.Read on the mmap'd _managedSqFlagsPtr checking for IORING_SQ_NEED_WAKEUP. When the kernel thread is awake, no io_uring_enter is needed for submission.
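
A minimal sketch of that decision, assuming an unsafe pointer to the mmap'd flags word (the constant matches the kernel ABI; the method and parameters are illustrative):

```csharp
using System.Threading;

static unsafe class SqPollSubmitSketch
{
    private const uint IORING_SQ_NEED_WAKEUP = 1u << 0;

    // sqFlagsPtr: the mmap'd SQ flags word; pendingSqes: SQEs written since the last submit.
    public static bool NeedsEnterForSubmit(uint* sqFlagsPtr, int pendingSqes)
    {
        if (pendingSqes == 0)
            return false;

        // An awake SQPOLL thread picks up new SQEs by itself; io_uring_enter
        // (with IORING_ENTER_SQ_WAKEUP) is only needed once the kernel thread idles.
        return (Volatile.Read(ref *sqFlagsPtr) & IORING_SQ_NEED_WAKEUP) != 0;
    }
}
```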

Flag Negotiation (Peel Loop)

Setup builds an initial flag set: CQSIZE | SUBMIT_ALL | COOP_TASKRUN | SINGLE_ISSUER | NO_SQARRAY | CLOEXEC. SQPOLL (mutually exclusive with DEFER_TASKRUN) or DEFER_TASKRUN is added based on configuration. On EINVAL, flags are peeled in order: NO_SQARRAY first, then CLOEXEC. EPERM is never retried (respects seccomp/kernel policy). After setup, FD_CLOEXEC is set as a fallback via fcntl for kernels where IORING_SETUP_CLOEXEC was peeled.
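
A simplified sketch of the peel loop (the setup delegate stands in for the io_uring_setup P/Invoke; the flag values match the kernel ABI as stated above, and SQPOLL peeling is omitted for brevity):

```csharp
using System;

static class FlagPeelSketch
{
    private const uint IORING_SETUP_NO_SQARRAY = 1u << 16;
    private const uint IORING_SETUP_CLOEXEC = 1u << 19;
    private const int EINVAL = 22;

    // setup returns the ring fd on success or a negative errno on failure.
    public static int SetupWithPeel(Func<uint, int> setup, uint flags)
    {
        // Peel order mirrors the description: NO_SQARRAY first, then CLOEXEC.
        Span<uint> peelOrder = stackalloc uint[] { IORING_SETUP_NO_SQARRAY, IORING_SETUP_CLOEXEC };
        int peelIndex = 0;

        while (true)
        {
            int result = setup(flags);
            if (result >= 0 || result != -EINVAL)
                return result; // success, or a non-EINVAL error (EPERM is never retried)

            // EINVAL: the kernel rejected an unknown flag; peel the next candidate.
            bool peeled = false;
            while (peelIndex < peelOrder.Length)
            {
                uint candidate = peelOrder[peelIndex++];
                if ((flags & candidate) != 0)
                {
                    flags &= ~candidate;
                    peeled = true;
                    break;
                }
            }

            if (!peeled)
                return result; // nothing left to peel; caller falls back to epoll
        }
    }
}
```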

CQ Overflow Recovery State Machine

CQ overflow is detected on every DrainCqeRingBatch entry via ObserveManagedCqOverflowCounter, which compares the mmap'd overflow counter against the last-observed value using wrapping uint32 delta arithmetic. When a delta is seen, the engine enters a three-branch recovery state machine:

  • MultishotAcceptArming: Active when _liveAcceptCompletionSlotCount > 0 and not in teardown. Defers multishot accept re-arm nudges until post-drain.
  • Teardown: Active when _ioUringTeardownInitiated is set. Teardown owns recovery completion.
  • DualWave: Steady-state branch for all other overflow scenarios, including escalation when new overflow occurs during existing recovery.

During overflow recovery, CQ head advances happen per-CQE (not batched) to relieve kernel pressure immediately. Recovery completes when the CQ ring is fully drained and no new overflow delta is observed. On completion: AssertCompletionSlotPoolConsistency validates free-list integrity, telemetry is incremented, and for the MultishotAcceptArming branch, TryQueueDeferredMultishotAcceptRearmAfterRecovery nudges accept contexts.

After recovery completes, a delayed sweep (TrySweepStaleTrackedIoUringOperationsAfterCqOverflowRecovery) fires 250ms later to retire tracked operations whose CQEs were dropped. The sweep skips intentionally long-lived multishot accept and persistent multishot recv slots. Operations still in the waiting state are canceled; already-transitioned operations are detached and their slots freed.
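
The wrapping delta comparison at the heart of the detection step amounts to the following sketch (pointer and field names are illustrative stand-ins for the engine's members):

```csharp
using System.Threading;

static unsafe class OverflowDeltaSketch
{
    // kernelOverflowPtr: the mmap'd CQ overflow counter; lastObserved: the value
    // seen on the previous drain. Returns how many CQEs the kernel dropped since
    // then, correct across uint wrap-around.
    public static uint ObserveOverflowDelta(uint* kernelOverflowPtr, ref uint lastObserved)
    {
        uint current = Volatile.Read(ref *kernelOverflowPtr);
        uint delta = unchecked(current - lastObserved); // wrapping subtraction
        lastObserved = current;
        return delta; // nonzero => enter the three-branch recovery state machine
    }
}
```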


3. Key Data Structures

Completion Slot Pool

Three parallel SoA arrays, all indexed by slot index:

  • IoUringCompletionSlot[] (hot, 32 bytes each, [StructLayout(LayoutKind.Explicit, Size = 32)]):

    • Offset 0: Generation (ulong) - 43-bit generation field
    • Offset 8: FreeListNext (int) - intrusive free list, -1 = end
    • Offset 12: _packedState (uint) - IoUringCompletionOperationKind in low 8 bits, boolean flags IsZeroCopySend/ZeroCopyNotificationPending/UsesFixedRecvBuffer in bits 8-10
    • Offset 16: FixedRecvBufferId (ushort)
    • Offset 24 (#if DEBUG only): TestForcedResult (int)
    • Layout gives exactly 2 slots per 64-byte cache line with zero split-line access
  • IoUringCompletionSlotStorage[] (cold): Per-slot tracked operation reference (TrackedOperation, TrackedOperationGeneration), DangerousRefSocketHandle for fd lifetime, pre-allocated native inline storage slab (NativeMsghdr + 4 IOVectors + 128B socket addr + 128B control + socklen_t), message writeback pointers for recvmsg.

  • MemoryHandle[] (zero-copy pin holds): One System.Buffers.MemoryHandle per slot index, holding the pin for SEND_ZC payloads until the NOTIF CQE arrives.

Layout contract tests verify IoUringCompletionSlot field offsets and the 32-byte total size via reflection on every test run. A Debug.Assert in InitializeCompletionSlotPool fires if the size drifts.
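
Reconstructed from the offsets above, the hot slot has roughly this shape (a sketch of the documented layout, not the engine's actual source):

```csharp
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Explicit, Size = 32)]
struct IoUringCompletionSlotSketch
{
    [FieldOffset(0)] public ulong Generation;          // 43-bit generation, starts at 1
    [FieldOffset(8)] public int FreeListNext;          // intrusive free list, -1 = end
    [FieldOffset(12)] public uint PackedState;         // kind in bits 0-7, flags in bits 8-10
    [FieldOffset(16)] public ushort FixedRecvBufferId;
#if DEBUG
    [FieldOffset(24)] public int TestForcedResult;     // forced-result test hook
#endif
}
```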

Generation Encoding

13-bit slot index (SlotIndexBits = 13, capacity 8192) and 43-bit generation (GenerationBits = 56 - 13 = 43, GenerationMask = (1UL << 43) - 1UL) packed into the 56-bit user_data payload. The upper 8 bits of user_data carry a tag byte (2 = reserved completion, 3 = wakeup signal). Generation is initialized to 1 (not 0) so stale CQEs referencing generation 0 are rejected. On wrap, generation remaps from 2^43-1 back to 1, skipping zero.
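
The packing and the staleness check work out to the following sketch (constant values mirror the description; helper names are illustrative):

```csharp
static class UserDataPackingSketch
{
    private const int SlotIndexBits = 13;                              // capacity 8192
    private const int GenerationBits = 43;
    private const ulong SlotIndexMask = (1UL << SlotIndexBits) - 1;
    private const ulong GenerationMask = (1UL << GenerationBits) - 1;
    private const int TagShift = SlotIndexBits + GenerationBits;       // 56

    public static ulong Encode(byte tag, int slotIndex, ulong generation) =>
        ((ulong)tag << TagShift)
        | ((generation & GenerationMask) << SlotIndexBits)
        | ((ulong)slotIndex & SlotIndexMask);

    public static (byte Tag, int SlotIndex, ulong Generation) Decode(ulong userData) =>
        ((byte)(userData >> TagShift),
         (int)(userData & SlotIndexMask),
         (userData >> SlotIndexBits) & GenerationMask);

    // A CQE is dispatched only if its encoded generation still matches the slot's
    // current generation; generation 0 never matches a live slot, so stale CQEs
    // referencing it are rejected.
    public static bool IsCurrent(ulong userData, ulong slotGeneration) =>
        Decode(userData).Generation == slotGeneration;
}
```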

IoUringCompletionOperationKind

A 3-variant enum (None, Accept, Message) stored in the packed state of each IoUringCompletionSlot. This determines per-completion post-processing behavior: accept completions read sockaddr length from the native slab; message completions copy writeback data from the native msghdr.

IoUringCompletionDispatchKind

A 10-variant enum (Default, ReadOperation, WriteOperation, SendOperation, BufferListSendOperation, BufferMemoryReceiveOperation, BufferListReceiveOperation, ReceiveMessageFromOperation, AcceptOperation, ConnectOperation) stored as a packed integer inside each AsyncOperation, set at operation creation time and consumed at CQE dispatch to route completions without virtual dispatch. Defined in the shared Unix partial class (SocketAsyncContext.Unix.cs) so it compiles on all Unix TFMs.

MPSC Queue

MpscQueue<T> is a lock-free segmented queue with cache-line-padded head/tail pointers and an EnqueueIndex counter per segment. Features:

  • Platform-aware cache line padding: 128-byte on ARM64/LoongArch64, 64-byte otherwise
  • 4-slot unlinked segment cache (guarded by a small Lock) to reduce allocation pressure during burst enqueue patterns
  • Segment recycling limited to segments that lost the tail-link CAS race (never previously published), avoiding the need for producer quiescence tracking
  • Fast path (TryEnqueueFast/TryDequeueFast) inlined for the common non-full/non-empty case
  • IsEmpty property is snapshot-based, not linearizable - a return of true can mean an enqueue is mid-flight

Provided Buffer Ring

IoUringProvidedBufferRing (1,013 lines): Kernel-registered buffer pool for recv operations. Features:

  • Registered with kernel via IORING_REGISTER_PBUF_RING
  • Thread-affinity enforced via Debug.Assert(IsCurrentThreadEventLoopThread()) on resize evaluation
  • Deferred recycle publish: BeginDeferredRecyclePublish/EndDeferredRecyclePublish bracket the CQE drain loop to batch PublishTail calls
  • Adaptive sizing (default OFF): runtime adjustment of buffer size based on utilization via EvaluateProvidedBufferRingResize, gated by System.Net.Sockets.IoUringAdaptiveBufferSizing AppContext switch
  • Hot-swap resize: creates a new ring with an alternating group ID (1 or 2), registers it, unregisters the old one, and disposes it
  • Resize quiescence check: requires InUseCount == 0 and _trackedIoUringOperationCount == 0 before swap
  • Registered buffer support: IORING_REGISTER_BUFFERS for fixed-buffer recv via READ_FIXED opcode

LinuxIoUringCapabilities

An immutable readonly struct snapshot captured after ring setup and stored as _ioUringCapabilities. Exposes IsIoUringPort, Mode, SupportsMultishotRecv, SupportsMultishotAccept, SupportsZeroCopySend, SqPollEnabled, SupportsProvidedBufferRings, and HasRegisteredBuffers. Eliminates scattered per-capability flag reads; the entire capability set is decided once at initialization and updated only for provided-buffer state changes.

IoUringResolvedConfiguration

An immutable readonly struct capturing all resolved configuration inputs at startup: IoUringEnabled, SqPollRequested, DirectSqeDisabled, ZeroCopySendOptedIn, RegisterBuffersEnabled, AdaptiveProvidedBufferSizingEnabled, ProvidedBufferSize, PrepareQueueCapacity, CancellationQueueCapacity. Logged once via SocketsTelemetry.Log.ReportIoUringResolvedConfiguration and NetEventSource.Info.


4. Feature Inventory

Complete Feature Stack

  1. Ring initialization with progressive flag negotiation (SQPOLL -> NO_SQARRAY -> CLOEXEC fallback via fcntl)
  2. Managed ring mmap - SQ ring, CQ ring, and SQE array mapped directly into managed address space; SINGLE_MMAP feature detected for combined SQ/CQ mapping
  3. Direct SQE writes from C# - no P/Invoke for SQE construction; managed code writes to IoUringSqe* pointers via mmap'd ring
  4. Managed CQE drain - reads completions directly from mmap'd CQ ring with batched head-advance (deferred until drain completes, except during overflow recovery)
  5. Completion mode - all socket operations submitted as io_uring ops, not epoll readiness
  6. Multishot accept (kernel 5.19+) - single SQE arms persistent accept; multishot accept state tracked via _multishotAcceptState (0=disarmed, 1=arming, otherwise encoded user_data)
  7. Multishot recv (kernel 6.0+) - persistent recv with provided buffer selection, early-data buffering via _persistentMultishotRecvDataQueue
  8. Provided buffer rings - kernel-managed buffer pool for recv, with deferred recycle publish batching
  9. Adaptive buffer sizing - runtime adjustment of provided buffer size based on utilization (defaults to OFF)
  10. Registered buffers (IORING_REGISTER_BUFFERS) - pre-registered I/O vectors for fixed-buffer recv
  11. Fixed-buffer recv (READ_FIXED) - kernel reads directly into registered buffers
  12. Zero-copy send (SEND_ZC, kernel 6.0+) - avoids kernel buffer copies for large payloads (>16KB)
  13. Zero-copy sendmsg (SENDMSG_ZC, kernel 6.1+) - zero-copy for vectored/message sends
  14. Registered files - file descriptor table registration (used for eventfd)
  15. Registered ring fd (IORING_REGISTER_RING_FD) - eliminates fget/fput on io_uring_enter itself
  16. DEFER_TASKRUN - completions processed on the event loop thread, improving cache locality
  17. SINGLE_ISSUER - kernel optimization for single-threaded submission
  18. SQPOLL (kernel 5.11+, unprivileged 5.12+) - kernel-side submission thread polls the SQ ring; mutually exclusive with DEFER_TASKRUN; requires dual opt-in (AppContext [FeatureSwitchDefinition] + env var); JIT-eliminable when switch is false
  19. EXT_ARG bounded wait - 50ms timeout on io_uring_enter for responsive event loops
  20. Eventfd cross-thread wakeup - MPSC queues + eventfd for thread-safe operation submission
  21. ASYNC_CANCEL - kernel-level cancellation of in-flight operations
  22. Opcode probing (IORING_REGISTER_PROBE) - runtime feature detection per opcode
  23. Completion slot pool - SoA arrays with 32-byte explicit layout, generation-based ABA protection
  24. 43-bit generation field - ~8.8 trillion incarnations per slot before wrap
  25. Precomputed dispatch kind - IoUringCompletionDispatchKind eliminates virtual dispatch on the CQE hot path
  26. CLOEXEC ring fd - IORING_SETUP_CLOEXEC flag with static assert in shim; fcntl fallback; dedicated test
  27. CQ overflow recovery - three-branch state machine with post-recovery stale tracked operation sweep
  28. Test hook injection - forced EAGAIN/ECANCELED results (gated behind #if DEBUG), per-opcode mask
  29. Thread-affinity assertions - [Conditional("DEBUG")] AssertSingleThreadAccess at CQE dispatch entry points; mmap offset bounds validation
  30. Comprehensive telemetry - 10 stable PollingCounters + 17 diagnostic backing fields + structured logging

5. Configuration Surface

Production Environment Variables

| Variable | Values | Default | Purpose |
| --- | --- | --- | --- |
| DOTNET_SYSTEM_NET_SOCKETS_IO_URING | "1" to enable | Disabled | Master enable switch |
| DOTNET_SYSTEM_NET_SOCKETS_IO_URING_SQPOLL | "1" to enable | Disabled | SQPOLL kernel-side polling (also requires AppContext switch) |

Production AppContext Switches

| Switch Name | Type | Default | Purpose |
| --- | --- | --- | --- |
| System.Net.Sockets.UseIoUring | Boolean | false | Master enable switch ([FeatureSwitchDefinition]) |
| System.Net.Sockets.UseIoUringSqPoll | Boolean | false | SQPOLL dual opt-in ([FeatureSwitchDefinition] enables JIT elimination) |
| System.Net.Sockets.IoUringAdaptiveBufferSizing | Boolean | false | Adaptive provided-buffer ring sizing |

Precedence: Environment variable wins over AppContext switch for the master gate. SQPOLL requires both surfaces enabled (dual opt-in).

SQPOLL dual opt-in: Both the AppContext switch AND the environment variable must be enabled. The AppContext switch is the outer gate - if false, IsSqPollRequested() returns immediately without checking the env var, and the JIT can statically eliminate all SQPOLL branches.
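
A hypothetical host-startup snippet showing the dual opt-in (the switch names are the documented ones; the placement is illustrative, and both surfaces must be set before any System.Net.Sockets code runs):

```csharp
// AppContext side of the dual opt-in:
AppContext.SetSwitch("System.Net.Sockets.UseIoUring", true);
AppContext.SetSwitch("System.Net.Sockets.UseIoUringSqPoll", true);

// Environment side, set before process start:
//   DOTNET_SYSTEM_NET_SOCKETS_IO_URING=1
//   DOTNET_SYSTEM_NET_SOCKETS_IO_URING_SQPOLL=1
```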

Debug-Only Test Controls

All DOTNET_SYSTEM_NET_SOCKETS_IO_URING_TEST_* environment variables are gated behind #if DEBUG:

  • TEST_DIRECT_SQE (0/1): disable/enable direct SQE submission
  • TEST_ZERO_COPY_SEND (0/1): disable/enable zero-copy send
  • TEST_REGISTER_BUFFERS: control registered buffer behavior
  • TEST_PROVIDED_BUFFER_SIZE: override provided buffer size
  • TEST_ADAPTIVE_BUFFER_SIZING (1): force adaptive sizing on
  • TEST_PREPARE_QUEUE_CAPACITY: override prepare queue capacity
  • TEST_QUEUE_ENTRIES: override SQ ring size (must be power of 2, 2-1024)
  • TEST_FORCE_EAGAIN_ONCE_MASK: comma-separated opcode names for forced EAGAIN
  • TEST_FORCE_ECANCELED_ONCE_MASK: comma-separated opcode names for forced ECANCELED

6. Safety and Correctness Measures

Fd Lifetime Management

Every direct SQE preparation takes a DangerousAddRef on the socket's SafeSocketHandle, stored in _completionSlotStorage[slotIndex].DangerousRefSocketHandle. This keeps the fd alive from SQE prep through CQE retirement, preventing fd-reuse races after close. The ref is released in FreeCompletionSlot.
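
The pinning contract follows the standard SafeHandle pattern; in this sketch the DangerousAddRef/DangerousRelease calls are the real APIs while the surrounding method is illustrative:

```csharp
using System.Net.Sockets;

static class FdLifetimeSketch
{
    public static bool TryPinForSqe(SafeSocketHandle handle, out SafeSocketHandle? pinned)
    {
        bool success = false;
        handle.DangerousAddRef(ref success); // throws if the handle is already closed

        // The pinned handle is stored in the slot's cold storage entry; the matching
        // DangerousRelease happens in FreeCompletionSlot after the CQE retires, so
        // the fd cannot be reused by the OS while the kernel still references it.
        pinned = success ? handle : null;
        return success;
    }
}
```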

Stale CQE Protection

Generation-based ABA protection. Each completion slot starts at generation 1. On free, generation increments (wrapping from 2^43-1 to 1, skipping 0). CQE dispatch compares the CQE's encoded generation against the slot's current generation; mismatches are silently dropped as stale.

Zero-Copy Send Lifecycle

SEND_ZC produces two CQEs: a data completion and a NOTIF. The slot's IsZeroCopySend and ZeroCopyNotificationPending flags track this two-phase lifecycle. After the first CQE, the slot is kept alive and the tracked operation is reattached via TryReattachTrackedIoUringOperation (generation CAS from 0 to new generation, then operation CAS from null to operation). The NOTIF CQE triggers HandleZeroCopyNotification which frees the slot and releases the pin hold.
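
Routing the two CQEs comes down to the kernel's F_MORE/F_NOTIF flags, roughly as in this sketch (the flag constants match the kernel ABI; the handler delegates are illustrative):

```csharp
using System;

static class SendZcDispatchSketch
{
    private const uint IORING_CQE_F_MORE = 1u << 1;  // another CQE follows for this user_data
    private const uint IORING_CQE_F_NOTIF = 1u << 3; // buffer-release notification CQE

    public static void Dispatch(int res, uint flags, Action<int> completeSend, Action releasePinAndFreeSlot)
    {
        if ((flags & IORING_CQE_F_NOTIF) != 0)
        {
            // Second phase: the kernel no longer references the payload.
            releasePinAndFreeSlot();
            return;
        }

        // First phase: the data result. If F_MORE is set, the NOTIF CQE is still
        // pending and the slot (and pin hold) must stay alive.
        completeSend(res);
        if ((flags & IORING_CQE_F_MORE) == 0)
            releasePinAndFreeSlot(); // kernel posted no separate notification
    }
}
```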

Multishot Accept Arming

The _multishotAcceptState field uses a three-state protocol: 0 (disarmed), 1 (arming - SQE being written but user_data not yet published), or the encoded user_data value itself (armed). GetArmedMultishotAcceptUserDataForCancellation spins briefly if the arming transition is in flight.
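
A sketch of that protocol (field and method names are illustrative; 0 = disarmed, 1 = arming, anything else is the armed user_data):

```csharp
using System.Threading;

sealed class MultishotAcceptStateSketch
{
    private ulong _state;

    public bool TryBeginArming() =>
        Interlocked.CompareExchange(ref _state, 1UL, 0UL) == 0UL;

    public void PublishArmed(ulong userData) =>
        Volatile.Write(ref _state, userData);

    // Mirrors GetArmedMultishotAcceptUserDataForCancellation: spin briefly while
    // an arming transition is in flight.
    public bool TryGetArmedUserData(out ulong userData)
    {
        SpinWait spinner = default;
        while (true)
        {
            ulong state = Volatile.Read(ref _state);
            if (state == 0UL) { userData = 0; return false; }   // disarmed
            if (state != 1UL) { userData = state; return true; } // armed
            spinner.SpinOnce(); // arming: user_data not yet published
        }
    }
}
```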

Teardown Ordering

LinuxFreeIoUringResources follows a strict multi-phase teardown:

  1. Unregister provided buffer ring (needs ring fd)
  2. Mark registered ring fd inactive
  3. Close wakeup eventfd
  4. Unmap rings via CleanupManagedRings (also closes ring fd, terminating SQPOLL thread)
  5. Disable managed flags
  6. Drain queued operations (DrainQueuedIoUringOperationsForTeardown runs twice - once before and once after native port closure to catch late-arriving items)
  7. Drain tracked operations via DrainTrackedIoUringOperationsForTeardown
  8. Clear all aliasing pointers before NativeMemory.Free
  9. Zero all state fields and publish final diagnostics

CleanupManagedRings nulls all mmap-derived pointers before unmapping to prevent use-after-unmap.

Nullable Avoidance

The SQE retry drain path avoids wrapping SocketEventHandler (a struct) in Nullable<T>. Presence is tracked via a separate drainHandlerInitialized boolean, keeping the hot path free of the extra Nullable<T> copies and has-value checks.

SQE Size Validation

TryGetNextManagedSqe checks ringInfo.SqeSize != (uint)sizeof(IoUringSqe) at runtime, catching 128-byte SQE kernels that would corrupt the ring. TryMmapRings additionally rejects SetupSqe128 negotiations.


7. Performance Optimizations

CQ Head Advance Batching

Outside of overflow recovery, CQ head advances are deferred: _managedCachedCqHead is incremented locally and the single Volatile.Write to *_managedCqHeadPtr happens once at the end of the drain batch (in the finally block). During overflow recovery, advances happen per-CQE to relieve kernel pressure.
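
In sketch form (the pointers are illustrative stand-ins for the engine's mmap-derived ring fields, and masking the index by the ring size is elided):

```csharp
using System;
using System.Threading;

static unsafe class CqHeadAdvanceSketch
{
    public static int DrainBatch(uint* cqHeadPtr, uint* cqTailPtr, bool inOverflowRecovery,
                                 Action<uint> dispatchCqeAt)
    {
        uint cachedHead = Volatile.Read(ref *cqHeadPtr);
        int drained = 0;
        try
        {
            while (cachedHead != Volatile.Read(ref *cqTailPtr))
            {
                dispatchCqeAt(cachedHead); // dispatch the CQE at this ring position
                cachedHead++;
                drained++;

                // Overflow recovery: publish per-CQE so the kernel can flush its
                // overflow backlog into freed ring slots immediately.
                if (inOverflowRecovery)
                    Volatile.Write(ref *cqHeadPtr, cachedHead);
            }
        }
        finally
        {
            // Steady state: a single publish for the whole batch.
            Volatile.Write(ref *cqHeadPtr, cachedHead);
        }
        return drained;
    }
}
```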

SQE Zeroing

Each TryGetNextManagedSqe call writes Unsafe.WriteUnaligned(sqe, default(IoUringSqe)) for JIT-vectorized 64-byte zeroing before returning the SQE. This eliminates stale field concerns and enables each Write*Sqe method to write only the fields it needs.

SQE Writer Deduplication

Send-like operations share WriteSendLikeSqe (differing only by opcode: Send vs SendZc). Sendmsg-like operations share WriteSendMsgLikeSqe (SendMsg vs SendMsgZc). This reduces copy-paste without sacrificing readability.

SQE Acquire With Retry

TryAcquireManagedSqeWithRetry attempts up to MaxIoUringSqeAcquireSubmitAttempts (16) rounds. Between retries, it runs DrainCqeRingBatch to free CQ slots, then submits pending SQEs. The drain handler is lazily initialized to avoid struct construction on the fast path.

Completion Slot Drain Recovery

When AllocateCompletionSlot returns -1 (pool exhausted), the engine drains CQEs inline (guarded by _completionSlotDrainInProgress to prevent recursion) and retries allocation.

Provided Buffer Deferred Recycle

BeginDeferredRecyclePublish/EndDeferredRecyclePublish bracket the CQE drain loop. Buffer descriptor writes accumulate without individual Volatile.Write tail publishes. A single tail publish happens at EndDeferredRecyclePublish.
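
The bracket pattern, sketched (the class and members are illustrative stand-ins for IoUringProvidedBufferRing; the kernel's buffer ring tail is 16-bit):

```csharp
using System.Threading;

sealed unsafe class DeferredRecycleSketch
{
    private readonly ushort* _bufferRingTailPtr; // mmap'd tail of the kernel buffer ring
    private ushort _cachedTail;
    private bool _deferPublish;

    public DeferredRecycleSketch(ushort* bufferRingTailPtr) => _bufferRingTailPtr = bufferRingTailPtr;

    public void BeginDeferredRecyclePublish() => _deferPublish = true;

    public void RecycleBuffer()
    {
        // The buffer descriptor is written into the ring slot at _cachedTail here, then:
        _cachedTail++;
        if (!_deferPublish)
            Volatile.Write(ref *_bufferRingTailPtr, _cachedTail); // immediate publish
    }

    public void EndDeferredRecyclePublish()
    {
        _deferPublish = false;
        Volatile.Write(ref *_bufferRingTailPtr, _cachedTail); // one publish per drain batch
    }
}
```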

Diagnostics Polling

Diagnostic counters are polled every IoUringDiagnosticsPollInterval (64) event loop iterations, not on every CQE. Managed deltas are accumulated in per-engine fields and published in batch to SocketsTelemetry.

Lazy Lock Allocation

_multishotAcceptQueueGate and _persistentMultishotRecvDataGate on SocketAsyncContext are lazy-initialized via EnsureLockInitialized (CAS from null). Most sockets never use these paths, so the Lock objects are only allocated when needed.

Event Loop Wait

The event loop first tries a non-blocking DrainCqeRingBatch. If no CQEs are available, it issues io_uring_enter with GETEVENTS and a 50ms EXT_ARG timeout (bounded wait). This trades worst-case 50ms latency for starvation resilience when eventfd wakes are missed or deferred.


8. Telemetry and Observability

Stable PollingCounters (10)

Published when the EventSource is enabled on Linux. Counter names are centralized in IoUringCounterNames:

| Counter | What to watch for |
| --- | --- |
| io-uring-prepare-nonpinnable-fallbacks | Operations that couldn't use direct preparation |
| io-uring-socket-event-buffer-full | Event buffer capacity pressure |
| io-uring-cq-overflows | Event loop can't keep up with kernel completions |
| io-uring-cq-overflow-recoveries | Successful overflow recovery completions |
| io-uring-prepare-queue-overflows | Submission queue capacity pressure |
| io-uring-prepare-queue-overflow-fallbacks | Operations that fell back to readiness dispatch |
| io-uring-completion-slot-exhaustions | Slot capacity pressure |
| io-uring-provided-buffer-depletions | Provided buffer ring ran out of buffers |
| io-uring-sqpoll-wakeups | SQPOLL kernel thread wakeups from idle |
| io-uring-sqpoll-submissions-skipped | Zero-syscall fast path hits (SQPOLL) |

Diagnostic Backing Fields (17)

Written internally for structured logging and test access. Not published as PollingCounters. Include:

  • Async cancel CQE counts
  • Completion requeue failures
  • Zero-copy notification pending slots gauge
  • Prepare queue depth
  • Completion slot drain recoveries
  • Provided buffer current size, recycles, resizes
  • Registered buffer initial/re-registration success and failure
  • Fixed recv selected/fallbacks
  • Persistent multishot recv reuse, termination, early data

Startup Events

  • ReportIoUringResolvedConfiguration: Logged once with all resolved config inputs
  • ReportSocketEngineBackendSelected (event ID 7): Reports io_uring vs. epoll selection and SQPOLL status
  • ReportIoUringSqPollNegotiatedWarning: WARNING-level when SQPOLL is negotiated

Structured Logging

IoUringDiagnostics.Linux.cs centralizes all log helpers with NetEventSource.Info/Error:

  • CQ overflow detection and recovery (with branch discriminator)
  • Async cancel prepare/submit failures (with teardown/runtime origin)
  • Queue overflow events
  • Teardown summary (benign late completion count)
  • Advanced feature state snapshot
  • Untrack mismatches

Collectible via dotnet-counters, dotnet-trace, or any OpenTelemetry-compatible collector.


9. Test Coverage

Test Access Architecture

The test project does not use InternalsVisibleTo. Instead:

  1. IoUringTestAccessors.Linux.cs (938 lines) defines all test-visible snapshot types and accessor methods inside SocketAsyncEngine (production assembly)
  2. InternalTestShims.Linux.cs (644 lines) in the test project mirrors these types and resolves them via reflection
  3. A [DynamicDependency(DynamicallyAccessedMemberTypes.All, "System.Net.Sockets.SocketAsyncEngine", "System.Net.Sockets")] attribute preserves all targets under trimming and AOT

Test Suite (132 test methods across 6,665 lines)

Coverage areas:

  • All operation types: send, recv, accept, connect, sendmsg, recvmsg
  • Completion mode vs. fallback: forced-fallback tests via environment variables
  • Per-opcode disable: env-var-driven opcode disabling for isolation
  • Forced-result injection: EAGAIN and ECANCELED injection per opcode (#if DEBUG)
  • Multishot accept: basic flow, cancellation, queue drain, dispose-during-arming race, one-shot fallback (deterministic via reflection override)
  • Multishot recv: basic iteration, cancellation, peer close, early data buffering, multishot gating by socket type (datagram exclusion)
  • Provided buffers: depletion, recycling, adaptive sizing, registered buffer toggle
  • Zero-copy send: threshold behavior, notification lifecycle, mixed mode
  • SQPOLL mode: basic send/receive, fallback, idle wakeup, multishot recv, zero-copy send, telemetry, SQ_NEED_WAKEUP contract (7 dedicated tests)
  • CQ overflow recovery: five-test suite covering all three branches
    • Test 1: inject overflow, verify telemetry counter increment and slot/op settlement
    • Test 2 (branch a): multishot accept arming during overflow - no silent drop
    • Test 3 (branch b): teardown under overflow - no deadlock within 60s
    • Test 4: DEBUG single-issuer assertion fires on non-event-loop thread
    • Test 5 (branch c): sustained 10s adversarial overflow injection with concurrent workload
  • Layout contracts: NativeMsghdrLayoutContract_IsStable and CompletionSlotLayoutContract_IsStable verify ABI alignment via reflection
  • Reflection target stability: CqOverflow_ReflectionTargets_Stable ensures field names are documented and stable
  • CLOEXEC: RingFd_HasCloexecFlag_Set verifies the FD_CLOEXEC bit via fcntl
  • ARM64 and concurrency: ARM64 MPSC stress, generation-transition stress, concurrent resize-swap
  • Cancellation: concurrent cancel/submit contention, teardown drain
  • Buffer pressure: bounded queue capacity, slot exhaustion recovery
  • Telemetry: stable counter name contract validation, counter increment verification
  • Config: dual opt-in SQPOLL validation, removed-knobs-default-enabled verification
  • Teardown: clean shutdown, resource cleanup
  • Non-pinnable fallback publication: concurrent publisher stress test via reflection shim

Hard to Test In-Process

  • True CQ overflow (requires kernel-level timing control; mitigated by managed overflow counter injection via reflection)
  • RLIMIT_MEMLOCK failures (requires container-level constraints)
  • Kernel version degradation (requires multiple kernel environments)
  • SQPOLL CPU consumption (requires system-level profiling)
  • Real-world latency distributions (requires benchmark infrastructure)

10. Graceful Degradation

| Condition | Behavior |
| --- | --- |
| Kernel < 6.1 | Epoll used |
| Env var not set to "1" (and no AppContext switch) | Epoll used |
| io_uring_setup fails | Epoll fallback |
| SQPOLL not supported (EINVAL or EPERM) | Flag peeled; DEFER_TASKRUN added; engine continues |
| NO_SQARRAY unsupported | Flag peeled; SQ array identity-mapped |
| CLOEXEC unsupported | Flag peeled; fcntl FD_CLOEXEC fallback |
| Opcode probe fails | Advanced opcodes disabled; basic ops still work |
| Provided buffer ring fails | Multishot recv disabled; one-shot recv with inline buffers |
| RLIMIT_MEMLOCK prevents buffer registration | Engine continues without registered buffers |
| Completion slot exhaustion | Drain CQEs inline; retry allocation; fall back to readiness dispatch |
| Prepare queue overflow | Fall back to readiness dispatch for the overflowed op |
| CQ overflow detected | Three-branch recovery state machine; delayed stale sweep |
| SQE ring full | Retry with intermediate submit + CQ drain (up to 16 attempts) |
| NativeMsghdr layout unsupported (non-64-bit) | io_uring disabled entirely |

11. Path to Default-On

  1. Opt-in environment variable (current state)
  2. Extensive testing (CI, stress tests, TechEmpower)
  3. Default-on for kernel >= 6.1 with runtime capability detection
  4. Remove the gate; io_uring is the Linux backend

SQPOLL will likely remain opt-in permanently due to its CPU cost trade-off.

Future Kernel Features

  • Incremental buffer rings (kernel 6.12+): Partial buffer consumption without full ring cycle
  • RecvSend bundles (kernel 6.10+): Single SQE performs recv then send
  • Zero-copy RX (kernel 6.7+): True zero-copy receive sharing NIC ring buffers

12. Distribution Readiness

Kernel Version Matrix

The minimum kernel cutoff is a single 6.1 requirement. All sub-features are detected at runtime via opcode probing.

| Distribution | Version | Kernel | io_uring (6.1+) |
| --- | --- | --- | --- |
| Ubuntu 24.04 LTS | GA | 6.8 | Yes |
| Ubuntu 22.04 LTS | GA | 5.15 | No (epoll fallback) |
| Ubuntu 22.04 LTS | HWE | 6.8 | Yes |
| RHEL 10 | GA | 6.12 | Yes |
| RHEL 9 | GA | 5.14 | No (epoll fallback) |
| Debian 13 (Trixie) | GA | 6.12 | Yes |
| Debian 12 (Bookworm) | GA | 6.1 | Yes |
| Azure Linux 3 | GA | 6.6 | Yes |
| Amazon Linux 2023 | Default | 6.1 | Yes |
| Amazon Linux 2 | Default | 5.10 | No (epoll fallback) |

Memory Overhead

| Component | Size | Notes |
| --- | --- | --- |
| SQ ring | ~16KB | 1024 entries |
| CQ ring | ~64KB | 4096 entries (4x SQ) |
| SQE array | ~64KB | 1024 entries * 64B |
| Provided buffer pool | ~4MB | 1024 * 4KB default |
| Completion slots (hot) | ~256KB | 8192 slots * 32B |
| Completion slot storage (cold) | varies | Managed object array |
| Native per-slot slab | varies | NativeMemory.AllocZeroed |
| Zero-copy pin holds | ~64KB | 8192 * sizeof(MemoryHandle) |
| Total | ~5.5MB+ | Per engine instance (userspace) |

13. Conclusion

This implementation delivers a complete io_uring integration with:

  • 9 partial class files totaling ~7,200 lines for the engine alone, plus 2,721 lines for per-context operation lifecycle, 1,013 lines for the provided buffer ring, 294 lines for the MPSC queue, and 687 lines for telemetry
  • 132 tests across 6,665 lines with full reflection-shim test access pattern
  • 10 stable telemetry counters plus 17 diagnostic backing fields
  • Three-branch CQ overflow recovery with post-recovery stale operation sweep
  • #if DEBUG-gated test hooks for deterministic failure injection
  • [FeatureSwitchDefinition] annotations for JIT elimination of SQPOLL branches
  • Comprehensive fd lifetime management via DangerousAddRef/DangerousRelease
  • Layout contract tests for ABI stability

The managed-ring architecture (minimal native shim + C# ring management) trades a small initial complexity cost for long-term maintainability: standard .NET breakpoints, managed stack traces, EventSource telemetry, and xUnit tests in the same language as the implementation.

The code is production-ready with the current opt-in gate. The environment variable requirement is appropriate for the initial release. Graceful degradation means unexpected issues fall back to the proven epoll path.

Copilot AI review requested due to automatic review settings February 13, 2026 11:18
@dotnet-policy-service bot added the community-contribution label (Indicates that the PR has been added by a community member) Feb 13, 2026
Copilot AI (Contributor) left a comment

Pull request overview

This PR implements an experimental opt-in io_uring-backed socket event engine for Linux as an alternative to epoll. The implementation is comprehensive, including both readiness-based polling (Phase 1) and completion-based I/O operations (Phase 2), along with extensive testing infrastructure and evidence collection tooling.

Changes:

  • Native layer: cmake configuration, PAL networking headers, and io_uring system call integration with graceful epoll fallback
  • Managed layer: socket async engine extensions for io_uring completion handling, operation lifecycle tracking, buffer pinning, and telemetry
  • Testing: comprehensive functional tests, layout contract validation, stress tests, and CI infrastructure for dual-mode test execution
  • Tooling: evidence collection and validation scripts for performance comparison and envelope testing

Reviewed changes

Copilot reviewed 17 out of 18 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/native/libs/configure.cmake | Adds CMake configuration checks for io_uring header and poll32_events struct member |
| src/native/libs/System.Native/pal_networking.h | Defines new io_uring interop structures (IoUringCompletion, IoUringSocketEventPortDiagnostics) and function signatures |
| src/native/libs/System.Native/entrypoints.c | Registers new io_uring-related PAL export entry points |
| src/native/libs/Common/pal_config.h.in | Adds CMake defines for io_uring feature detection |
| src/libraries/System.Net.Sockets/tests/FunctionalTests/TelemetryTest.cs | Adds layout contract tests for io_uring interop structures and telemetry counter verification |
| src/libraries/System.Net.Sockets/tests/FunctionalTests/System.Net.Sockets.Tests.csproj | Implements MSBuild infrastructure for creating io_uring test archive variants (enabled/disabled/default) |
| src/libraries/System.Net.Sockets/tests/FunctionalTests/IoUring.Unix.cs | Adds comprehensive functional and stress tests for io_uring socket workflows |
| src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketsTelemetry.cs | Adds 12 new PollingCounters for io_uring observability metrics |
| src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketPal.Unix.cs | Implements managed wrappers for io_uring prepare operations with error handling |
| src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEngine.Unix.cs | Core io_uring integration: submission batching, completion handling, operation tracking, and diagnostics polling |
| src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs | Operation-level io_uring support: buffer pinning, user_data allocation, completion processing, and state machine |
| src/libraries/Common/src/Interop/Unix/System.Native/Interop.SocketEvent.cs | Defines managed interop structures matching native layout for io_uring operations |
| eng/testing/io-uring/validate-collect-sockets-io-uring-evidence-smoke.sh | Smoke validation script for evidence collection tooling |
| eng/testing/io-uring/collect-sockets-io-uring-evidence.sh | Comprehensive evidence collection script for functional/perf validation and envelope testing |
| docs/workflow/testing/libraries/testing.md | Adds references to io_uring-specific documentation |
| docs/workflow/testing/libraries/testing-linux-sockets-io-uring.md | Detailed validation guide for io_uring backend testing |
| docs/workflow/testing/libraries/io-uring-pr-evidence-template.md | PR evidence template for documenting io_uring validation results |

Copilot AI review requested due to automatic review settings February 13, 2026 11:59
Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 18 changed files in this pull request and generated 1 comment.

Copilot AI review requested due to automatic review settings February 13, 2026 12:51
Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 18 changed files in this pull request and generated 1 comment.

Copilot AI review requested due to automatic review settings February 13, 2026 14:18
Copilot AI left a comment

Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated no new comments.

Copilot AI left a comment

Pull request overview

Copilot reviewed 20 out of 21 changed files in this pull request and generated no new comments.

Copilot AI review requested due to automatic review settings February 14, 2026 01:22
Copilot AI left a comment

Pull request overview

Copilot reviewed 18 out of 20 changed files in this pull request and generated 7 comments.

Copilot AI review requested due to automatic review settings February 14, 2026 05:21
Copilot AI review requested due to automatic review settings February 19, 2026 14:17
Copilot AI left a comment

Pull request overview

Copilot reviewed 28 out of 31 changed files in this pull request and generated no new comments.

…etry batching, packed capabilities, and native shim hardening

Merge-blocking:
- Set SOCK_CLOEXEC | SOCK_NONBLOCK on accept SQEs via AcceptFlags constant
- Add EPERM to IsIgnoredIoUringSubmitError and drain rejected SQEs as failed
  completions instead of re-queuing them (breaks infinite-retry spin)
- Replace SEND_ZC reattach FailFast with Debug.Fail, slot cleanup, and error
  completion; add generation check in HandleZeroCopyNotification

Performance:
- Batch per-CQE telemetry (depletion/recycle/early-data) into drain-batch
  accumulators flushed once per DrainCqeRingBatch
- Replace Interlocked.CompareExchange pair in TryTrackPreparedIoUringOperation
  with Volatile.Read/Write (event-loop-only path)
- Move 5 permanently-true SQE ring invariant checks to one-time init validation
- Convert CQE tag dispatch from switch to if-chain for branch prediction
- Clear only NativeMsghdr header instead of full native storage stride
- Copy buffered multishot recv data outside lock

Capabilities and telemetry:
- Convert LinuxIoUringCapabilities from 8-bool positional struct to packed
  uint flags with fluent With* mutators
- Add slot high-water-mark and cancellation-queue-overflow production counters
- Add capacity planning comments near SlotIndexBits
- Add Debug.Assert on non-EXT_ARG fallback path

Security and resilience:
- Add BitOperations.IsPow2 asserts on kernel-reported ring sizes in TryMmapRings
- Add c_static_assert(sizeof(size_t) >= 8) in native shim
- Add ringFd < 0 validation at all native shim entry points
- Wrap DangerousRelease in try/finally in FreeCompletionSlot
- Block provided-buffer resize during CQ overflow recovery
- Fix FreeIoUringProvidedBufferRing transient inconsistent capability state
- Guard Dispose against freeing registered ring memory
- Replace _persistentMultishotRecvDataQueueCount with computed property
- Use Volatile.Write for teardown TrackedOperationGeneration clear
- Silently ignore TagNone CQEs from ASYNC_CANCEL completions
- Add EINTR comment on native shim CloseFd

Tests:
- Add accepted-socket FD_CLOEXEC and O_NONBLOCK verification test
- Add forced-submit-EPERM graceful degradation test
…buffer ring group ID hardening, sweep re-arm cap, wake circuit-breaker, and test coverage

MpscQueue:
- Co-locate Items and States into single SegmentEntry[] array for cache locality
- Add TryEnqueue with bounded retry (MaxEnqueueSlowAttempts=2048) and SpinWait
  backoff; catch OOM in RentUnlinkedSegment
- Handle TryEnqueue failure at prepare-queue and cancel-queue call sites
- Remove AggressiveInlining from lock-containing Rent/ReturnUnlinkedSegment
- Promote ARM64 concurrent stress test from OuterLoop to regular CI

Code quality:
- Collapse redundant WriteSendSqe/WriteSendZcSqe and WriteSendMsgSqe/
  WriteSendMsgZcSqe wrappers; call WriteSendLikeSqe/WriteSendMsgLikeSqe directly
- Replace string-based telemetry test hook dispatch with IoUringCounterFieldForTest
  enum for compile-time safety
- Centralize 11 debug test env vars into IoUringTestEnvironmentVariables class
- Move s_ioUringResolvedConfigurationLogged to per-engine instance field
- Add SQE zeroing socket-only assumption comment

Resilience:
- Replace fragile group ID toggle (1/2) with sequential allocation starting at
  0x8000 to avoid collision with other io_uring users
- Cap CQ overflow stale-tracked sweep re-arms at 8 with diagnostic log
- Add eventfd wake failure circuit-breaker: after 8 consecutive failures, reduce
  completion wait timeout from 50ms to 1ms; reset on successful wake
- Null out multishot accept sockaddr pointer to eliminate shared-buffer race

Performance:
- Add _nextPreparedReceivePostedWordHint for O(1) common-case bitset search in
  TryAcquireBufferForPreparedReceive; update hint on recycle and acquisition
- Remove AggressiveInlining from IsProvidedBufferResizeQuiescent
- Bound TryAcquireBufferForPreparedReceive retry by word count instead of ring size

Tests:
- CQ overflow recovery with zero tracked operations
- Wakeup eventfd FD_CLOEXEC verification
- SQPOLL + DEFER_TASKRUN mutual exclusivity assertion
- NativeMsghdr 32-bit rejection path
- UDP oversized datagram with zero-length ReceiveFrom buffer
- CounterDelta monotonicity assertion (replaces silent underflow)
- Clarify zero-copy small-buffer test name and forced-error intent
…oA split, cancellation batching, configuration centralization, registered ring fd EINVAL fallback, and test coverage

- Convert static io_uring counters to per-engine instance fields with aggregation
- Group 20+ managed ring mmap fields into ManagedRingState struct with property accessors
- Split TrackedOperation/TrackedOperationGeneration into separate IoUringTrackedOperationState array for cache locality; shrink IoUringCompletionSlot from 32 to 24 bytes
- Batch ProcessCancellation ThreadPool callbacks via static ConcurrentQueue with cooperative worker drain
- Replace ConcurrentQueue<SocketIOEvent> with MpscQueue on Linux via SocketIOEventQueue wrapper
- Centralize configuration resolution into IoUringConfigurationInputs with contradiction validation warnings
- Collapse CounterPair struct into static TryPublishManagedCounterDelta method
- Add registered ring fd EINVAL fallback on all four io_uring_enter call sites (submit, SQPOLL wakeup, EXT_ARG wait, non-EXT_ARG wait)
- Treat kernel EINVAL from submit as drainable error; convert internal invariant violations to ThrowInternalException(string) to bypass drain
- Add MpscQueue drained-segment recycling with slow-path-only producer quiescence tracking
- Add provided-buffer ring OOM test hook and EINTR retry limit test hook in native shim
- Replace ThrowInternalException with Debug.Fail at unreachable/defensive sites in slots and dispatch
- Add tests for generation wrap-around dispatch, fork/exec close-on-exec, queue saturation, slot capacity stress, kernel version fallback, cancellation routing, and MpscQueue OOM recovery
Copilot AI review requested due to automatic review settings February 19, 2026 19:39
Copilot AI left a comment

Pull request overview

Copilot reviewed 29 out of 32 changed files in this pull request and generated 1 comment.

Comment on lines +368 to +370:

```c
// Layout assertions for managed interop structs (kernel struct mirrors).
c_static_assert(sizeof(size_t) >= 8);
c_static_assert(sizeof(size_t) == sizeof(void*));
```

Copilot AI commented Feb 19, 2026:

c_static_assert(sizeof(size_t) >= 8) (and the following pointer-size asserts) will fail compilation on 32-bit Linux, even if io_uring is meant to be disabled there. Consider gating SHIM_HAVE_IO_URING (or just these layout asserts) on 64-bit (e.g., __SIZEOF_POINTER__ == 8) so System.Native still builds for 32-bit targets and the shim can fall back to the stub implementations.

Suggested change:

```c
// Layout assertions for managed interop structs (kernel struct mirrors).
#if defined(__SIZEOF_POINTER__) && __SIZEOF_POINTER__ == 8
c_static_assert(sizeof(size_t) >= 8);
c_static_assert(sizeof(size_t) == sizeof(void*));
#endif
```

…submitter_task, drain all non-EFAULT submit errors, and test coverage
…AULT submit errors, EINVAL registered-ring-fd fallback, and source-specific error context
Copilot AI review requested due to automatic review settings February 19, 2026 20:49
Copilot AI left a comment

Pull request overview

Copilot reviewed 29 out of 32 changed files in this pull request and generated no new comments.

```csharp
if (next is not null)
{
    Interlocked.CompareExchange(ref _tail.Value, next, tail);
}
```

Are we assuming that _tail.Value is eventually consistent? Otherwise, I believe this scenario could end up with the tail having some invalid value.

At the top you grab the current tail:

```csharp
Segment tail = Volatile.Read(ref _tail.Value)!;
```

Then, if the entry array is full, you continue to create a new tail. If that fails you "refresh" the next variable to the current next:

```csharp
next = Volatile.Read(ref tail.Next);
```

Now, assume that your thread context switches out at this point and some other thread(s) enqueue a bunch of items that cause a new tail to be added.

Then we context switch back in, and since tail and _tail.Value are not the same you will set _tail.Value to next, but the value of next points to the previous tail.

Maybe this was addressed by this part of the description:

> Segment recycling limited to segments that lost the tail-link CAS race (never previously published), avoiding the need for producer quiescence tracking

```csharp
while (true)
{
    Segment tail = Volatile.Read(ref _tail.Value)!;
    int index = Interlocked.Increment(ref tail.EnqueueIndex.Value) - 1;
```
@deathly809 commented Feb 19, 2026:

Based on my comment below (or above, if reading from the discussion page): could we cause a race condition that keeps resetting _tail.Value to a previous value? And if we keep incrementing this value, do we hit an integer overflow?

```csharp
{
    get
    {
        Segment head = Volatile.Read(ref _head.Value)!;
```

This seems pretty computationally heavy; is there a reason you can't just have a single _count variable that you atomically increment/decrement and check for 0 here?

```csharp
        fixedRecvBufferId,
        ref completionAuxiliaryData))
{
    completionResultCode = -Interop.Sys.ConvertErrorPalToPlatform(Interop.Error.ENOBUFS);
```

Why the negation? I see you do it below as well. I did a quick search around the repo and only saw this referenced in one other place; they did not negate, and the folks referencing that code don't appear to be negating either.

```c
int32_t state = atomic_load_explicit(&s_forceEnterEintrRetryLimitOnce, memory_order_relaxed);
if (state < 0)
{
    const char* configuredValue = getenv(SHIM_TEST_FORCE_ENTER_EINTR_RETRY_LIMIT_ONCE_ENV);
```

Should this be behind a #ifdef DEBUG?

```csharp
private const string ConnectActivityName = ActivitySourceName + ".Connect";
private static readonly ActivitySource s_connectActivitySource = new ActivitySource(ActivitySourceName);

internal static class Keywords
```

Maybe IoUringKeywords would be a more descriptive name.

```csharp
#if DEBUG
// Test-only knob to make wait-buffer saturation deterministic for io_uring diagnostics coverage.
// Only available in DEBUG builds so production code never reads test env vars.
if (OperatingSystem.IsLinux())
```

@deathly809 commented Feb 19, 2026:

Should you also check DOTNET_SYSTEM_NET_SOCKETS_IO_URING or do we assume that DOTNET_SYSTEM_NET_SOCKETS_IO_URING_TEST_EVENT_BUFFER_COUNT is only set when the feature flag is enabled?

```csharp
try
{
    RecordAndAssertEventLoopThreadIdentity();
    LinuxEventLoopEnableRings();
```

@deathly809 commented Feb 19, 2026:

Wonder if these could be more generic, i.e.:

  • LinuxEventLoopEnableRings -> EventLoopInit
  • LinuxEventLoopBeforeWait -> EventLoopBeforeWait
  • LinuxEventLoopTryCompletionWait -> EventLoopTryCompleteWait
  • etc.

Hmm, I guess it would be an issue if someone wanted to add their own "EventLoopInit" or equivalent for the other methods :)

```csharp
}
else
{
    Debug.Assert(
```

Does this mean we have not tested this on kernels before 6.1?

```csharp
{
    // Snapshot the wakeup generation counter before entering the blocking syscall.
    // After waking, we compare to detect wakeups that arrived during the syscall.
    uint wakeGenBefore = Volatile.Read(ref _ioUringWakeupGeneration);
```

You'll need to define this outside the if statement so you can reference it after the if/else.

Labels: area-System.Net.Sockets, community-contribution