Use io_uring for sockets on Linux #124374

Open

benaadams wants to merge 258 commits into dotnet:main from benaadams:io_uring

Conversation

@benaadams (Member) commented Feb 13, 2026

Contributes to #753

1. Summary

This document describes the complete, production-grade io_uring socket I/O engine in .NET's System.Net.Sockets layer.

When enabled via DOTNET_SYSTEM_NET_SOCKETS_IO_URING=1 on Linux kernel 6.1+, the engine replaces epoll with a managed io_uring completion-mode backend that:

  • Directly writes SQEs to mmap'd kernel ring buffers from C#
  • Processes CQEs inline on the event loop thread
  • Supports multishot accept, multishot recv with provided buffer rings, zero-copy send (SEND_ZC/SENDMSG_ZC), registered files, registered buffers, adaptive buffer sizing, and SQPOLL kernel-side submission polling
  • Recovers safely from CQ overflow across three discriminated branches
  • Sweeps stale tracked operations after CQ overflow recovery via a delayed-deadline mechanism

The native shim is intentionally minimal - 433 lines of C wrapping the three io_uring syscalls (setup, enter, register) plus eventfd and mmap helpers. All ring management, SQE construction, CQE dispatch, operation lifecycle, feature negotiation, overflow recovery, and SQPOLL wakeup detection live in managed code.

The engine proper is organized as eight partial class files extending SocketAsyncEngine: the main file (SocketAsyncEngine.Linux.cs, 3848 lines) holds ring setup, flag negotiation, CQE drain, SQE prep orchestration, completion slot layout, and the event loop; the remaining seven partials handle ring mmap lifecycle (IoUringRings, 343 lines), completion slot pool management (IoUringSlots, 437 lines), SQE writing (IoUringSqeWriters, 327 lines), completion dispatch (IoUringCompletionDispatch, 668 lines), diagnostics logging (IoUringDiagnostics, 324 lines), configuration resolution (IoUringConfiguration, 128 lines), and debug test hooks (IoUringTestHooks, 214 lines). A separate IoUringTestAccessors.Linux.cs file (938 lines) exposes all test-observable state through strongly-typed accessors. Tests access this surface through InternalTestShims.Linux.cs (644 lines), a centralized reflection shim with [DynamicDependency] annotations for trimmer/AOT safety.

Key metrics:

| Metric | Value |
| --- | --- |
| Partial class files (SocketAsyncEngine) | 9 (main + 8 partials) |
| New managed source lines (socket layer) | ~11,400 |
| Native shim lines | ~433 (C) + 27 (header) |
| New tests | ~132 (ConditionalFact/ConditionalTheory in IoUring.Unix.cs) |
| Test lines | ~6,665 (IoUring.Unix.cs) + 644 (InternalTestShims) |
| Breaking API changes | 0 - purely additive, behind opt-in env var |

2. Architecture

Ring Ownership and Event Loop

The architecture follows the SINGLE_ISSUER contract: exactly one thread - the event loop thread - owns the io_uring instance. All ring mutations (SQE writes, CQ head advances, io_uring_enter calls) happen on this thread. Other threads communicate via two MPSC queues.

```mermaid
graph TD
    WT[Worker Threads] -->|"MpscQueue<IoUringPrepareWorkItem>"| EL[Event Loop Thread]
    WT -->|"MpscQueue<ulong> (cancel)"| EL
    WT -->|"eventfd write (wake)"| EL
    EL -->|"Writes SQEs / Drains CQEs / io_uring_enter"| K[Kernel - io_uring]
    K -->|"CQE completions"| EL
    EL -->|"ThreadPool.QueueUserWorkItem"| TP[ThreadPool]
```

The Thin Native Shim Approach

The native shim (pal_io_uring_shim.c, 433 lines) wraps exactly:

  • io_uring_setup (via syscall(__NR_io_uring_setup, ...) with SYS_io_uring_setup fallback)
  • io_uring_enter (with and without EXT_ARG)
  • io_uring_register
  • mmap / munmap (for ring mapping)
  • eventfd / read / write (for cross-thread wakeup; EINTR-looped)
  • uname (for kernel version detection)

All ring pointer arithmetic, SQE field population, CQE parsing, SQPOLL wakeup detection (via Volatile.Read on the mmap'd SQ flags word), overflow recovery, and operation lifecycle management happen in managed C#. This is deliberate:

  • Managed code is easier to debug, profile, and modify. The JIT can inline hot paths. No P/Invoke on the SQE write path.
  • The shim compiles on any Linux with <linux/io_uring.h> - no liburing dependency.
  • Feature negotiation (flag peeling, opcode probing) is entirely managed and testable.
  • The one cost is exact ABI-level knowledge of kernel structs, mitigated by _Static_assert(IORING_SETUP_CLOEXEC == (1U << 19), ...) in the shim and layout contract tests in C#.

Threading Model

The event loop thread owns:

  • The io_uring ring fd and all mmap'd ring pointers
  • All SQE writes and CQ drains
  • The _completionSlots[] / _completionSlotStorage[] arrays
  • Eventfd registered-file entry management
  • Adaptive buffer sizing evaluation
  • SQPOLL idle detection via SQ_NEED_WAKEUP on the mmap'd SQ flags pointer
  • CQ overflow recovery state machine

Worker threads interact solely through the following (a sketch of the publish pattern follows the list):

  • TryEnqueueIoUringPreparation() -> MPSC prepare queue -> eventfd write
  • TryRequestIoUringCancellation() -> MPSC cancel queue -> eventfd write
  • Volatile.Read on _ioUringTeardownInitiated to avoid publishing work after shutdown
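
The enqueue-and-wake pattern amounts to roughly the following sketch. It is illustrative only: the real engine uses its own MpscQueue<T> and a registered eventfd, while this stand-in uses ConcurrentQueue and a wake delegate.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

sealed class PrepareQueueSketch
{
    private readonly ConcurrentQueue<object> _prepareQueue = new(); // stand-in for MpscQueue<IoUringPrepareWorkItem>
    private readonly Action _writeWakeupEventfd;                    // stand-in for the eventfd write P/Invoke
    private int _teardownInitiated;                                 // mirrors _ioUringTeardownInitiated

    public PrepareQueueSketch(Action writeWakeupEventfd) => _writeWakeupEventfd = writeWakeupEventfd;

    public bool TryEnqueuePreparation(object workItem)
    {
        // Never publish work after teardown has begun.
        if (Volatile.Read(ref _teardownInitiated) != 0)
            return false;

        _prepareQueue.Enqueue(workItem);

        // Wake the event loop; only that thread touches the ring (SINGLE_ISSUER),
        // so the worker never writes SQEs itself.
        _writeWakeupEventfd();
        return true;
    }
}
```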

Partial Class File Organization

| File | Lines | Responsibility |
| --- | --- | --- |
| SocketAsyncEngine.Linux.cs | 3,848 | Core: ring setup, flag negotiation, CQE drain loop, SQE prep orchestration, event loop hooks, completion slot lifetime, tracked operation management, overflow recovery, SQPOLL wakeup, queue management, feature resolution |
| SocketAsyncEngine.IoUringSlots.Linux.cs | 437 | SoA completion slot allocation, free-list management, native per-slot slab layout, message header inline copy/writeback, zero-copy pin hold transfer, slot encode/decode |
| SocketAsyncEngine.IoUringRings.Linux.cs | 343 | TryMmapRings: maps SQ/CQ/SQE regions, validates mmap offset bounds, derives all ring pointers. CleanupManagedRings: multi-step teardown. LinuxFreeIoUringResources: full teardown orchestration |
| SocketAsyncEngine.IoUringSqeWriters.Linux.cs | 327 | All Write*Sqe methods: send, sendZc, recv, readFixed, providedBufferRecv, multishotRecv, accept, multishotAccept, sendMsg, sendMsgZc, recvMsg, connect, asyncCancel. Deduplicated via WriteSendLikeSqe and WriteSendMsgLikeSqe |
| SocketAsyncEngine.IoUringCompletionDispatch.Linux.cs | 668 | SocketEventHandler partial: DispatchSingleIoUringCompletion, DispatchMultishotIoUringCompletion, DispatchZeroCopyIoUringNotification, multishot accept/recv dispatch, buffer materialization, completion result routing |
| SocketAsyncEngine.IoUringDiagnostics.Linux.cs | 324 | Structured NetEventSource.Info/Error log helpers for all io_uring events: async-cancel failures, queue overflows, CQ overflow entry/completion with branch discriminator, deferred rearm nudge, teardown summary, advanced feature state |
| SocketAsyncEngine.IoUringConfiguration.Linux.cs | 128 | IsIoUringEnabled, IsSqPollRequested, IsZeroCopySendOptedIn, IsIoUringDirectSqeDisabled with [FeatureSwitchDefinition] annotations for JIT-eliminable code paths |
| SocketAsyncEngine.IoUringTestHooks.Linux.cs | 214 | #if DEBUG-gated EAGAIN/ECANCELED forced result injection, per-opcode mask parsing from environment, result application/resolution/restoration |
| SocketAsyncEngine.IoUringTestAccessors.Linux.cs | 938 | Strongly-typed snapshot structs and accessor methods for all testable engine state |

Submission Path: Standard vs. SQPOLL

In standard mode, io_uring_enter submits pending SQEs and optionally waits for CQEs. In SQPOLL mode, a kernel thread continuously polls the SQ ring. Managed code detects idle via Volatile.Read on the mmap'd _managedSqFlagsPtr checking for IORING_SQ_NEED_WAKEUP. When the kernel thread is awake, no io_uring_enter is needed for submission.
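
A minimal sketch of that decision, assuming an unsafe pointer to the mmap'd flags word (the constant matches the kernel ABI; the method and parameters are illustrative):

```csharp
using System.Threading;

static unsafe class SqPollSubmitSketch
{
    private const uint IORING_SQ_NEED_WAKEUP = 1u << 0;

    // sqFlagsPtr: the mmap'd SQ flags word; pendingSqes: SQEs written since the last submit.
    public static bool NeedsEnterForSubmit(uint* sqFlagsPtr, int pendingSqes)
    {
        if (pendingSqes == 0)
            return false;

        // An awake SQPOLL thread picks up new SQEs by itself; io_uring_enter
        // (with IORING_ENTER_SQ_WAKEUP) is only needed once the kernel thread idles.
        return (Volatile.Read(ref *sqFlagsPtr) & IORING_SQ_NEED_WAKEUP) != 0;
    }
}
```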

Flag Negotiation (Peel Loop)

Setup builds an initial flag set: CQSIZE | SUBMIT_ALL | COOP_TASKRUN | SINGLE_ISSUER | NO_SQARRAY | CLOEXEC. SQPOLL (mutually exclusive with DEFER_TASKRUN) or DEFER_TASKRUN is added based on configuration. On EINVAL, flags are peeled in order: NO_SQARRAY first, then CLOEXEC. EPERM is never retried (respects seccomp/kernel policy). After setup, FD_CLOEXEC is set as a fallback via fcntl for kernels where IORING_SETUP_CLOEXEC was peeled.
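
A simplified sketch of the peel loop (the setup delegate stands in for the io_uring_setup P/Invoke; the flag values match the kernel ABI as stated above, and SQPOLL peeling is omitted for brevity):

```csharp
using System;

static class FlagPeelSketch
{
    private const uint IORING_SETUP_NO_SQARRAY = 1u << 16;
    private const uint IORING_SETUP_CLOEXEC = 1u << 19;
    private const int EINVAL = 22;

    // setup returns the ring fd on success or a negative errno on failure.
    public static int SetupWithPeel(Func<uint, int> setup, uint flags)
    {
        // Peel order mirrors the description: NO_SQARRAY first, then CLOEXEC.
        Span<uint> peelOrder = stackalloc uint[] { IORING_SETUP_NO_SQARRAY, IORING_SETUP_CLOEXEC };
        int peelIndex = 0;

        while (true)
        {
            int result = setup(flags);
            if (result >= 0 || result != -EINVAL)
                return result; // success, or a non-EINVAL error (EPERM is never retried)

            // EINVAL: the kernel rejected an unknown flag; peel the next candidate.
            bool peeled = false;
            while (peelIndex < peelOrder.Length)
            {
                uint candidate = peelOrder[peelIndex++];
                if ((flags & candidate) != 0)
                {
                    flags &= ~candidate;
                    peeled = true;
                    break;
                }
            }

            if (!peeled)
                return result; // nothing left to peel; caller falls back to epoll
        }
    }
}
```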

CQ Overflow Recovery State Machine

CQ overflow is detected on every DrainCqeRingBatch entry via ObserveManagedCqOverflowCounter, which compares the mmap'd overflow counter against the last-observed value using wrapping uint32 delta arithmetic. When a delta is seen, the engine enters a three-branch recovery state machine:

  • MultishotAcceptArming: Active when _liveAcceptCompletionSlotCount > 0 and not in teardown. Defers multishot accept re-arm nudges until post-drain.
  • Teardown: Active when _ioUringTeardownInitiated is set. Teardown owns recovery completion.
  • DualWave: Steady-state branch for all other overflow scenarios, including escalation when new overflow occurs during existing recovery.

During overflow recovery, CQ head advances happen per-CQE (not batched) to relieve kernel pressure immediately. Recovery completes when the CQ ring is fully drained and no new overflow delta is observed. On completion: AssertCompletionSlotPoolConsistency validates free-list integrity, telemetry is incremented, and for the MultishotAcceptArming branch, TryQueueDeferredMultishotAcceptRearmAfterRecovery nudges accept contexts.

After recovery completes, a delayed sweep (TrySweepStaleTrackedIoUringOperationsAfterCqOverflowRecovery) fires 250ms later to retire tracked operations whose CQEs were dropped. The sweep skips intentionally long-lived multishot accept and persistent multishot recv slots. Operations still in the waiting state are canceled; already-transitioned operations are detached and their slots freed.
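
The wrapping delta comparison at the heart of the detection step amounts to the following sketch (pointer and field names are illustrative stand-ins for the engine's members):

```csharp
using System.Threading;

static unsafe class OverflowDeltaSketch
{
    // kernelOverflowPtr: the mmap'd CQ overflow counter; lastObserved: the value
    // seen on the previous drain. Returns how many CQEs the kernel dropped since
    // then, correct across uint wrap-around.
    public static uint ObserveOverflowDelta(uint* kernelOverflowPtr, ref uint lastObserved)
    {
        uint current = Volatile.Read(ref *kernelOverflowPtr);
        uint delta = unchecked(current - lastObserved); // wrapping subtraction
        lastObserved = current;
        return delta; // nonzero => enter the three-branch recovery state machine
    }
}
```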


3. Key Data Structures

Completion Slot Pool

Three parallel SoA arrays, all indexed by slot index:

  • IoUringCompletionSlot[] (hot, 32 bytes each, [StructLayout(LayoutKind.Explicit, Size = 32)]):

    • Offset 0: Generation (ulong) - 43-bit generation field
    • Offset 8: FreeListNext (int) - intrusive free list, -1 = end
    • Offset 12: _packedState (uint) - IoUringCompletionOperationKind in low 8 bits, boolean flags IsZeroCopySend/ZeroCopyNotificationPending/UsesFixedRecvBuffer in bits 8-10
    • Offset 16: FixedRecvBufferId (ushort)
    • Offset 24 (#if DEBUG only): TestForcedResult (int)
    • Layout gives exactly 2 slots per 64-byte cache line with zero split-line access
  • IoUringCompletionSlotStorage[] (cold): Per-slot tracked operation reference (TrackedOperation, TrackedOperationGeneration), DangerousRefSocketHandle for fd lifetime, pre-allocated native inline storage slab (NativeMsghdr + 4 IOVectors + 128B socket addr + 128B control + socklen_t), message writeback pointers for recvmsg.

  • MemoryHandle[] (zero-copy pin holds): One System.Buffers.MemoryHandle per slot index, holding the pin for SEND_ZC payloads until the NOTIF CQE arrives.

Layout contract tests verify IoUringCompletionSlot field offsets and the 32-byte total size via reflection on every test run. A Debug.Assert in InitializeCompletionSlotPool fires if the size drifts.
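
Reconstructed from the offsets above, the hot slot has roughly this shape (a sketch of the documented layout, not the engine's actual source):

```csharp
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Explicit, Size = 32)]
struct IoUringCompletionSlotSketch
{
    [FieldOffset(0)] public ulong Generation;          // 43-bit generation, starts at 1
    [FieldOffset(8)] public int FreeListNext;          // intrusive free list, -1 = end
    [FieldOffset(12)] public uint PackedState;         // kind in bits 0-7, flags in bits 8-10
    [FieldOffset(16)] public ushort FixedRecvBufferId;
#if DEBUG
    [FieldOffset(24)] public int TestForcedResult;     // forced-result test hook
#endif
}
```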

Generation Encoding

13-bit slot index (SlotIndexBits = 13, capacity 8192) and 43-bit generation (GenerationBits = 56 - 13 = 43, GenerationMask = (1UL << 43) - 1UL) packed into the 56-bit user_data payload. The upper 8 bits of user_data carry a tag byte (2 = reserved completion, 3 = wakeup signal). Generation is initialized to 1 (not 0) so stale CQEs referencing generation 0 are rejected. On wrap, generation remaps from 2^43-1 back to 1, skipping zero.
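
The packing and the staleness check work out to the following sketch (constant values mirror the description; helper names are illustrative):

```csharp
static class UserDataPackingSketch
{
    private const int SlotIndexBits = 13;                              // capacity 8192
    private const int GenerationBits = 43;
    private const ulong SlotIndexMask = (1UL << SlotIndexBits) - 1;
    private const ulong GenerationMask = (1UL << GenerationBits) - 1;
    private const int TagShift = SlotIndexBits + GenerationBits;       // 56

    public static ulong Encode(byte tag, int slotIndex, ulong generation) =>
        ((ulong)tag << TagShift)
        | ((generation & GenerationMask) << SlotIndexBits)
        | ((ulong)slotIndex & SlotIndexMask);

    public static (byte Tag, int SlotIndex, ulong Generation) Decode(ulong userData) =>
        ((byte)(userData >> TagShift),
         (int)(userData & SlotIndexMask),
         (userData >> SlotIndexBits) & GenerationMask);

    // A CQE is dispatched only if its encoded generation still matches the slot's
    // current generation; generation 0 never matches a live slot, so stale CQEs
    // referencing it are rejected.
    public static bool IsCurrent(ulong userData, ulong slotGeneration) =>
        Decode(userData).Generation == slotGeneration;
}
```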

IoUringCompletionOperationKind

A 3-variant enum (None, Accept, Message) stored in the packed state of each IoUringCompletionSlot. This determines per-completion post-processing behavior: accept completions read sockaddr length from the native slab; message completions copy writeback data from the native msghdr.

IoUringCompletionDispatchKind

A 10-variant enum (Default, ReadOperation, WriteOperation, SendOperation, BufferListSendOperation, BufferMemoryReceiveOperation, BufferListReceiveOperation, ReceiveMessageFromOperation, AcceptOperation, ConnectOperation) stored as a packed integer inside each AsyncOperation, set at operation creation time and consumed at CQE dispatch to route completions without virtual dispatch. Defined in the shared Unix partial class (SocketAsyncContext.Unix.cs) so it compiles on all Unix TFMs.

MPSC Queue

MpscQueue<T> is a lock-free segmented queue with cache-line-padded head/tail pointers and an EnqueueIndex counter per segment. Features:

  • Platform-aware cache line padding: 128-byte on ARM64/LoongArch64, 64-byte otherwise
  • 4-slot unlinked segment cache (guarded by a small Lock) to reduce allocation pressure during burst enqueue patterns
  • Segment recycling limited to segments that lost the tail-link CAS race (never previously published), avoiding the need for producer quiescence tracking
  • Fast path (TryEnqueueFast/TryDequeueFast) inlined for the common non-full/non-empty case
  • IsEmpty property is snapshot-based, not linearizable - a return of true can mean an enqueue is mid-flight

Provided Buffer Ring

IoUringProvidedBufferRing (1,013 lines): Kernel-registered buffer pool for recv operations. Features:

  • Registered with kernel via IORING_REGISTER_PBUF_RING
  • Thread-affinity enforced via Debug.Assert(IsCurrentThreadEventLoopThread()) on resize evaluation
  • Deferred recycle publish: BeginDeferredRecyclePublish/EndDeferredRecyclePublish bracket the CQE drain loop to batch PublishTail calls
  • Adaptive sizing (default OFF): runtime adjustment of buffer size based on utilization via EvaluateProvidedBufferRingResize, gated by System.Net.Sockets.IoUringAdaptiveBufferSizing AppContext switch
  • Hot-swap resize: creates a new ring with an alternating group ID (1 or 2), registers it, unregisters the old one, and disposes it
  • Resize quiescence check: requires InUseCount == 0 and _trackedIoUringOperationCount == 0 before swap
  • Registered buffer support: IORING_REGISTER_BUFFERS for fixed-buffer recv via READ_FIXED opcode

LinuxIoUringCapabilities

An immutable readonly struct snapshot captured after ring setup and stored as _ioUringCapabilities. Exposes IsIoUringPort, Mode, SupportsMultishotRecv, SupportsMultishotAccept, SupportsZeroCopySend, SqPollEnabled, SupportsProvidedBufferRings, and HasRegisteredBuffers. Eliminates scattered per-capability flag reads; the entire capability set is decided once at initialization and updated only for provided-buffer state changes.

IoUringResolvedConfiguration

An immutable readonly struct capturing all resolved configuration inputs at startup: IoUringEnabled, SqPollRequested, DirectSqeDisabled, ZeroCopySendOptedIn, RegisterBuffersEnabled, AdaptiveProvidedBufferSizingEnabled, ProvidedBufferSize, PrepareQueueCapacity, CancellationQueueCapacity. Logged once via SocketsTelemetry.Log.ReportIoUringResolvedConfiguration and NetEventSource.Info.


4. Feature Inventory

Complete Feature Stack

  1. Ring initialization with progressive flag negotiation (SQPOLL -> NO_SQARRAY -> CLOEXEC fallback via fcntl)
  2. Managed ring mmap - SQ ring, CQ ring, and SQE array mapped directly into managed address space; SINGLE_MMAP feature detected for combined SQ/CQ mapping
  3. Direct SQE writes from C# - no P/Invoke for SQE construction; managed code writes to IoUringSqe* pointers via mmap'd ring
  4. Managed CQE drain - reads completions directly from mmap'd CQ ring with batched head-advance (deferred until drain completes, except during overflow recovery)
  5. Completion mode - all socket operations submitted as io_uring ops, not epoll readiness
  6. Multishot accept (kernel 5.19+) - single SQE arms persistent accept; multishot accept state tracked via _multishotAcceptState (0=disarmed, 1=arming, otherwise encoded user_data)
  7. Multishot recv (kernel 6.0+) - persistent recv with provided buffer selection, early-data buffering via _persistentMultishotRecvDataQueue
  8. Provided buffer rings - kernel-managed buffer pool for recv, with deferred recycle publish batching
  9. Adaptive buffer sizing - runtime adjustment of provided buffer size based on utilization (defaults to OFF)
  10. Registered buffers (IORING_REGISTER_BUFFERS) - pre-registered I/O vectors for fixed-buffer recv
  11. Fixed-buffer recv (READ_FIXED) - kernel reads directly into registered buffers
  12. Zero-copy send (SEND_ZC, kernel 6.0+) - avoids kernel buffer copies for large payloads (>16KB)
  13. Zero-copy sendmsg (SENDMSG_ZC, kernel 6.1+) - zero-copy for vectored/message sends
  14. Registered files - file descriptor table registration (used for eventfd)
  15. Registered ring fd (IORING_REGISTER_RING_FD) - eliminates fget/fput on io_uring_enter itself
  16. DEFER_TASKRUN - completions processed on the event loop thread, improving cache locality
  17. SINGLE_ISSUER - kernel optimization for single-threaded submission
  18. SQPOLL (kernel 5.11+, unprivileged 5.12+) - kernel-side submission thread polls the SQ ring; mutually exclusive with DEFER_TASKRUN; requires dual opt-in (AppContext [FeatureSwitchDefinition] + env var); JIT-eliminable when switch is false
  19. EXT_ARG bounded wait - 50ms timeout on io_uring_enter for responsive event loops
  20. Eventfd cross-thread wakeup - MPSC queues + eventfd for thread-safe operation submission
  21. ASYNC_CANCEL - kernel-level cancellation of in-flight operations
  22. Opcode probing (IORING_REGISTER_PROBE) - runtime feature detection per opcode
  23. Completion slot pool - SoA arrays with 32-byte explicit layout, generation-based ABA protection
  24. 43-bit generation field - ~8.8 trillion incarnations per slot before wrap
  25. Precomputed dispatch kind - IoUringCompletionDispatchKind eliminates virtual dispatch on the CQE hot path
  26. CLOEXEC ring fd - IORING_SETUP_CLOEXEC flag with static assert in shim; fcntl fallback; dedicated test
  27. CQ overflow recovery - three-branch state machine with post-recovery stale tracked operation sweep
  28. Test hook injection - forced EAGAIN/ECANCELED results (gated behind #if DEBUG), per-opcode mask
  29. Thread-affinity assertions - [Conditional("DEBUG")] AssertSingleThreadAccess at CQE dispatch entry points; mmap offset bounds validation
  30. Comprehensive telemetry - 10 stable PollingCounters + 17 diagnostic backing fields + structured logging

5. Configuration Surface

Production Environment Variables

| Variable | Values | Default | Purpose |
| --- | --- | --- | --- |
| DOTNET_SYSTEM_NET_SOCKETS_IO_URING | "1" to enable | Disabled | Master enable switch |
| DOTNET_SYSTEM_NET_SOCKETS_IO_URING_SQPOLL | "1" to enable | Disabled | SQPOLL kernel-side polling (also requires AppContext switch) |

Production AppContext Switches

| Switch Name | Type | Default | Purpose |
| --- | --- | --- | --- |
| System.Net.Sockets.UseIoUring | Boolean | false | Master enable switch ([FeatureSwitchDefinition]) |
| System.Net.Sockets.UseIoUringSqPoll | Boolean | false | SQPOLL dual opt-in ([FeatureSwitchDefinition] enables JIT elimination) |
| System.Net.Sockets.IoUringAdaptiveBufferSizing | Boolean | false | Adaptive provided-buffer ring sizing |

Precedence: Environment variable wins over AppContext switch for the master gate. SQPOLL requires both surfaces enabled (dual opt-in).

SQPOLL dual opt-in: Both the AppContext switch AND the environment variable must be enabled. The AppContext switch is the outer gate - if false, IsSqPollRequested() returns immediately without checking the env var, and the JIT can statically eliminate all SQPOLL branches.
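
A hypothetical host-startup snippet showing the dual opt-in (the switch names are the documented ones; the placement is illustrative, and both surfaces must be set before any System.Net.Sockets code runs):

```csharp
// AppContext side of the dual opt-in:
AppContext.SetSwitch("System.Net.Sockets.UseIoUring", true);
AppContext.SetSwitch("System.Net.Sockets.UseIoUringSqPoll", true);

// Environment side, set before process start:
//   DOTNET_SYSTEM_NET_SOCKETS_IO_URING=1
//   DOTNET_SYSTEM_NET_SOCKETS_IO_URING_SQPOLL=1
```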

Debug-Only Test Controls

All DOTNET_SYSTEM_NET_SOCKETS_IO_URING_TEST_* environment variables are gated behind #if DEBUG:

  • TEST_DIRECT_SQE (0/1): disable/enable direct SQE submission
  • TEST_ZERO_COPY_SEND (0/1): disable/enable zero-copy send
  • TEST_REGISTER_BUFFERS: control registered buffer behavior
  • TEST_PROVIDED_BUFFER_SIZE: override provided buffer size
  • TEST_ADAPTIVE_BUFFER_SIZING (1): force adaptive sizing on
  • TEST_PREPARE_QUEUE_CAPACITY: override prepare queue capacity
  • TEST_QUEUE_ENTRIES: override SQ ring size (must be power of 2, 2-1024)
  • TEST_FORCE_EAGAIN_ONCE_MASK: comma-separated opcode names for forced EAGAIN
  • TEST_FORCE_ECANCELED_ONCE_MASK: comma-separated opcode names for forced ECANCELED

6. Safety and Correctness Measures

Fd Lifetime Management

Every direct SQE preparation takes a DangerousAddRef on the socket's SafeSocketHandle, stored in _completionSlotStorage[slotIndex].DangerousRefSocketHandle. This keeps the fd alive from SQE prep through CQE retirement, preventing fd-reuse races after close. The ref is released in FreeCompletionSlot.
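
The pinning contract follows the standard SafeHandle pattern; in this sketch the DangerousAddRef/DangerousRelease calls are the real APIs while the surrounding method is illustrative:

```csharp
using System.Net.Sockets;

static class FdLifetimeSketch
{
    public static bool TryPinForSqe(SafeSocketHandle handle, out SafeSocketHandle? pinned)
    {
        bool success = false;
        handle.DangerousAddRef(ref success); // throws if the handle is already closed

        // The pinned handle is stored in the slot's cold storage entry; the matching
        // DangerousRelease happens in FreeCompletionSlot after the CQE retires, so
        // the fd cannot be reused by the OS while the kernel still references it.
        pinned = success ? handle : null;
        return success;
    }
}
```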

Stale CQE Protection

Generation-based ABA protection. Each completion slot starts at generation 1. On free, generation increments (wrapping from 2^43-1 to 1, skipping 0). CQE dispatch compares the CQE's encoded generation against the slot's current generation; mismatches are silently dropped as stale.

Zero-Copy Send Lifecycle

SEND_ZC produces two CQEs: a data completion and a NOTIF. The slot's IsZeroCopySend and ZeroCopyNotificationPending flags track this two-phase lifecycle. After the first CQE, the slot is kept alive and the tracked operation is reattached via TryReattachTrackedIoUringOperation (generation CAS from 0 to new generation, then operation CAS from null to operation). The NOTIF CQE triggers HandleZeroCopyNotification which frees the slot and releases the pin hold.
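
Routing the two CQEs comes down to the kernel's F_MORE/F_NOTIF flags, roughly as in this sketch (the flag constants match the kernel ABI; the handler delegates are illustrative):

```csharp
using System;

static class SendZcDispatchSketch
{
    private const uint IORING_CQE_F_MORE = 1u << 1;  // another CQE follows for this user_data
    private const uint IORING_CQE_F_NOTIF = 1u << 3; // buffer-release notification CQE

    public static void Dispatch(int res, uint flags, Action<int> completeSend, Action releasePinAndFreeSlot)
    {
        if ((flags & IORING_CQE_F_NOTIF) != 0)
        {
            // Second phase: the kernel no longer references the payload.
            releasePinAndFreeSlot();
            return;
        }

        // First phase: the data result. If F_MORE is set, the NOTIF CQE is still
        // pending and the slot (and pin hold) must stay alive.
        completeSend(res);
        if ((flags & IORING_CQE_F_MORE) == 0)
            releasePinAndFreeSlot(); // kernel posted no separate notification
    }
}
```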

Multishot Accept Arming

The _multishotAcceptState field uses a three-state protocol: 0 (disarmed), 1 (arming - SQE being written but user_data not yet published), or the encoded user_data value itself (armed). GetArmedMultishotAcceptUserDataForCancellation spins briefly if the arming transition is in flight.
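
A sketch of that protocol (field and method names are illustrative; 0 = disarmed, 1 = arming, anything else is the armed user_data):

```csharp
using System.Threading;

sealed class MultishotAcceptStateSketch
{
    private ulong _state;

    public bool TryBeginArming() =>
        Interlocked.CompareExchange(ref _state, 1UL, 0UL) == 0UL;

    public void PublishArmed(ulong userData) =>
        Volatile.Write(ref _state, userData);

    // Mirrors GetArmedMultishotAcceptUserDataForCancellation: spin briefly while
    // an arming transition is in flight.
    public bool TryGetArmedUserData(out ulong userData)
    {
        SpinWait spinner = default;
        while (true)
        {
            ulong state = Volatile.Read(ref _state);
            if (state == 0UL) { userData = 0; return false; }   // disarmed
            if (state != 1UL) { userData = state; return true; } // armed
            spinner.SpinOnce(); // arming: user_data not yet published
        }
    }
}
```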

Teardown Ordering

LinuxFreeIoUringResources follows a strict multi-phase teardown:

  1. Unregister provided buffer ring (needs ring fd)
  2. Mark registered ring fd inactive
  3. Close wakeup eventfd
  4. Unmap rings via CleanupManagedRings (also closes ring fd, terminating SQPOLL thread)
  5. Disable managed flags
  6. Drain queued operations (DrainQueuedIoUringOperationsForTeardown runs twice - once before and once after native port closure to catch late-arriving items)
  7. Drain tracked operations via DrainTrackedIoUringOperationsForTeardown
  8. Clear all aliasing pointers before NativeMemory.Free
  9. Zero all state fields and publish final diagnostics

CleanupManagedRings nulls all mmap-derived pointers before unmapping to prevent use-after-unmap.

Nullable Avoidance

The SQE retry drain path avoids wrapping SocketEventHandler (a struct) in Nullable<T>. Presence is tracked via a separate drainHandlerInitialized boolean, keeping the hot path free of the extra Nullable<T> copies and has-value checks.

SQE Size Validation

TryGetNextManagedSqe checks ringInfo.SqeSize != (uint)sizeof(IoUringSqe) at runtime, catching 128-byte SQE kernels that would corrupt the ring. TryMmapRings additionally rejects SetupSqe128 negotiations.


7. Performance Optimizations

CQ Head Advance Batching

Outside of overflow recovery, CQ head advances are deferred: _managedCachedCqHead is incremented locally and the single Volatile.Write to *_managedCqHeadPtr happens once at the end of the drain batch (in the finally block). During overflow recovery, advances happen per-CQE to relieve kernel pressure.
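
In sketch form (the pointers are illustrative stand-ins for the engine's mmap-derived ring fields, and masking the index by the ring size is elided):

```csharp
using System;
using System.Threading;

static unsafe class CqHeadAdvanceSketch
{
    public static int DrainBatch(uint* cqHeadPtr, uint* cqTailPtr, bool inOverflowRecovery,
                                 Action<uint> dispatchCqeAt)
    {
        uint cachedHead = Volatile.Read(ref *cqHeadPtr);
        int drained = 0;
        try
        {
            while (cachedHead != Volatile.Read(ref *cqTailPtr))
            {
                dispatchCqeAt(cachedHead); // dispatch the CQE at this ring position
                cachedHead++;
                drained++;

                // Overflow recovery: publish per-CQE so the kernel can flush its
                // overflow backlog into freed ring slots immediately.
                if (inOverflowRecovery)
                    Volatile.Write(ref *cqHeadPtr, cachedHead);
            }
        }
        finally
        {
            // Steady state: a single publish for the whole batch.
            Volatile.Write(ref *cqHeadPtr, cachedHead);
        }
        return drained;
    }
}
```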

SQE Zeroing

Each TryGetNextManagedSqe call writes Unsafe.WriteUnaligned(sqe, default(IoUringSqe)) for JIT-vectorized 64-byte zeroing before returning the SQE. This eliminates stale field concerns and enables each Write*Sqe method to write only the fields it needs.

SQE Writer Deduplication

Send-like operations share WriteSendLikeSqe (differing only by opcode: Send vs SendZc). Sendmsg-like operations share WriteSendMsgLikeSqe (SendMsg vs SendMsgZc). This reduces copy-paste without sacrificing readability.

SQE Acquire With Retry

TryAcquireManagedSqeWithRetry attempts up to MaxIoUringSqeAcquireSubmitAttempts (16) rounds. Between retries, it runs DrainCqeRingBatch to free CQ slots, then submits pending SQEs. The drain handler is lazily initialized to avoid struct construction on the fast path.

Completion Slot Drain Recovery

When AllocateCompletionSlot returns -1 (pool exhausted), the engine drains CQEs inline (guarded by _completionSlotDrainInProgress to prevent recursion) and retries allocation.

Provided Buffer Deferred Recycle

BeginDeferredRecyclePublish/EndDeferredRecyclePublish bracket the CQE drain loop. Buffer descriptor writes accumulate without individual Volatile.Write tail publishes. A single tail publish happens at EndDeferredRecyclePublish.
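
The bracket pattern, sketched (the class and members are illustrative stand-ins for IoUringProvidedBufferRing; the kernel's buffer ring tail is 16-bit):

```csharp
using System.Threading;

sealed unsafe class DeferredRecycleSketch
{
    private readonly ushort* _bufferRingTailPtr; // mmap'd tail of the kernel buffer ring
    private ushort _cachedTail;
    private bool _deferPublish;

    public DeferredRecycleSketch(ushort* bufferRingTailPtr) => _bufferRingTailPtr = bufferRingTailPtr;

    public void BeginDeferredRecyclePublish() => _deferPublish = true;

    public void RecycleBuffer()
    {
        // The buffer descriptor is written into the ring slot at _cachedTail here, then:
        _cachedTail++;
        if (!_deferPublish)
            Volatile.Write(ref *_bufferRingTailPtr, _cachedTail); // immediate publish
    }

    public void EndDeferredRecyclePublish()
    {
        _deferPublish = false;
        Volatile.Write(ref *_bufferRingTailPtr, _cachedTail); // one publish per drain batch
    }
}
```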

Diagnostics Polling

Diagnostic counters are polled every IoUringDiagnosticsPollInterval (64) event loop iterations, not on every CQE. Managed deltas are accumulated in per-engine fields and published in batch to SocketsTelemetry.

Lazy Lock Allocation

_multishotAcceptQueueGate and _persistentMultishotRecvDataGate on SocketAsyncContext are lazy-initialized via EnsureLockInitialized (CAS from null). Most sockets never use these paths, so the Lock objects are only allocated when needed.

Event Loop Wait

The event loop first tries a non-blocking DrainCqeRingBatch. If no CQEs are available, it issues io_uring_enter with GETEVENTS and a 50ms EXT_ARG timeout (bounded wait). This trades worst-case 50ms latency for starvation resilience when eventfd wakes are missed or deferred.


8. Telemetry and Observability

Stable PollingCounters (10)

Published when the EventSource is enabled on Linux. Counter names are centralized in IoUringCounterNames:

| Counter | What to watch for |
| --- | --- |
| io-uring-prepare-nonpinnable-fallbacks | Operations that couldn't use direct preparation |
| io-uring-socket-event-buffer-full | Event buffer capacity pressure |
| io-uring-cq-overflows | Event loop can't keep up with kernel completions |
| io-uring-cq-overflow-recoveries | Successful overflow recovery completions |
| io-uring-prepare-queue-overflows | Submission queue capacity pressure |
| io-uring-prepare-queue-overflow-fallbacks | Operations that fell back to readiness dispatch |
| io-uring-completion-slot-exhaustions | Slot capacity pressure |
| io-uring-provided-buffer-depletions | Provided buffer ring ran out of buffers |
| io-uring-sqpoll-wakeups | SQPOLL kernel thread wakeups from idle |
| io-uring-sqpoll-submissions-skipped | Zero-syscall fast path hits (SQPOLL) |

Diagnostic Backing Fields (17)

Written internally for structured logging and test access. Not published as PollingCounters. Include:

  • Async cancel CQE counts
  • Completion requeue failures
  • Zero-copy notification pending slots gauge
  • Prepare queue depth
  • Completion slot drain recoveries
  • Provided buffer current size, recycles, resizes
  • Registered buffer initial/re-registration success and failure
  • Fixed recv selected/fallbacks
  • Persistent multishot recv reuse, termination, early data

Startup Events

  • ReportIoUringResolvedConfiguration: Logged once with all resolved config inputs
  • ReportSocketEngineBackendSelected (event ID 7): Reports io_uring vs. epoll selection and SQPOLL status
  • ReportIoUringSqPollNegotiatedWarning: WARNING-level when SQPOLL is negotiated

Structured Logging

IoUringDiagnostics.Linux.cs centralizes all log helpers with NetEventSource.Info/Error:

  • CQ overflow detection and recovery (with branch discriminator)
  • Async cancel prepare/submit failures (with teardown/runtime origin)
  • Queue overflow events
  • Teardown summary (benign late completion count)
  • Advanced feature state snapshot
  • Untrack mismatches

Collectible via dotnet-counters, dotnet-trace, or any OpenTelemetry-compatible collector.


9. Test Coverage

Test Access Architecture

The test project does not use InternalsVisibleTo. Instead:

  1. IoUringTestAccessors.Linux.cs (938 lines) defines all test-visible snapshot types and accessor methods inside SocketAsyncEngine (production assembly)
  2. InternalTestShims.Linux.cs (644 lines) in the test project mirrors these types and resolves them via reflection
  3. A [DynamicDependency(DynamicallyAccessedMemberTypes.All, "System.Net.Sockets.SocketAsyncEngine", "System.Net.Sockets")] attribute preserves all targets under trimming and AOT

Test Suite (132 test methods across 6,665 lines)

Coverage areas:

  • All operation types: send, recv, accept, connect, sendmsg, recvmsg
  • Completion mode vs. fallback: forced-fallback tests via environment variables
  • Per-opcode disable: env-var-driven opcode disabling for isolation
  • Forced-result injection: EAGAIN and ECANCELED injection per opcode (#if DEBUG)
  • Multishot accept: basic flow, cancellation, queue drain, dispose-during-arming race, one-shot fallback (deterministic via reflection override)
  • Multishot recv: basic iteration, cancellation, peer close, early data buffering, multishot gating by socket type (datagram exclusion)
  • Provided buffers: depletion, recycling, adaptive sizing, registered buffer toggle
  • Zero-copy send: threshold behavior, notification lifecycle, mixed mode
  • SQPOLL mode: basic send/receive, fallback, idle wakeup, multishot recv, zero-copy send, telemetry, SQ_NEED_WAKEUP contract (7 dedicated tests)
  • CQ overflow recovery: five-test suite covering all three branches
    • Test 1: inject overflow, verify telemetry counter increment and slot/op settlement
    • Test 2 (branch a): multishot accept arming during overflow - no silent drop
    • Test 3 (branch b): teardown under overflow - no deadlock within 60s
    • Test 4: DEBUG single-issuer assertion fires on non-event-loop thread
    • Test 5 (branch c): sustained 10s adversarial overflow injection with concurrent workload
  • Layout contracts: NativeMsghdrLayoutContract_IsStable and CompletionSlotLayoutContract_IsStable verify ABI alignment via reflection
  • Reflection target stability: CqOverflow_ReflectionTargets_Stable ensures field names are documented and stable
  • CLOEXEC: RingFd_HasCloexecFlag_Set verifies the FD_CLOEXEC bit via fcntl
  • ARM64 and concurrency: ARM64 MPSC stress, generation-transition stress, concurrent resize-swap
  • Cancellation: concurrent cancel/submit contention, teardown drain
  • Buffer pressure: bounded queue capacity, slot exhaustion recovery
  • Telemetry: stable counter name contract validation, counter increment verification
  • Config: dual opt-in SQPOLL validation, removed-knobs-default-enabled verification
  • Teardown: clean shutdown, resource cleanup
  • Non-pinnable fallback publication: concurrent publisher stress test via reflection shim

Hard to Test In-Process

  • True CQ overflow (requires kernel-level timing control; mitigated by managed overflow counter injection via reflection)
  • RLIMIT_MEMLOCK failures (requires container-level constraints)
  • Kernel version degradation (requires multiple kernel environments)
  • SQPOLL CPU consumption (requires system-level profiling)
  • Real-world latency distributions (requires benchmark infrastructure)

10. Graceful Degradation

| Condition | Behavior |
| --- | --- |
| Kernel < 6.1 | Epoll used |
| Env var not set to "1" (and no AppContext switch) | Epoll used |
| io_uring_setup fails | Epoll fallback |
| SQPOLL not supported (EINVAL or EPERM) | Flag peeled; DEFER_TASKRUN added; engine continues |
| NO_SQARRAY unsupported | Flag peeled; SQ array identity-mapped |
| CLOEXEC unsupported | Flag peeled; fcntl FD_CLOEXEC fallback |
| Opcode probe fails | Advanced opcodes disabled; basic ops still work |
| Provided buffer ring fails | Multishot recv disabled; one-shot recv with inline buffers |
| RLIMIT_MEMLOCK prevents buffer registration | Engine continues without registered buffers |
| Completion slot exhaustion | Drain CQEs inline; retry allocation; fall back to readiness dispatch |
| Prepare queue overflow | Fall back to readiness dispatch for the overflowed op |
| CQ overflow detected | Three-branch recovery state machine; delayed stale sweep |
| SQE ring full | Retry with intermediate submit + CQ drain (up to 16 attempts) |
| NativeMsghdr layout unsupported (non-64-bit) | io_uring disabled entirely |

11. Path to Default-On

  1. Opt-in environment variable (current state)
  2. Extensive testing (CI, stress tests, TechEmpower)
  3. Default-on for kernel >= 6.1 with runtime capability detection
  4. Remove the gate; io_uring is the Linux backend

SQPOLL will likely remain opt-in permanently due to its CPU cost trade-off.

Future Kernel Features

  • Incremental buffer rings (kernel 6.12+): Partial buffer consumption without full ring cycle
  • RecvSend bundles (kernel 6.10+): Single SQE performs recv then send
  • Zero-copy RX (kernel 6.7+): True zero-copy receive sharing NIC ring buffers

12. Distribution Readiness

Kernel Version Matrix

The minimum kernel cutoff is a single 6.1 requirement. All sub-features are detected at runtime via opcode probing.

| Distribution | Version | Kernel | io_uring (6.1+) |
| --- | --- | --- | --- |
| Ubuntu 24.04 LTS | GA | 6.8 | Yes |
| Ubuntu 22.04 LTS | GA | 5.15 | No (epoll fallback) |
| Ubuntu 22.04 LTS | HWE | 6.8 | Yes |
| RHEL 10 | GA | 6.12 | Yes |
| RHEL 9 | GA | 5.14 | No (epoll fallback) |
| Debian 13 (Trixie) | GA | 6.12 | Yes |
| Debian 12 (Bookworm) | GA | 6.1 | Yes |
| Azure Linux 3 | GA | 6.6 | Yes |
| Amazon Linux 2023 | Default | 6.1 | Yes |
| Amazon Linux 2 | Default | 5.10 | No (epoll fallback) |

Memory Overhead

| Component | Size | Notes |
| --- | --- | --- |
| SQ ring | ~16KB | 1024 entries |
| CQ ring | ~64KB | 4096 entries (4x SQ) |
| SQE array | ~64KB | 1024 entries * 64B |
| Provided buffer pool | ~4MB | 1024 * 4KB default |
| Completion slots (hot) | ~256KB | 8192 slots * 32B |
| Completion slot storage (cold) | varies | Managed object array |
| Native per-slot slab | varies | NativeMemory.AllocZeroed |
| Zero-copy pin holds | ~64KB | 8192 * sizeof(MemoryHandle) |
| Total | ~5.5MB+ | Per engine instance (userspace) |

13. Conclusion

This implementation delivers a complete io_uring integration with:

  • 9 partial class files totaling ~7,200 lines for the engine alone, plus 2,721 lines for per-context operation lifecycle, 1,013 lines for the provided buffer ring, 294 lines for the MPSC queue, and 687 lines for telemetry
  • 132 tests across 6,665 lines with full reflection-shim test access pattern
  • 10 stable telemetry counters plus 17 diagnostic backing fields
  • Three-branch CQ overflow recovery with post-recovery stale operation sweep
  • #if DEBUG-gated test hooks for deterministic failure injection
  • [FeatureSwitchDefinition] annotations for JIT elimination of SQPOLL branches
  • Comprehensive fd lifetime management via DangerousAddRef/DangerousRelease
  • Layout contract tests for ABI stability

The managed-ring architecture (minimal native shim + C# ring management) trades a small initial complexity cost for long-term maintainability: standard .NET breakpoints, managed stack traces, EventSource telemetry, and xUnit tests in the same language as the implementation.

The code is production-ready with the current opt-in gate. The environment variable requirement is appropriate for the initial release. Graceful degradation means unexpected issues fall back to the proven epoll path.

Copilot AI review requested due to automatic review settings February 13, 2026 11:18
@dotnet-policy-service bot added the community-contribution label (Indicates that the PR has been added by a community member) Feb 13, 2026
Copilot AI (Contributor) left a comment

Pull request overview

This PR implements an experimental opt-in io_uring-backed socket event engine for Linux as an alternative to epoll. The implementation is comprehensive, including both readiness-based polling (Phase 1) and completion-based I/O operations (Phase 2), along with extensive testing infrastructure and evidence collection tooling.

Changes:

  • Native layer: cmake configuration, PAL networking headers, and io_uring system call integration with graceful epoll fallback
  • Managed layer: socket async engine extensions for io_uring completion handling, operation lifecycle tracking, buffer pinning, and telemetry
  • Testing: comprehensive functional tests, layout contract validation, stress tests, and CI infrastructure for dual-mode test execution
  • Tooling: evidence collection and validation scripts for performance comparison and envelope testing

Reviewed changes

Copilot reviewed 17 out of 18 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/native/libs/configure.cmake | Adds CMake configuration checks for io_uring header and poll32_events struct member |
| src/native/libs/System.Native/pal_networking.h | Defines new io_uring interop structures (IoUringCompletion, IoUringSocketEventPortDiagnostics) and function signatures |
| src/native/libs/System.Native/entrypoints.c | Registers new io_uring-related PAL export entry points |
| src/native/libs/Common/pal_config.h.in | Adds CMake defines for io_uring feature detection |
| src/libraries/System.Net.Sockets/tests/FunctionalTests/TelemetryTest.cs | Adds layout contract tests for io_uring interop structures and telemetry counter verification |
| src/libraries/System.Net.Sockets/tests/FunctionalTests/System.Net.Sockets.Tests.csproj | Implements MSBuild infrastructure for creating io_uring test archive variants (enabled/disabled/default) |
| src/libraries/System.Net.Sockets/tests/FunctionalTests/IoUring.Unix.cs | Adds comprehensive functional and stress tests for io_uring socket workflows |
| src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketsTelemetry.cs | Adds 12 new PollingCounters for io_uring observability metrics |
| src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketPal.Unix.cs | Implements managed wrappers for io_uring prepare operations with error handling |
| src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEngine.Unix.cs | Core io_uring integration: submission batching, completion handling, operation tracking, and diagnostics polling |
| src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs | Operation-level io_uring support: buffer pinning, user_data allocation, completion processing, and state machine |
| src/libraries/Common/src/Interop/Unix/System.Native/Interop.SocketEvent.cs | Defines managed interop structures matching native layout for io_uring operations |
| eng/testing/io-uring/validate-collect-sockets-io-uring-evidence-smoke.sh | Smoke validation script for evidence collection tooling |
| eng/testing/io-uring/collect-sockets-io-uring-evidence.sh | Comprehensive evidence collection script for functional/perf validation and envelope testing |
| docs/workflow/testing/libraries/testing.md | Adds references to io_uring-specific documentation |
| docs/workflow/testing/libraries/testing-linux-sockets-io-uring.md | Detailed validation guide for io_uring backend testing |
| docs/workflow/testing/libraries/io-uring-pr-evidence-template.md | PR evidence template for documenting io_uring validation results |

Copilot AI review requested due to automatic review settings February 13, 2026 11:59
Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 18 changed files in this pull request and generated 1 comment.

Copilot AI review requested due to automatic review settings February 13, 2026 12:51
Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 18 changed files in this pull request and generated 1 comment.

Copilot AI review requested due to automatic review settings February 13, 2026 14:18
Copilot AI left a comment

Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated no new comments.

Copilot AI left a comment

Pull request overview

Copilot reviewed 20 out of 21 changed files in this pull request and generated no new comments.

Copilot AI review requested due to automatic review settings February 14, 2026 01:22
Copilot AI left a comment

Pull request overview

Copilot reviewed 18 out of 20 changed files in this pull request and generated 7 comments.

Copilot AI review requested due to automatic review settings February 14, 2026 05:21
Copilot AI review requested due to automatic review settings February 19, 2026 14:17
Copilot AI left a comment

Pull request overview

Copilot reviewed 28 out of 31 changed files in this pull request and generated no new comments.

…etry batching, packed capabilities, and native shim hardening

Merge-blocking:
- Set SOCK_CLOEXEC | SOCK_NONBLOCK on accept SQEs via AcceptFlags constant
- Add EPERM to IsIgnoredIoUringSubmitError and drain rejected SQEs as failed
  completions instead of re-queuing them (breaks infinite-retry spin)
- Replace SEND_ZC reattach FailFast with Debug.Fail, slot cleanup, and error
  completion; add generation check in HandleZeroCopyNotification

Performance:
- Batch per-CQE telemetry (depletion/recycle/early-data) into drain-batch
  accumulators flushed once per DrainCqeRingBatch
- Replace Interlocked.CompareExchange pair in TryTrackPreparedIoUringOperation
  with Volatile.Read/Write (event-loop-only path)
- Move 5 permanently-true SQE ring invariant checks to one-time init validation
- Convert CQE tag dispatch from switch to if-chain for branch prediction
- Clear only NativeMsghdr header instead of full native storage stride
- Copy buffered multishot recv data outside lock

Capabilities and telemetry:
- Convert LinuxIoUringCapabilities from 8-bool positional struct to packed
  uint flags with fluent With* mutators
- Add slot high-water-mark and cancellation-queue-overflow production counters
- Add capacity planning comments near SlotIndexBits
- Add Debug.Assert on non-EXT_ARG fallback path

Security and resilience:
- Add BitOperations.IsPow2 asserts on kernel-reported ring sizes in TryMmapRings
- Add c_static_assert(sizeof(size_t) >= 8) in native shim
- Add ringFd < 0 validation at all native shim entry points
- Wrap DangerousRelease in try/finally in FreeCompletionSlot
- Block provided-buffer resize during CQ overflow recovery
- Fix FreeIoUringProvidedBufferRing transient inconsistent capability state
- Guard Dispose against freeing registered ring memory
- Replace _persistentMultishotRecvDataQueueCount with computed property
- Use Volatile.Write for teardown TrackedOperationGeneration clear
- Silently ignore TagNone CQEs from ASYNC_CANCEL completions
- Add EINTR comment on native shim CloseFd

Tests:
- Add accepted-socket FD_CLOEXEC and O_NONBLOCK verification test
- Add forced-submit-EPERM graceful degradation test
…buffer ring group ID hardening, sweep re-arm cap, wake circuit-breaker, and test coverage

MpscQueue:
- Co-locate Items and States into single SegmentEntry[] array for cache locality
- Add TryEnqueue with bounded retry (MaxEnqueueSlowAttempts=2048) and SpinWait
  backoff; catch OOM in RentUnlinkedSegment
- Handle TryEnqueue failure at prepare-queue and cancel-queue call sites
- Remove AggressiveInlining from lock-containing Rent/ReturnUnlinkedSegment
- Promote ARM64 concurrent stress test from OuterLoop to regular CI

Code quality:
- Collapse redundant WriteSendSqe/WriteSendZcSqe and WriteSendMsgSqe/
  WriteSendMsgZcSqe wrappers; call WriteSendLikeSqe/WriteSendMsgLikeSqe directly
- Replace string-based telemetry test hook dispatch with IoUringCounterFieldForTest
  enum for compile-time safety
- Centralize 11 debug test env vars into IoUringTestEnvironmentVariables class
- Move s_ioUringResolvedConfigurationLogged to per-engine instance field
- Add SQE zeroing socket-only assumption comment

Resilience:
- Replace fragile group ID toggle (1/2) with sequential allocation starting at
  0x8000 to avoid collision with other io_uring users
- Cap CQ overflow stale-tracked sweep re-arms at 8 with diagnostic log
- Add eventfd wake failure circuit-breaker: after 8 consecutive failures, reduce
  completion wait timeout from 50ms to 1ms; reset on successful wake
- Null out multishot accept sockaddr pointer to eliminate shared-buffer race

Performance:
- Add _nextPreparedReceivePostedWordHint for O(1) common-case bitset search in
  TryAcquireBufferForPreparedReceive; update hint on recycle and acquisition
- Remove AggressiveInlining from IsProvidedBufferResizeQuiescent
- Bound TryAcquireBufferForPreparedReceive retry by word count instead of ring size

Tests:
- CQ overflow recovery with zero tracked operations
- Wakeup eventfd FD_CLOEXEC verification
- SQPOLL + DEFER_TASKRUN mutual exclusivity assertion
- NativeMsghdr 32-bit rejection path
- UDP oversized datagram with zero-length ReceiveFrom buffer
- CounterDelta monotonicity assertion (replaces silent underflow)
- Clarify zero-copy small-buffer test name and forced-error intent
…oA split, cancellation batching, configuration centralization, registered ring fd EINVAL fallback, and test coverage

- Convert static io_uring counters to per-engine instance fields with aggregation
- Group 20+ managed ring mmap fields into ManagedRingState struct with property accessors
- Split TrackedOperation/TrackedOperationGeneration into separate IoUringTrackedOperationState array for cache locality; shrink IoUringCompletionSlot from 32 to 24 bytes
- Batch ProcessCancellation ThreadPool callbacks via static ConcurrentQueue with cooperative worker drain
- Replace ConcurrentQueue<SocketIOEvent> with MpscQueue on Linux via SocketIOEventQueue wrapper
- Centralize configuration resolution into IoUringConfigurationInputs with contradiction validation warnings
- Collapse CounterPair struct into static TryPublishManagedCounterDelta method
- Add registered ring fd EINVAL fallback on all four io_uring_enter call sites (submit, SQPOLL wakeup, EXT_ARG wait, non-EXT_ARG wait)
- Treat kernel EINVAL from submit as drainable error; convert internal invariant violations to ThrowInternalException(string) to bypass drain
- Add MpscQueue drained-segment recycling with slow-path-only producer quiescence tracking
- Add provided-buffer ring OOM test hook and EINTR retry limit test hook in native shim
- Replace ThrowInternalException with Debug.Fail at unreachable/defensive sites in slots and dispatch
- Add tests for generation wrap-around dispatch, fork/exec close-on-exec, queue saturation, slot capacity stress, kernel version fallback, cancellation routing, and MpscQueue OOM recovery
Copilot AI review requested due to automatic review settings February 19, 2026 19:39
Copilot AI left a comment

Pull request overview

Copilot reviewed 29 out of 32 changed files in this pull request and generated 1 comment.

Comment on lines +368 to +370:

```c
// Layout assertions for managed interop structs (kernel struct mirrors).
c_static_assert(sizeof(size_t) >= 8);
c_static_assert(sizeof(size_t) == sizeof(void*));
```

Copilot AI commented Feb 19, 2026:

c_static_assert(sizeof(size_t) >= 8) (and the following pointer-size asserts) will fail compilation on 32-bit Linux, even if io_uring is meant to be disabled there. Consider gating SHIM_HAVE_IO_URING (or just these layout asserts) on 64-bit (e.g., __SIZEOF_POINTER__ == 8) so System.Native still builds for 32-bit targets and the shim can fall back to the stub implementations.

Suggested change:

```c
// Layout assertions for managed interop structs (kernel struct mirrors).
#if defined(__SIZEOF_POINTER__) && __SIZEOF_POINTER__ == 8
c_static_assert(sizeof(size_t) >= 8);
c_static_assert(sizeof(size_t) == sizeof(void*));
#endif
```

…submitter_task, drain all non-EFAULT submit errors, and test coverage
…AULT submit errors, EINVAL registered-ring-fd fallback, and source-specific error context
Copilot AI review requested due to automatic review settings February 19, 2026 20:49
Copilot AI left a comment

Pull request overview

Copilot reviewed 29 out of 32 changed files in this pull request and generated no new comments.

```csharp
if (next is not null)
{
    Interlocked.CompareExchange(ref _tail.Value, next, tail);
}
```

Are we assuming that _tail.Value is eventually consistent? Otherwise, I believe this scenario could end up with the tail having some invalid value.

At the top you grab the current tail:

```csharp
Segment tail = Volatile.Read(ref _tail.Value)!;
```

Then, if the entry array is full, you continue to create a new tail. If that fails you "refresh" the next variable to the current next:

```csharp
next = Volatile.Read(ref tail.Next);
```

Now, assume that your thread context switches out at this point and some other thread(s) enqueue a bunch of items that cause a new tail to be added.

Then we context switch back in, and since tail and _tail.Value are not the same you will set _tail.Value to next, but the value of next points to the previous tail.

Maybe this was addressed by this part of the description:

> Segment recycling limited to segments that lost the tail-link CAS race (never previously published), avoiding the need for producer quiescence tracking

```csharp
while (true)
{
    Segment tail = Volatile.Read(ref _tail.Value)!;
    int index = Interlocked.Increment(ref tail.EnqueueIndex.Value) - 1;
```
@deathly809 commented Feb 19, 2026:

Based on my comment below (or above, if reading from the discussion page): could we cause a race condition that keeps resetting _tail.Value to a previous value? And if we keep incrementing this value, do we hit an integer overflow?

```csharp
{
    get
    {
        Segment head = Volatile.Read(ref _head.Value)!;
```

This seems pretty computationally heavy; is there a reason you can't just have a single _count variable that you atomically increment/decrement and check for 0 here?

```csharp
        fixedRecvBufferId,
        ref completionAuxiliaryData))
{
    completionResultCode = -Interop.Sys.ConvertErrorPalToPlatform(Interop.Error.ENOBUFS);
```

Why the negation? I see you do it below as well. I did a quick search around the repo and only saw this referenced in one other place; they did not negate, and the folks referencing that code don't appear to be negating either.

```c
int32_t state = atomic_load_explicit(&s_forceEnterEintrRetryLimitOnce, memory_order_relaxed);
if (state < 0)
{
    const char* configuredValue = getenv(SHIM_TEST_FORCE_ENTER_EINTR_RETRY_LIMIT_ONCE_ENV);
```

Should this be behind a #ifdef DEBUG?

```csharp
private const string ConnectActivityName = ActivitySourceName + ".Connect";
private static readonly ActivitySource s_connectActivitySource = new ActivitySource(ActivitySourceName);

internal static class Keywords
```

Maybe IoUringKeywords would be a more descriptive name.

```csharp
#if DEBUG
// Test-only knob to make wait-buffer saturation deterministic for io_uring diagnostics coverage.
// Only available in DEBUG builds so production code never reads test env vars.
if (OperatingSystem.IsLinux())
```

@deathly809 commented Feb 19, 2026:

Should you also check DOTNET_SYSTEM_NET_SOCKETS_IO_URING or do we assume that DOTNET_SYSTEM_NET_SOCKETS_IO_URING_TEST_EVENT_BUFFER_COUNT is only set when the feature flag is enabled?

```csharp
try
{
    RecordAndAssertEventLoopThreadIdentity();
    LinuxEventLoopEnableRings();
```

@deathly809 commented Feb 19, 2026:

Wonder if these could be more generic, i.e.:

  • LinuxEventLoopEnableRings -> EventLoopInit
  • LinuxEventLoopBeforeWait -> EventLoopBeforeWait
  • LinuxEventLoopTryCompletionWait -> EventLoopTryCompleteWait
  • etc.

Hmm, I guess it would be an issue if someone wanted to add their own "EventLoopInit" or equivalent for the other methods :)

```csharp
}
else
{
    Debug.Assert(
```

Does this mean we have not tested this on kernels before 6.1?

```csharp
{
    // Snapshot the wakeup generation counter before entering the blocking syscall.
    // After waking, we compare to detect wakeups that arrived during the syscall.
    uint wakeGenBefore = Volatile.Read(ref _ioUringWakeupGeneration);
```

You'll need to define this outside the if statement so you can reference it after the if/else.

Labels: area-System.Net.Sockets, community-contribution