Skip to content

Fix null sqe crash in ReadAsync when io_uring submission queue is full#14521

Open
archang19 wants to merge 1 commit intofacebook:mainfrom
archang19:export-D98533853
Open

Fix null sqe crash in ReadAsync when io_uring submission queue is full#14521
archang19 wants to merge 1 commit intofacebook:mainfrom
archang19:export-D98533853

Conversation

@archang19
Copy link
Copy Markdown
Contributor

Summary:
ReadAsync calls io_uring_get_sqe() without checking for nullptr. When the
io_uring submission queue is full (outstanding completions not yet reaped),
io_uring_get_sqe returns NULL and the subsequent io_uring_prep_readv
dereferences it, causing a segfault.

MultiRead already handles this correctly by using io_uring_sq_space_left()
to cap submissions. ReadAsync submits exactly one SQE per call so a simple
null check with error return is sufficient.

On null sqe, clean up the already-allocated Posix_IOHandle and return
IOStatus::Busy so the caller can retry after reaping completions.

The io_uring queue depth is kIoUringDepth (256), and each thread gets its
own io_uring instance via thread-local storage. In practice the SQ rarely
fills because ReadAsync calls io_uring_submit() after each io_uring_get_sqe(),
immediately flushing the SQE to the kernel. The null SQE would only occur
under unusual kernel backpressure where the kernel cannot consume from the
SQ ring fast enough.

IOStatus::Busy was chosen (over IOError) because this is a transient
condition. The caller has two options:

  1. Call Poll() to reap outstanding completions from the CQ, then retry
    ReadAsync. This mirrors how MultiRead handles queue pressure internally
    by capping submissions and reaping between batches.
  2. Fall back to synchronous Read(). Existing callers (FilePrefetchBuffer,
    IODispatcher) already have synchronous fallback paths for non-OK
    ReadAsync status, so IOStatus::Busy naturally triggers that fallback
    without additional code changes. Given the rarity of this condition,
    the synchronous fallback is pragmatic and avoids adding retry complexity.

Also adds a TEST_SYNC_POINT_CALLBACK on io_uring_get_sqe to enable test
injection, and a new ReadAsyncQueueFull unit test that uses SyncPoint to
force a null SQE and verifies the Busy return, handle cleanup, and no crash.

Differential Revision: D98533853

Summary:
ReadAsync calls io_uring_get_sqe() without checking for nullptr. When the
io_uring submission queue is full (outstanding completions not yet reaped),
io_uring_get_sqe returns NULL and the subsequent io_uring_prep_readv
dereferences it, causing a segfault.

MultiRead already handles this correctly by using io_uring_sq_space_left()
to cap submissions. ReadAsync submits exactly one SQE per call so a simple
null check with error return is sufficient.

On null sqe, clean up the already-allocated Posix_IOHandle and return
IOStatus::Busy so the caller can retry after reaping completions.

The io_uring queue depth is kIoUringDepth (256), and each thread gets its
own io_uring instance via thread-local storage. In practice the SQ rarely
fills because ReadAsync calls io_uring_submit() after each io_uring_get_sqe(),
immediately flushing the SQE to the kernel. The null SQE would only occur
under unusual kernel backpressure where the kernel cannot consume from the
SQ ring fast enough.

IOStatus::Busy was chosen (over IOError) because this is a transient
condition. The caller has two options:
1. Call Poll() to reap outstanding completions from the CQ, then retry
   ReadAsync. This mirrors how MultiRead handles queue pressure internally
   by capping submissions and reaping between batches.
2. Fall back to synchronous Read(). Existing callers (FilePrefetchBuffer,
   IODispatcher) already have synchronous fallback paths for non-OK
   ReadAsync status, so IOStatus::Busy naturally triggers that fallback
   without additional code changes. Given the rarity of this condition,
   the synchronous fallback is pragmatic and avoids adding retry complexity.

Also adds a TEST_SYNC_POINT_CALLBACK on io_uring_get_sqe to enable test
injection, and a new ReadAsyncQueueFull unit test that uses SyncPoint to
force a null SQE and verifies the Busy return, handle cleanup, and no crash.

Differential Revision: D98533853
@meta-cla meta-cla bot added the CLA Signed label Mar 27, 2026
@meta-codesync
Copy link
Copy Markdown

meta-codesync bot commented Mar 27, 2026

@archang19 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D98533853.

@github-actions
Copy link
Copy Markdown

✅ clang-tidy: No findings on changed lines

Completed in 130.1s.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant