Merged
Conversation
**landonxjames** approved these changes (Mar 3, 2026)
```rust
    self.pending.pop_front()
}

pub(super) fn mark_in_flight(&mut self) {
```
Contributor:

Looks like `mark_in_flight` is always called when `pop` returns `Some`. Could it be eliminated in favor of `pop` doing the update, so callers don't have to remember to manually update the in-flight count?
Contributor (Author):

I chose to keep them separate; it allows the execution layer to dequeue without marking the item in-flight (e.g. to batch).
**ysaito1001** reviewed (Mar 3, 2026)
Contributor:

To aid code review (for me at least), here are Kiro-generated diagrams showing how the scheduler submodules interact (static view) and the single download transfer lifecycle (runtime view). *(Diagrams with "Flow" and "Critical flow points" annotations not reproduced here.)*
**ysaito1001** approved these changes (Mar 4, 2026)
# vNext Scheduler

## What this PR adds
A new scheduler module (`scheduler/`) and its design doc (`docs/design/scheduler.md`). The scheduler is the core coordination layer for the transfer manager redesign. It holds transfers, polls them for work, controls ordering and admission, and submits work for execution. Nothing in the existing upload/download paths changes -- the new scheduler sits alongside the existing `runtime::scheduler` and is not yet wired into any operations.

This is the first PR in a series that lands the redesign incrementally:

- vnext-upload: Upload state machine on the new scheduler
- vnext-download: Download state machine + seq window backpressure
- vnext-integration: Wire into `Handle`, remove old code paths

## Why
The current scheduler (`runtime::scheduler`) is a token-bucket concurrency limiter. It gates how many tasks can run concurrently, but has no opinion about which tasks run, in what order, or how to balance work across transfers. Concurrency is controlled via Tower middleware (`ConcurrencyLimitLayer`), and work is spawned eagerly into a `JoinSet`. Once spawned, we lose control over ordering, priority, and cancellation.

This works for the simple case (one transfer, fixed concurrency), but breaks down for:
CRT's scheduler (aws-c-s3) solves some of these -- it uses a state machine vtable pattern and runs scheduling on a dedicated event loop -- but has no priority system and uses fixed concurrency derived from a throughput target.
## Design
The design doc (rendered) has the full design. tl;dr:
**Transfers as state machines.** Each transfer implements `trait Transfer` with `poll_work()` and `execute()`. The scheduler polls transfers for work when capacity is available. Transfers produce work lazily -- a 10,000-part upload generates one work item per poll. This naturally bounds memory and in-flight work without the scheduler knowing anything about the operation's internals.

**CFS fair scheduling.** Adapted from the Linux kernel's Completely Fair Scheduler. Each transfer accumulates virtual runtime as it generates work. The ready set is ordered by vruntime, so the scheduler always picks the transfer that has received the least scheduling share. Priority acts as a weight on the accumulation rate -- higher priority means slower accumulation, so more work before yielding. A priority-1 transfer still makes progress because its vruntime stays low while it waits.
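To make the vruntime mechanics concrete, here is a minimal sketch of CFS-style accounting. The names (`ReadySet`, `charge`) and the integer cost model are illustrative assumptions, not the PR's actual types:

```rust
use std::collections::BTreeSet;

/// Illustrative ready set: ordered by (vruntime, transfer id), so the
/// transfer with the least accumulated scheduling share sorts first.
struct ReadySet {
    by_vruntime: BTreeSet<(u64, u64)>,
}

impl ReadySet {
    fn new() -> Self {
        Self { by_vruntime: BTreeSet::new() }
    }

    fn enqueue(&mut self, id: u64, vruntime: u64) {
        self.by_vruntime.insert((vruntime, id));
    }

    /// Pick the transfer that has received the least scheduling share.
    fn pick_next(&mut self) -> Option<(u64, u64)> {
        let (vr, id) = self.by_vruntime.pop_first()?;
        Some((id, vr))
    }
}

/// Priority weights the accumulation rate: one unit of work costs
/// `base_cost / priority` vruntime, so higher priority accumulates slower.
fn charge(vruntime: u64, base_cost: u64, priority: u64) -> u64 {
    vruntime + base_cost / priority.max(1)
}

fn main() {
    let mut ready = ReadySet::new();
    ready.enqueue(1, 0); // priority-4 transfer
    ready.enqueue(2, 0); // priority-1 transfer

    let mut picks = [0u32; 2];
    for _ in 0..5 {
        let (id, vr) = ready.pick_next().unwrap();
        picks[(id - 1) as usize] += 1;
        let priority = if id == 1 { 4 } else { 1 };
        ready.enqueue(id, charge(vr, 100, priority));
    }
    // The priority-4 transfer is picked 4x as often, but the priority-1
    // transfer is never starved: its vruntime stays low while it waits.
    assert_eq!(picks, [4, 1]);
}
```

The key property matches the description above: a low-priority transfer is never starved, it just yields sooner.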
**Edge-triggered ready set.** Transfers enter the ready set when enqueued or woken. They leave when `poll_work()` returns `Pending` or `Done`. The scheduler never polls a transfer that returned `Pending` until something explicitly wakes it. Scheduling cost scales with the number of active transfers, not the total transfer count.

**Capacity gating via the `ConcurrencyController` trait.** The scheduler checks `controller.target()` before polling. A `FixedConcurrency` controller returns a constant. An adaptive controller (future PR) observes throughput and adjusts. The scheduler doesn't care which -- it just asks "do I have capacity?"

**Follow-on work bypasses CFS.** When a work item completes and produces a successor (e.g., a disk read completes and the data must now be sent over the network), the successor goes directly to execution. Re-entering the ready set would mean completed disk reads sit in buffers waiting behind other transfers. CFS controls admission; successors complete admitted work.
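A sketch of how polling, the edge-triggered ready set, and capacity gating compose in one scheduling tick. The `PollWork`/`Transfer` shapes and `tick` function are assumed for illustration; the PR's real trait and ready-set types will differ:

```rust
/// Assumed polling contract: a transfer reports work lazily, and the
/// scheduler stops polling it once it returns Pending or Done.
enum PollWork {
    Ready(WorkItem),
    Pending, // edge-triggered: leave the ready set until explicitly woken
    Done,
}

struct WorkItem; // e.g. one part upload or one range GET

trait Transfer {
    fn poll_work(&mut self) -> PollWork;
}

/// One scheduling tick: admit work only while the controller grants capacity.
/// A plain Vec stands in for the vruntime-ordered ready set.
fn tick<T: Transfer>(
    ready: &mut Vec<T>,
    in_flight: &mut usize,
    target: usize, // what ConcurrencyController::target() would return
    submit: &mut impl FnMut(WorkItem),
) {
    while *in_flight < target {
        let Some(mut t) = ready.pop() else { return };
        match t.poll_work() {
            PollWork::Ready(item) => {
                *in_flight += 1;
                submit(item);
                ready.push(t); // still ready; the real set re-keys by vruntime
            }
            PollWork::Pending => { /* dropped from the ready set until woken */ }
            PollWork::Done => { /* transfer complete */ }
        }
    }
}

/// Toy transfer that produces one work item per poll, like a lazy
/// multipart upload.
struct PartUpload {
    remaining: u32,
}

impl Transfer for PartUpload {
    fn poll_work(&mut self) -> PollWork {
        if self.remaining == 0 {
            return PollWork::Done;
        }
        self.remaining -= 1;
        PollWork::Ready(WorkItem)
    }
}

fn main() {
    let mut ready = vec![PartUpload { remaining: 3 }];
    let mut in_flight = 0;
    let mut submitted = 0;
    tick(&mut ready, &mut in_flight, 2, &mut |_item| submitted += 1);
    // Capacity gate: only 2 of the 3 parts were admitted this tick.
    assert_eq!(submitted, 2);
    assert_eq!(in_flight, 2);
}
```

The important boundary is visible even in this toy: the tick only asks "do I have capacity?" and "does this transfer have work?", never how the work runs.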
**Execution layer is an abstraction boundary.** The scheduler generates work and receives completions; it doesn't know how work runs. The current implementation uses a worker pool with `tokio::spawn` on the multi-threaded runtime -- the closest analog to main's `JoinSet`-based execution. The boundary is deliberate: the scheduler's internals (CFS, ready set, capacity tracking) don't change if the execution layer moves to managed threads with current-thread runtimes, or something else entirely.

## What's in the module
The `Handle` struct now carries both schedulers. Existing upload/download operations use `legacy_scheduler`; the new scheduler is available but not called by any operation yet.

## Benchmarks
Benchmarked on c6in.16xlarge (64 vCPU, 100 Gbps NIC). Single 30 GiB download to RAM, fixed concurrency at 128.
The "pending/run ratio" is how many times `poll_work()` returned `Pending` for every time it returned `Ready` with actual work. A high ratio means churn: the transfer keeps getting polled, keeps saying "not ready yet", gets woken on the next completion, and the cycle repeats. A ratio near 1 means almost every poll produces work.

**Resource Usage: Main vs New Scheduler, seq_gap=512**

The gap=16 result was the key finding: with a seq window too small relative to concurrency, transfers churn between Pending and Ready on every completion, and the scheduler spends all its time waking and re-polling instead of generating work. At gap=512 (4x concurrency), the churn disappears and throughput matches main while using 60% less memory.
With 128 concurrency and 8 MB parts, the theoretical minimum in-flight memory is ~1 GB (every body buffer full, nothing queued). The new scheduler's 4.6 GB reflects that plus sequencing buffers. Main's 11.8 GB reflects unbounded out-of-order buffering with no backpressure from the scheduler.
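A quick back-of-envelope check on those numbers (the figures come from the benchmark above; the code is just arithmetic):

```rust
fn main() {
    // Theoretical in-flight floor: every body buffer full, nothing queued.
    let concurrency = 128u64;
    let part_mib = 8u64;
    let floor_gib = (concurrency * part_mib) as f64 / 1024.0;
    assert_eq!(floor_gib, 1.0); // 128 * 8 MiB = 1 GiB

    // Peak memory reported above: new scheduler vs main.
    let (new_gb, main_gb) = (4.6f64, 11.8f64);
    let savings = 1.0 - new_gb / main_gb;
    // ~61%, consistent with the "60% less memory" claim.
    assert!((0.60..0.62).contains(&savings));
}
```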
Scheduler overhead per work item is ~10us, under 1% of wall time at real throughput. At gap=512, the new scheduler matches main's throughput while using 60% less memory.
## What's NOT in this PR

- Adaptive concurrency (the `AdaptiveConcurrency` controller)