
cap-mbps redux#3378

Draft
adreed-msft wants to merge 2 commits into main from adreed/cap-mbps-redux

Conversation

@adreed-msft
Member

Description

There are some issues with our current pacing mechanism that lead to request starvation on slow networks or with low requested bandwidth. First, requests get flighted based upon the number of currently live threads. If a bandwidth limit is applied, request bodies are wrapped, and request bytes are metered by pulling them from a frequently refreshed "token bucket": an atomic int64 with a worker routine constantly feeding it bytes. Because of this first-come-first-serve basis, if there are too many requests, some will never be allocated bandwidth in a reasonable amount of time, which results in those requests getting dropped.

To that end, the pacing mechanism has been reworked slightly. A second vector of gating is applied: we prevent requests from going out the door until we're certain we can allocate them a reasonable minimum amount of bandwidth. Atop this, bandwidth allocation is no longer handled first-come-first-serve, but instead in a structured manner: requests that have been cleared to go live each receive a share of the requested maximum bandwidth.

So far, this does seem to correctly limit bandwidth, but there are some issues with my implementation I'm still working through. Hence, this lives in draft status to demonstrate the work done.

  1. Still, too many requests can be let through the gates. While I no longer get service timeouts under the new code, requests do live for... quite a while, and sometimes wind up having their network connections closed (a property of network equipment in between?). Either way, less than optimal. The mechanism for gating requests is very immature: a naive implementation where we try to divvy things up close to Azure's minimum allowed throughput. My plan is to graft the concurrency tuner onto this and combine the efforts somewhat, given it also acts by scaling against concurrent requests. Combining these efforts also comes with the benefit of not leaving a bunch of effectively dead threads waiting around.

  2. DONE, as of 5fda4d7. Retries: because the new pacer was grafted directly over the old code, a request body continues to live through multiple retries, resulting in the stream being seeked back to 0 and the request gating mechanism being bypassed. Fortunately, this is a simple enough fix; I'm inclined to believe I can apply this gating via a post-retry policy, and it should work fine.

  3. Hiccups in throughput: because of the mechanism this acts on, many requests wind up completing at the same time, and the ticker takes another second to accept new requests. The main idea here is to add some randomness to the allocations, so that requests complete at different times and sort of blend together for the sake of throughput consistency. There's another case I've observed where a possible bug in the early-clear mechanism can lead to exceeding the requested bandwidth limit.

  4. Debating whether or not to keep the bandwidth detector. The implementation works, but is somewhat unnecessary compared to just using AzCopy's native bandwidth detection. I'd expected something a bit more real-time would be useful for this, but I'm not so sure at this point.

Related Links

  • ADO WI # 33075219

Type of Change

  • Bug fix
  • New feature
  • Documentation update required
  • Code quality improvement
  • Other (describe):

How Has This Been Tested?

Currently, very manual testing. The bucket rotator / bandwidth detector has some formal unit tests, but I'm not sure whether that component is staying.

@adreed-msft adreed-msft changed the title Adreed/cap mbps redux cap-mbps redux Feb 20, 2026
