Description
There are some issues with our current pacing mechanism that lead to request starvation on slow networks or with a low requested bandwidth. First, requests get flighted based upon the number of currently live threads. If a bandwidth limit is applied, request bodies are wrapped so that they pull their bytes from a frequently refreshed "token bucket": an atomic int64 with a worker routine constantly feeding it bytes. Because this works on a first-come-first-serve basis, if there are too many requests, some requests will never be allocated bandwidth in a reasonable amount of time, which results in those requests getting dropped.
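The old mechanism described above can be sketched roughly as follows. This is a minimal illustration, not AzCopy's actual code; the type and method names are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// tokenBucket sketches the old pacing primitive: an atomic int64 that a
// worker goroutine tops up on a fixed interval, with request bodies pulling
// bytes from it first-come-first-serve.
type tokenBucket struct {
	available int64 // bytes currently available this tick
	capacity  int64 // bytes fed in per refill
}

// refill runs as the worker goroutine, constantly feeding the bucket bytes.
func (b *tokenBucket) refill(interval time.Duration, done <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			atomic.StoreInt64(&b.available, b.capacity)
		case <-done:
			return
		}
	}
}

// take attempts to claim n bytes. Whichever caller gets here first wins,
// which is the root of the starvation problem: with too many requests,
// latecomers can find the bucket drained on every single tick.
func (b *tokenBucket) take(n int64) bool {
	for {
		cur := atomic.LoadInt64(&b.available)
		if cur < n {
			return false
		}
		if atomic.CompareAndSwapInt64(&b.available, cur, cur-n) {
			return true
		}
	}
}

func main() {
	b := &tokenBucket{available: 1024, capacity: 1024}
	fmt.Println(b.take(512))  // succeeds: bytes were available
	fmt.Println(b.take(1024)) // fails: bucket is drained until the next refill
}
```

Nothing here prevents one set of requests from winning the compare-and-swap race every tick, which is exactly the starvation this PR addresses.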
To that end, the pacing mechanism has been reworked slightly. Another vector of gating is applied: we prevent requests from going out the door until we're certain we can allocate them a reasonable minimum amount of bandwidth. Atop this, bandwidth allocation is no longer handled first-come-first-serve but instead in a structured manner: each request that has been given the go-live receives a portion of the requested maximum bandwidth.
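A rough sketch of that gating plus structured allocation, under the assumption of an equal split and a fixed per-request floor (both the names and the equal-split policy are illustrative, not the PR's exact scheme):

```go
package main

import "fmt"

// minBytesPerSec is an assumed per-request bandwidth floor: a request is
// only admitted if it (and everyone already live) can still be guaranteed
// at least this much.
const minBytesPerSec = 1_000_000

// canAdmit reports whether one more request can go live while still leaving
// every live request a reasonable minimum amount of bandwidth.
func canAdmit(maxBytesPerSec int64, live int) bool {
	return maxBytesPerSec/int64(live+1) >= minBytesPerSec
}

// shareFor returns each admitted request's portion of the requested maximum,
// assuming an equal split among live requests.
func shareFor(maxBytesPerSec int64, live int) int64 {
	if live == 0 {
		return 0
	}
	return maxBytesPerSec / int64(live)
}

func main() {
	max := int64(8_000_000) // e.g. an 8 MB/s cap
	fmt.Println(canAdmit(max, 7))  // true: an 8th request still gets the floor
	fmt.Println(canAdmit(max, 8))  // false: a 9th would push shares below it
	fmt.Println(shareFor(max, 8))  // 1000000 bytes/sec each
}
```

The key difference from the old bucket is that admission is decided up front, so a request either never goes out the door or is guaranteed its share, rather than racing for bytes each tick.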
So far, this does seem to be working to correctly limit bandwidth, but there are some issues with my implementation I'm still working through. Hence, this lives in draft status to demonstrate the work done.
Still, too many requests can be let through the gates. While I don't get service timeouts anymore under the new code, the requests do live for... quite a while, and sometimes wind up having their network connections closed (a property of network equipment in between?). Either way, less than optimal. The mechanism for gating requests is very immature: it was a naive implementation where we'd try to divvy things up close to Azure's minimum allowed throughput. My plan for this is to graft the concurrency tuner onto this and combine the efforts somewhat, given it also acts on a mechanism of scaling against concurrent requests. Combining these efforts also comes with the benefit of not having a bunch of effectively dead threads waiting around.
DONE, as of 5fda4d7
- Retries: because the new pacer was grafted directly over the old code, a request body continues to live through multiple retries, resulting in the stream getting seeked back to 0 and bypassing the request gating mechanism. This is a simple enough fix, fortunately; I'm inclined to believe I can apply this gating via a post-retry policy, and it should work fine.
- Hiccups in throughput: because of the mechanism this acts on, many requests wind up completing at the same time, and the ticker takes another second to accept new requests. The main thing I'm thinking here is that I should add some randomness to the allocations, so that requests complete at different times and sort of blend together for the sake of throughput consistency. There's another case I've observed where a possible bug in the early-clear mechanism can lead to an overage of the requested bandwidth limit.
Debating whether or not to keep the bandwidth detector. The implementation *works*, but it's kind of unnecessary compared to just using AzCopy's native bandwidth detection. I'd expected that something a bit more real-time would be useful for this, but I'm not so sure at this point.
Related Links
Type of Change
How Has This Been Tested?
Currently, very manual testing. The bucket rotator / bandwidth detector has some formal unit tests, but I'm not sure whether that component is staying.