
feat: Tell Load Balancers to Back Off #1530

Draft

mterhar wants to merge 4 commits into main from mterhar.reject-retry-after

Conversation

Contributor

@mterhar commented Apr 18, 2025

Short description of the changes

Sometimes you have a massive cluster. This cluster has hundreds of nodes. The load balancer sending in traffic is like, "I'm gonna send it all to this one node!"

Well, ALBs will respect the Retry-After header, so even if clients don't, you can throw a 503 real quick and the retry will get routed to a less stressed node in the cluster.

This may be problematic if you have a retry limit of 3 and your client doesn't respect that header: the request may hit 3 nodes that are hotter than average and then you're dropping data.

OpenTelemetry exporters should respect the Retry-After header, but will definitely retry after a 503 response code.

Implementation

This is behind a configuration that won't be enabled by default.

If you have stress relief in monitor mode, you'll have access to the metrics and can enable it with the following configuration:

StressRelief:
  Mode: monitor
  ActivationLevel: 90
  DeactivationLevel: 80
  SamplingRate: 10
  MinimumActivationDuration: 30s
  MinimumStartupDuration: 10s
  InboundRejectionServer: incoming
  InboundRejectionTolerance: 5
  RetryAfterSeconds: 5
  StatusTooManyRequests: 429

Configuration ergonomics matter here because this will need to be tuned live, since production traffic only hits production deployments. The operative configuration options are:

InboundRejectionServer: incoming

This can be set to none, all, incoming, or peer. Peer is optimistic and probably shouldn't be used unless we fix up libhoney-go so it respects the Retry-After header. That's certainly possible to do, but isn't in scope for this PR.

Enabling it on incoming will allow the StressCheck middleware to fire and evaluate the situation before engaging the parsers and storage.
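
To make that concrete, here's a rough sketch of how the setting could gate where the middleware gets installed. The helper and names here are hypothetical, not the actual router wiring:

package main

import "net/http"

// wrapWithStressCheck is a hypothetical helper that decides, per listener,
// whether to install the early-rejection middleware based on the
// InboundRejectionServer setting: none, all, incoming, or peer.
func wrapWithStressCheck(mode, listener string, stressCheck func(http.Handler) http.Handler, h http.Handler) http.Handler {
    if mode == "all" || mode == listener {
        return stressCheck(h)
    }
    return h // "none" or a non-matching listener: handler stays untouched
}

So with InboundRejectionServer: incoming, only the incoming listener gets wrapped and the peer listener keeps accepting spans from other Refinery nodes.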

InboundRejectionTolerance: 5

This lets your individual pods run more lopsided before they start rejecting, which reduces 503 errors. If your cluster runs really hot and you need the load balancer on point, like 100+ nodes pushing gigabytes per second, this should be like 1 or 2. If your cluster is lower stress or scaled for spikes, you can increase it to 10 or 15.
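
For illustration, here's that check expressed in code. This is just my reading of the tolerance semantics (node stress relative to cluster stress), not the actual implementation:

package main

// shouldReject is a hypothetical decision helper: a node only starts
// rejecting traffic once its own stress level sits more than the
// configured tolerance above the cluster-wide stress level
// (stress levels assumed to be 0-100 values).
func shouldReject(localStress, clusterStress, tolerance uint) bool {
    return localStress > clusterStress+tolerance
}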

RetryAfterSeconds: 5

This one sets the Retry-After header to a number of seconds so that any clients that respect it will wait that long and the load balancer will lay off the node for that long.

Magical behavior: if you set it to zero, it will use the TraceTimeout setting, since that is logically when a lot of the load will be gone. This is probably excessive in most cases, but for VERY LARGE and VERY LOPSIDED deployments, it may make sense.
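
A sketch of that fallback, with a hypothetical helper name rather than the actual code in this PR:

package main

import "time"

// retryAfterSeconds resolves the header value: zero means "fall back to
// the trace timeout", since by then much of the buffered load should
// have drained.
func retryAfterSeconds(configured int, traceTimeout time.Duration) int {
    if configured == 0 {
        return int(traceTimeout.Seconds())
    }
    return configured
}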

StatusTooManyRequests: 429

This one sets the response code. If all your clients are OTLP, they should respect Retry-After and you don't actually need to fool a load balancer.

If your clients are Libhoney and/or custom things, and you want to rely on the load balancer, you can change this to 503.

Any other code will be ignored and it will default to 429.
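
Roughly, the validation is this (hypothetical helper, just illustrating the 429/503-or-default rule):

package main

import "net/http"

// rejectionStatus only honors 429 and 503; anything else falls back to 429.
func rejectionStatus(configured int) int {
    switch configured {
    case http.StatusTooManyRequests, http.StatusServiceUnavailable: // 429, 503
        return configured
    default:
        return http.StatusTooManyRequests // 429
    }
}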

Failure types

  1. Client is set to not retry. Data loss
  2. Client retries and several nodes in the cluster all reject the spans because they all think it's a fresh request and they're otherwise busy. Data loss
  3. All of the nodes get a bit stressed and use Retry-After, which prevents them from flipping over into stress relief mode, but the LB decides all the nodes are broken and refuses to send data to any of them. Hopefully the rejection tolerance will prevent this, but as nodes go into stress mode and their stress levels start falling, you could end up in a situation where the average stress drops while every node individually tells the LB it's busy.

Alternatives

I noticed that the mini-load-balancing in #1525 doesn't seem to be sufficient to offload spans, so we need the load balancer to participate.

@verythorough
Contributor

Pulling this out of triage (and the Pipeline board in general), since it's now referenced in multiple issues in the board.


Labels

type: enhancement (New feature or request)
