
feat: Tell Load Balancers to Back Off #1530

Draft

mterhar wants to merge 4 commits into main from mterhar.reject-retry-after

Conversation

Contributor

@mterhar commented Apr 18, 2025

Short description of the changes

Sometimes you have a massive cluster. This cluster has hundreds of nodes. The load balancer sending in traffic is like, "I'm gonna send it all to this one node!"

Well, ALBs will respect the Retry-After header, so even if clients don't, you can throw a 503 real quick and the retry will get routed to a less stressed node in the cluster.

This may be problematic if you have a retry limit of 3 and your client doesn't respect that header: the request may hit 3 nodes that are hotter than average and then you're dropping data.

OpenTelemetry exporters should respect the Retry-After header, but will definitely retry after a 503 response code.

Implementation

This is behind a configuration that won't be enabled by default.

If you have stress relief in monitor mode, you'll have access to the metrics and can enable it with the following configuration:

StressRelief:
  Mode: monitor
  ActivationLevel: 90
  DeactivationLevel: 80
  SamplingRate: 10
  MinimumActivationDuration: 30s
  MinimumStartupDuration: 10s
  InboundRejectionServer: incoming
  InboundRejectionTolerance: 5
  RetryAfterSeconds: 5
  StatusTooManyRequests: 429

Configuration ergonomics matter here because this will need to be tuned live, since production traffic only hits production deployments. The operative configuration options are:

InboundRejectionServer: incoming

This can be set to none, all, incoming, or peer. Peer is optimistic and probably shouldn't be used unless we fix up libhoney-go so it respects the Retry-After header. That's certainly possible to do, but isn't in scope for this PR.

Enabling it on incoming will allow the StressCheck middleware to fire and evaluate the situation before engaging the parsers and storage.
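
To make that concrete, here's a rough sketch of how the setting could gate where the middleware gets installed. The helper and names here are hypothetical, not the actual router wiring:

package main

import "net/http"

// wrapWithStressCheck is a hypothetical helper that decides, per listener,
// whether to install the early-rejection middleware based on the
// InboundRejectionServer setting: none, all, incoming, or peer.
func wrapWithStressCheck(mode, listener string, stressCheck func(http.Handler) http.Handler, h http.Handler) http.Handler {
    if mode == "all" || mode == listener {
        return stressCheck(h)
    }
    return h // "none" or a non-matching listener: handler stays untouched
}

So with InboundRejectionServer: incoming, only the incoming listener gets wrapped and the peer listener keeps accepting spans from other Refinery nodes.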

InboundRejectionTolerance: 5

This lets your individual pods run more lopsided before they start rejecting, which reduces 503 errors. If your cluster runs really hot and you need the load balancer on point, like 100+ nodes pushing gigabytes per second, this should be like 1 or 2. If your cluster is lower stress or scaled for spikes, you can increase it to 10 or 15.
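
For illustration, here's that check expressed in code. This is just my reading of the tolerance semantics (node stress relative to cluster stress), not the actual implementation:

package main

// shouldReject is a hypothetical decision helper: a node only starts
// rejecting traffic once its own stress level sits more than the
// configured tolerance above the cluster-wide stress level
// (stress levels assumed to be 0-100 values).
func shouldReject(localStress, clusterStress, tolerance uint) bool {
    return localStress > clusterStress+tolerance
}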

RetryAfterSeconds: 5

This one sets the Retry-After header to a number of seconds so that any clients that respect it will wait that long and the load balancer will lay off the node for that long.

Magical behavior: if you set it to zero, it will use the TraceTimeout setting, since that is logically when a lot of the load will be gone. This is probably excessive in most cases, but for VERY LARGE and VERY LOPSIDED deployments, it may make sense.
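
A sketch of that fallback, with a hypothetical helper name rather than the actual code in this PR:

package main

import "time"

// retryAfterSeconds resolves the header value: zero means "fall back to
// the trace timeout", since by then much of the buffered load should
// have drained.
func retryAfterSeconds(configured int, traceTimeout time.Duration) int {
    if configured == 0 {
        return int(traceTimeout.Seconds())
    }
    return configured
}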

StatusTooManyRequests: 429

This one sets the response code. If all your clients are OTLP, they should respect Retry-After and you don't actually need to fool a load balancer.

If your clients are Libhoney and/or custom things, and you want to rely on the load balancer, you can change this to 503.

Any other code will be ignored and it will default to 429.
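
Roughly, the validation is this (hypothetical helper, just illustrating the 429/503-or-default rule):

package main

import "net/http"

// rejectionStatus only honors 429 and 503; anything else falls back to 429.
func rejectionStatus(configured int) int {
    switch configured {
    case http.StatusTooManyRequests, http.StatusServiceUnavailable: // 429, 503
        return configured
    default:
        return http.StatusTooManyRequests // 429
    }
}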

Failure types

  1. Client is set to not retry. Data loss
  2. Client retries and several nodes in the cluster all reject the spans because they all think it's a fresh request and they're otherwise busy. Data loss
  3. All of the nodes get a bit stressed and use Retry-After, which prevents them from flipping over into stress relief mode, but the LB decides all the nodes are broken and refuses to send data to any of them. Hopefully the rejection tolerance will prevent this, but as nodes go into stress mode and their stress levels start falling, you could end up in a situation where the average stress drops while every node individually tells the LB it's busy.

Alternatives

I noticed that the mini-load-balancing in #1525 doesn't seem to be sufficient to offload spans, so we need the load balancer to participate.

@verythorough
Contributor

Pulling this out of triage (and the Pipeline board in general), since it's now referenced in multiple issues in the board.


Labels

type: enhancement (New feature or request)
