Skip to content

Conversation

@charlesma4
Copy link
Contributor

Description

  • Add configurable HTTP probe workers for backend health checking
  • Backends with failing probes are excluded from load balancing
  • Add graceful drain support: /healthz returns 503 when draining
  • Probe config: probe_url, probe_*_threshold, probe_period_seconds

Original PR description:

Restarting op-stack components can generate unnecessary error noise because traffic may still be routed to containers that are about to be stopped. To ensure a cleaner shutdown process for proxyd and backend services, we’ve introduced an optional probe URL for each backend. When the probe fails, the backend is immediately removed from load balancing.

Additionally, a shutdown drain mechanism has been implemented. Upon receiving a SIGTERM, the probe will fail immediately, and the shutdown process will be delayed by 7 seconds. This gives the load balancer enough time to detect the probe failure and remove the backend from rotation before the container shuts down.

We assume the load balancer is configured with a probe interval of no more than 5 seconds, a failure threshold of 1, and a timeout of 1 second.

Tests

Added testing for common healthcheck results based on server responses - success, failure, and redirect

Additional context

From original PR:

When launching proxyd via Docker Compose, the default stop_grace_period is 10 seconds. This should be increased to at least 15 seconds—or ideally 30 seconds—to align with the default values used by Kubernetes and ECS. Otherwise, the container risks being forcefully terminated with a SIGKILL before the drain logic has a chance to complete.

Metadata

  • Fixes #[Link to Issue]

@charlesma4 charlesma4 requested a review from a team as a code owner December 23, 2025 14:37
@wiz-inc-a178a98b5d
Copy link

wiz-inc-a178a98b5d bot commented Dec 23, 2025

Wiz Scan Summary

Scanner Findings
Vulnerability Finding Vulnerabilities -
Data Finding Sensitive Data -
Secret Finding Secrets -
IaC Misconfiguration IaC Misconfigurations -
SAST Finding SAST Findings 2 Medium 1 Info
Software Management Finding Software Management Findings -
Total 2 Medium 1 Info

View scan details in Wiz

To detect these findings earlier in the dev lifecycle, try using Wiz Code VS Code Extension.

Copy link
Contributor

@jelias2 jelias2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution, look forward to landing this!

@charlesma4 charlesma4 requested a review from jelias2 January 14, 2026 14:50
@charlesma4 charlesma4 requested a review from jelias2 January 20, 2026 15:07
@codecov-commenter
Copy link

codecov-commenter commented Jan 22, 2026

Codecov Report

❌ Patch coverage is 63.13559% with 87 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.64%. Comparing base (81a603d) to head (296bc5e).

Files with missing lines Patch % Lines
proxyd/proxyd.go 22.44% 33 Missing and 5 partials ⚠️
proxyd/backend.go 15.78% 31 Missing and 1 partial ⚠️
proxyd/backend_probe.go 88.39% 9 Missing and 4 partials ⚠️
proxyd/server.go 85.18% 4 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #534      +/-   ##
==========================================
+ Coverage   57.61%   57.64%   +0.02%     
==========================================
  Files          93       94       +1     
  Lines       13985    14201     +216     
==========================================
+ Hits         8058     8186     +128     
- Misses       5440     5518      +78     
- Partials      487      497      +10     
Flag Coverage Δ
proxyd 71.40% <63.13%> (-0.55%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
proxyd/config.go 61.90% <ø> (ø)
proxyd/metrics.go 85.45% <100.00%> (+1.45%) ⬆️
proxyd/server.go 76.85% <85.18%> (-0.27%) ⬇️
proxyd/backend_probe.go 88.39% <88.39%> (ø)
proxyd/backend.go 79.83% <15.78%> (-2.08%) ⬇️
proxyd/proxyd.go 54.35% <22.44%> (-3.25%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@charlesma4 charlesma4 force-pushed the charlesma/proxyd-healthcheck branch from 0df18a8 to 296bc5e Compare January 29, 2026 19:38
#2)

Changes:
- Set graceful_shutdown_seconds default to 0 (no delay) instead of 10s
- Remove graceful_shutdown_seconds = 0 from all test configs (no longer needed)
- Fix transport resource leak in ProbeWorker by closing connections on Stop()
- Make Drain() actually reject new RPC and WebSocket requests during drain period

Rationale:
The healthcheck and drain functionality should be disabled by default.
The previous 10-second default forced unnecessary changes to all existing
test configurations. Now configs only need to set graceful_shutdown_seconds
if they want to enable the drain delay.

Co-authored-by: Claude Sonnet 4.5 <[email protected]>
@jelias2 jelias2 merged commit f9bf450 into ethereum-optimism:main Feb 2, 2026
41 of 51 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants