Skip to content

fix: enforce max waiting on control plane also if response is slow#2656

Open
alepane21 wants to merge 11 commits intomainfrom
ale/eng-9201-router-router-should-work-if-controlplane-is-down
Open

fix: enforce max waiting on control plane also if response is slow#2656
alepane21 wants to merge 11 commits intomainfrom
ale/eng-9201-router-router-should-work-if-controlplane-is-down

Conversation

@alepane21
Copy link
Copy Markdown
Contributor

@alepane21 alepane21 commented Mar 17, 2026

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced control plane registration resilience by implementing a 15-second timeout on HTTP requests, ensuring bounded registration attempts and preventing potential hangs.

Checklist

  • I have discussed my proposed changes in an issue and have received approval to proceed.
  • I have followed the coding standards of the project.
  • Tests or benchmarks have been added or updated.
  • Documentation has been updated on https://github.com/wundergraph/docs-website.
  • I have read the Contributors Guide.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 17, 2026

Walkthrough

Modified control plane registration to introduce a timeout-bounded HTTP client. Adjusted imports to include net/http and time, configured a new HTTP client with a 15-second timeout, and updated NodeServiceClient construction to use the new client instead of the retryablehttp client.

Changes

Cohort / File(s) Summary
HTTP Client Configuration
router/pkg/controlplane/selfregister/self_register.go
Added net/http and time imports; introduced httpClient with 15-second timeout; replaced retryablehttp client usage with the new timeout-bounded httpClient for NodeServiceClient initialization.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately reflects the main change: introducing a 15-second HTTP client timeout to enforce maximum waiting on control plane communication, addressing slow responses.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

CodeRabbit can use TruffleHog to scan for secrets in your code with verification capabilities.

Add a TruffleHog config file (e.g. trufflehog-config.yml, trufflehog.yml) to your project to customize detectors and scanning behavior. The tool runs only when a config file is present.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 17, 2026

Router image scan failed

❌ Security vulnerabilities found in image:

ghcr.io/wundergraph/cosmo/router:sha-a225a1fd4a2d77eb3d6a9d4e0ec7fef1ce139088

Please check the security vulnerabilities found in the PR.

If you believe this is a false positive, please add the vulnerability to the .trivyignore file and re-run the scan.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
router/pkg/controlplane/selfregister/self_register.go (1)

65-71: Timeout may limit retry effectiveness when control plane is slow.

The 15-second http.Client.Timeout applies to the entire operation including all retries, since StandardClient() wraps retry logic in its RoundTripper. With RetryMax=3 (4 total attempts) and DefaultBackoff waits (~7s total), only ~8 seconds remain for actual HTTP round-trips.

If the control plane is slow (e.g., first request takes 10s before failing), retries may never execute. This is fine if the intent is a hard 15-second cap regardless of retries, but consider:

  1. Increasing the timeout (e.g., 30-45s) to allow retries when responses are slow, or
  2. Setting retryClient.HTTPClient.Timeout instead, which applies per-attempt rather than total

If the current behavior is intentional (hard 15s cap trumps retries), the code is correct as-is.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@router/pkg/controlplane/selfregister/self_register.go` around lines 65 - 71,
The 15s httpClient.Timeout set on the client returned by
retryClient.StandardClient() caps the entire multi-attempt operation (including
retries) and can prevent retries when the control plane is slow; to fix, either
increase the timeout (e.g., set httpClient.Timeout = 30*time.Second or
45*time.Second after calling retryClient.StandardClient()) if you want a larger
hard cap, or instead set the per-attempt timeout on the retry client itself by
configuring retryClient.HTTPClient.Timeout = 30*time.Second before calling
retryClient.StandardClient(); update the code around
retryClient.StandardClient(), httpClient.Timeout and where c.nodeServiceClient
is created to apply the chosen change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@router/pkg/controlplane/selfregister/self_register.go`:
- Around line 65-71: The 15s httpClient.Timeout set on the client returned by
retryClient.StandardClient() caps the entire multi-attempt operation (including
retries) and can prevent retries when the control plane is slow; to fix, either
increase the timeout (e.g., set httpClient.Timeout = 30*time.Second or
45*time.Second after calling retryClient.StandardClient()) if you want a larger
hard cap, or instead set the per-attempt timeout on the retry client itself by
configuring retryClient.HTTPClient.Timeout = 30*time.Second before calling
retryClient.StandardClient(); update the code around
retryClient.StandardClient(), httpClient.Timeout and where c.nodeServiceClient
is created to apply the chosen change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d4981a0b-c497-4125-8c0a-d12688f7a40b

📥 Commits

Reviewing files that changed from the base of the PR and between 02ce559 and 35aa64b.

📒 Files selected for processing (1)
  • router/pkg/controlplane/selfregister/self_register.go

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 63.49%. Comparing base (2ca7b28) to head (cfaa401).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2656      +/-   ##
==========================================
+ Coverage   63.46%   63.49%   +0.02%     
==========================================
  Files         251      251              
  Lines       26767    26773       +6     
==========================================
+ Hits        16989    16999      +10     
+ Misses       8415     8392      -23     
- Partials     1363     1382      +19     
Files with missing lines Coverage Δ
...ter/pkg/controlplane/selfregister/self_register.go 65.38% <100.00%> (+65.38%) ⬆️

... and 15 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

@StarpTech StarpTech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a test

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 2, 2026

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Apr 2, 2026
@alepane21 alepane21 removed the Stale label Apr 2, 2026
alepane21 and others added 4 commits April 2, 2026 17:00
…d-work-if-controlplane-is-down' into ale/eng-9201-router-router-should-work-if-controlplane-is-down
@alepane21 alepane21 requested a review from StarpTech April 2, 2026 21:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants