feat: Handle Secret Lookup Rate Limit errors #2866
Conversation
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@checks/runchecks.go`:
- Around line 199-201: RunChecksNoPersistence calls
filterSecretLookupRateLimitedResults, which consumes tokens from the global
secretLookupRateLimiter even though its skipped counts are discarded; change
RunChecksNoPersistence to avoid draining the shared limiter by either (A)
skipping the call to filterSecretLookupRateLimitedResults in the no-persistence
path and using transformedResults directly for ExportCheckMetrics, or (B)
modifying filterSecretLookupRateLimitedResults to accept an injected limiter and
passing a per-canary/no-op limiter from RunChecksNoPersistence so the global
secretLookupRateLimiter is not consumed; ensure ExportCheckMetrics still
receives the correct filteredResults or equivalent, and preserve any
skipped-count reporting logic local to the canary if needed (option (B) is
sketched after this list).
- Around line 32-35: The package-level singleton secretLookupRateLimiter causes
token sharing across canaries; replace it with a per-canary limiter registry
(e.g., a sync.Map or the existing gocache) and provide a helper like
getSecretLookupLimiter(canaryID string) that looks up or creates a
rate.NewLimiter(rate.Limit(float64(defaultSecretLookupFailureThreshold)/defaultSecretLookupFailureWindow.Seconds()),
defaultSecretLookupFailureThreshold) for that canary; update call sites that
used secretLookupRateLimiter to call getSecretLookupLimiter(canary.ID), and
consider eviction/cleanup of stale limiters if using a long-lived map (a
registry helper is sketched after this list).
🧹 Nitpick comments (2)
pkg/jobs/canary/canary_jobs.go (1)

172-192: Simplify the random number generation — Go 1.20+ supports the auto-seeded, thread-safe `rand.Intn`, so the local `rand.New(rand.NewSource(...))` is unnecessary:

```diff
- rng := rand.New(rand.NewSource(time.Now().UnixNano()))
- delay := time.Minute + time.Duration(rng.Intn(60))*time.Second
+ delay := time.Minute + time.Duration(rand.Intn(60))*time.Second
```

Consider a safeguard against excessive retry chains — if a canary's rate-limited job keeps failing every 1-2 minutes while its next scheduled run is more than 5 minutes away, `maybeRescheduleAfterSecretLookupRateLimit` will reschedule repeatedly, creating a chain of retries. The 5-minute check provides some protection but doesn't prevent multiple consecutive retries. Consider adding a max retry limit or cooldown tracking to avoid potential thundering-herd behavior during sustained rate limiting (a sketch follows below).
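A minimal sketch of the suggested retry cap; the constant, the counter map, and the integration point inside `maybeRescheduleAfterSecretLookupRateLimit` are all hypothetical:

```go
// Hypothetical cap on consecutive rate-limit reschedules, layered on top
// of the existing 5-minute check.
const maxRateLimitRetries = 3

var rateLimitRetryCounts sync.Map // canary ID -> consecutive retry count

// allowReschedule returns false once a canary has been rescheduled
// maxRateLimitRetries times in a row, letting the regular schedule take
// over instead of extending the retry chain.
func allowReschedule(canaryID string) bool {
	n, _ := rateLimitRetryCounts.LoadOrStore(canaryID, 0)
	count := n.(int)
	if count >= maxRateLimitRetries {
		return false
	}
	rateLimitRetryCounts.Store(canaryID, count+1)
	return true
}

// A successful (non-rate-limited) run should reset the counter:
// rateLimitRetryCounts.Delete(canaryID)
```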
pkg/topology/run.go (1)

223-226: Switching to `RunChecksNoPersistence` is appropriate for topology lookups. The return signature `([]*pkg.CheckResult, error)` matches the existing destructuring. Topology lookups don't need persistence, so this is a clean fit.

One consideration: `RunChecksNoPersistence` silently filters out rate-limited results (it discards the skipped count in `checks/runchecks.go` line 198). If a topology lookup is rate-limited, the caller here gets an incomplete result set with no indication that entries were dropped. This could lead to topology components silently disappearing during rate-limit episodes. You may want to at least log a warning inside `RunChecksNoPersistence` when results are filtered, or surface the count to the caller (see the sketch below).
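A sketch of surfacing the skipped count inside `RunChecksNoPersistence`; the filter signature (matching the injected-limiter sketch above), the logger call, and the surrounding names are assumptions about the actual code:

```go
// Hypothetical change inside RunChecksNoPersistence: log and surface the
// skipped count instead of discarding it.
filteredResults, skipped := filterSecretLookupRateLimitedResults(transformedResults, limiter)
if skipped > 0 {
	logger.Warnf("canary %s: dropped %d rate-limited result(s) in no-persistence run",
		canary.ID, skipped)
}
// Returning skipped alongside the results would let callers like
// pkg/topology/run.go distinguish "empty" from "filtered".
return filteredResults, nil
```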
resolves: #2818
Summary by CodeRabbit
Release Notes
New Features
Refactor