feat(cloud): Add telemetry retry queue for network resilience #7597
Conversation
- Implement RetryQueue class with workspace-scoped persistence
- Queue failed telemetry events for automatic retry
- Retry events every 60 seconds with fresh auth tokens
- FIFO eviction when queue reaches 100 events
- Persist queue across VS Code restarts

This ensures telemetry data isn't lost during network failures or temporary server issues. Migrated from RooCodeInc/Roo-Code-Cloud#744
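For orientation, here is a minimal sketch of the queue shape the description implies. All names (`QueuedRequest`, `RetryQueueConfig`, `RetryQueueSketch`) are illustrative, not the PR's actual API:

```typescript
// Illustrative sketch only; names do not match the PR's actual code.
interface QueuedRequest {
	url: string
	options: RequestInit // method, headers, body captured at enqueue time
	timestamp: number
	retryCount: number
}

interface RetryQueueConfig {
	maxQueueSize: number // FIFO eviction past this size (100 in the PR)
	retryIntervalMs: number // 60_000: one retry pass per minute
}

class RetryQueueSketch {
	private queue: QueuedRequest[] = []

	constructor(private config: RetryQueueConfig) {}

	enqueue(url: string, options: RequestInit): void {
		// FIFO eviction: once the cap is hit, the oldest event is dropped.
		if (this.queue.length >= this.config.maxQueueSize) {
			this.queue.shift()
		}
		this.queue.push({ url, options, timestamp: Date.now(), retryCount: 0 })
		// Persisting `this.queue` to workspace-scoped storage at this point is
		// what lets the queue survive VS Code restarts.
	}
}
```

Note that "fresh auth tokens" implies auth headers are resolved at retry time (via a provider) rather than stored with the queued request, so events retried minutes later do not go out with stale credentials.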
Thank you for implementing the telemetry retry queue! This is a valuable feature for network resilience. I've reviewed the changes and have some feedback:
Critical Issues (Must Fix):
- Unusual retry order logic - The `retryAll()` method in RetryQueue.ts processes the newest request first, then switches to oldest-first for the remaining requests. This could lead to unpredictable retry behavior. Consider using consistent FIFO order.
- Missing error detection for retry-worthy failures - The retry queue only activates for `TypeError` with a "fetch failed" message in TelemetryClient.ts. This misses other network errors like timeouts, DNS failures, or connection resets (see the sketch after this list).
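As referenced above, one way to broaden the failure check; this is a sketch assuming Node's undici-based `fetch`, which attaches the underlying system error as `cause`, and the helper name is hypothetical:

```typescript
// Hypothetical helper: treat more than just "fetch failed" as retry-worthy.
const RETRYABLE_ERROR_CODES = new Set([
	"ECONNRESET", // connection reset
	"ECONNREFUSED", // server unreachable
	"ETIMEDOUT", // socket timeout
	"ENOTFOUND", // DNS failure
	"EAI_AGAIN", // transient DNS failure
])

function isRetryableNetworkError(error: unknown): boolean {
	if (error instanceof TypeError && error.message.includes("fetch failed")) {
		return true
	}
	// AbortError from a request timeout (e.g. AbortSignal.timeout) is also transient.
	if (error instanceof Error && error.name === "AbortError") {
		return true
	}
	// Node's undici-based fetch attaches the system error as `cause`.
	const cause = (error as { cause?: { code?: string } })?.cause
	return typeof cause?.code === "string" && RETRYABLE_ERROR_CODES.has(cause.code)
}
```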
Important Suggestions (Should Consider):
- Potential memory leak with FormData - The `backfillMessages` method creates FormData with potentially large message arrays, but the retry mechanism might not handle multipart/form-data correctly when serialized.
- Missing retry limit enforcement - The `maxRetries` config is defined but never used, so failed requests could retry indefinitely (see the sketch after this list).
- Race condition in retryAll() - The `isProcessing` flag prevents concurrent retry attempts, but multiple timer triggers could be lost.
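As referenced in the list, a sketch of how `maxRetries` could be enforced during a consistent FIFO pass; the class and field names are illustrative, not the PR's code:

```typescript
// Hypothetical enforcement sketch; class and field names are illustrative.
type QueuedRequest = { url: string; options: RequestInit; retryCount: number }

class RetryQueueWithLimit {
	private queue: QueuedRequest[] = []
	private isProcessing = false

	constructor(
		private maxRetries: number,
		private retryRequest: (req: QueuedRequest) => Promise<void>,
	) {}

	async retryAll(): Promise<void> {
		if (this.isProcessing) return // one in-flight pass per timer tick
		this.isProcessing = true
		try {
			const pending = [...this.queue] // consistent FIFO: oldest first
			this.queue = []
			for (const request of pending) {
				try {
					await this.retryRequest(request)
				} catch {
					request.retryCount++
					// Drop requests that have exhausted their retry budget;
					// otherwise keep them for the next timer tick.
					if (request.retryCount < this.maxRetries) {
						this.queue.push(request)
					}
				}
			}
		} finally {
			this.isProcessing = false
		}
	}
}
```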
Minor Improvements:
- Missing tests for retry logic - Tests verify queueing and persistence but don't test the actual `retryAll()` method.
- Hardcoded timeout value - The 30-second timeout in `retryRequest` should be configurable.
- No exponential backoff - Consider implementing exponential backoff to reduce server load during outages (see the sketch after this list).
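As referenced above, a sketch of the backoff suggestion, using a capped exponential delay with jitter; `backoffDelayMs` is a hypothetical helper, not part of the PR:

```typescript
// Hypothetical backoff schedule: delay doubles per attempt, is capped, and
// adds jitter so clients don't retry in lockstep during an outage.
function backoffDelayMs(retryCount: number, baseMs = 1_000, capMs = 60_000): number {
	const exponential = Math.min(capMs, baseMs * 2 ** retryCount)
	return exponential / 2 + Math.random() * (exponential / 2) // "equal jitter"
}

// e.g. attempt 0 → ~0.5–1s, attempt 3 → ~4–8s, attempt 6 → ~30–60s (capped)
```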
- Fix retry order to use consistent FIFO processing
- Add retry limit enforcement with max retries check
- Add configurable request timeout (default 30s)
- Add comprehensive tests for retryAll() method
- Add request-max-retries-exceeded event
- Fix timeout test to avoid timing issues
- Handle HTTP error status codes (500s, 401/403, 429) as failures that trigger retry
- Remove queuing of backfill operations since they're user-initiated
- Fix race condition in concurrent retry processing with isProcessing flag
- Add specialized retry logic for 429 with Retry-After header support
- Clean up unnecessary comments
- Add comprehensive tests for new status code handling
- Add temporary debug logs with emojis for testing
- Remove unused X-Organization-Id header from auth header provider
- Simplify enqueue() API by removing operation parameter
- Fix error retry logic: only retry 5xx, 429, and network failures (see the sketch below)
- Stop retrying 4xx client errors (400, 401, 403, 404, 422)
- Implement queue-wide pause for 429 rate limiting
- Add auth state management integration:
  - Pause queue when not in active-session
  - Clear queue on logout or user change
  - Preserve queue when same user logs back in
- Remove debug comments
- Fix ESLint no-case-declarations error with proper block scope
- Update tests for all new behaviors
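The status-code policy this commit describes could look roughly like the following; `shouldRetryResponse` is an illustrative name, not the PR's function:

```typescript
// Hypothetical classification matching the policy above: transient failures
// are retried, client errors are not.
function shouldRetryResponse(response: Response): boolean {
	if (response.ok) return false // 2xx: success, nothing to retry
	if (response.status === 429) return true // rate limited: retry after a pause
	if (response.status >= 500) return true // server errors are transient
	return false // 4xx (400, 401, 403, 404, 422, ...): retrying won't help
}
```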
```typescript
 * - Clear queue when user logs out or logs in as different user
 * - Resume queue when returning to active-session with same user
 */
private handleAuthStateChangeForRetryQueue(data: AuthStateChangedPayload): void {
```
`handleAuthStateChangeForRetryQueue` cleanly manages pausing, clearing, and resuming the retry queue on different auth states. Consider refactoring the duplicate `resume()` calls (lines 417–424) for clarity.
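One possible shape for that refactor: handle clear/pause/resume per auth state with a single `resume()` call. Everything below is a sketch; the payload fields and `RetryQueueLike` interface are assumptions, not the PR's types:

```typescript
// Hypothetical refactor sketch; RetryQueueLike and the payload fields
// (`state`, `userId`, `previousUserId`) are assumptions, not the PR's API.
interface RetryQueueLike {
	pause(): void
	resume(): void
	clear(): void
}

interface AuthStateChangedPayloadSketch {
	state: "active-session" | "logged-out" | "attempting-session"
	userId?: string
	previousUserId?: string
}

class CloudServiceSketch {
	constructor(private retryQueue: RetryQueueLike) {}

	handleAuthStateChangeForRetryQueue(data: AuthStateChangedPayloadSketch): void {
		switch (data.state) {
			case "active-session": {
				// Different user: drop the previous user's queued events first.
				if (data.previousUserId && data.previousUserId !== data.userId) {
					this.retryQueue.clear()
				}
				// A single resume() call replaces the duplicated branches.
				this.retryQueue.resume()
				break
			}
			case "logged-out":
				this.retryQueue.clear()
				this.retryQueue.pause()
				break
			default:
				// Any other non-active state: hold queued events without sending.
				this.retryQueue.pause()
		}
	}
}
```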
```typescript
// Check if we got a 429 rate limiting response
if (response && response.status === 429) {
	const retryAfter = response.headers.get("Retry-After")
	if (retryAfter) {
```
In the 429 rate limiting branch, the code sets a global pause using the Retry-After header if present. Consider applying a default pause delay if the header is missing to avoid immediate reprocessing.
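A sketch of that suggestion with a fallback delay; `pauseQueueUntil` and the 60-second default are assumptions standing in for the PR's queue-wide pause mechanism:

```typescript
// Hypothetical 429 handling with a fallback pause; `pauseQueueUntil` is an
// assumed helper, not the PR's actual API.
const DEFAULT_RATE_LIMIT_PAUSE_MS = 60_000

function handleRateLimit(response: Response, pauseQueueUntil: (epochMs: number) => void): void {
	if (response.status !== 429) return
	const retryAfter = response.headers.get("Retry-After")
	// Retry-After is either delay-seconds or an HTTP date; handle the seconds
	// form here and fall back to the default for dates or a missing header.
	const seconds = retryAfter ? Number(retryAfter) : NaN
	const pauseMs = Number.isFinite(seconds) ? seconds * 1_000 : DEFAULT_RATE_LIMIT_PAUSE_MS
	pauseQueueUntil(Date.now() + pauseMs)
}
```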
Summary
Implements a retry queue for cloud telemetry events to prevent data loss during network failures.
Changes
Implementation Details
Testing
Context
This feature was migrated from RooCodeInc/Roo-Code-Cloud#744 to ensure telemetry data isn't lost during network failures or temporary server issues.
The retry queue activates when telemetry events fail due to network errors. Events are retried until successful or evicted by newer events.
Important

Introduces `RetryQueue` for retrying failed telemetry events in `CloudTelemetryClient`, with persistence and auth state handling.

- Adds `RetryQueue` to handle failed telemetry events in `CloudTelemetryClient`.
- `CloudService` initializes `RetryQueue` with an auth header provider.
- `CloudTelemetryClient` uses `RetryQueue` to queue failed requests.
- Adds tests covering `RetryQueue` functionality.
- Exports `RetryQueue` and related types in `index.ts`.