@daniel-lxs daniel-lxs commented Sep 2, 2025

Summary

Implements a retry queue for cloud telemetry events to prevent data loss during network failures.

Changes

  • Add RetryQueue class with workspace-scoped persistence
  • Queue failed telemetry events for automatic retry
  • Retry events every 60 seconds with fresh auth tokens
  • FIFO eviction when queue reaches 100 events
  • Persist queue across VS Code restarts
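The FIFO eviction rule in the list above can be illustrated with a small bounded queue. This is only a sketch: `BoundedFifoQueue`, `QueuedRequest`, and its fields are hypothetical names chosen for illustration, not the PR's actual `RetryQueue` API, and persistence is left out.

```typescript
// Hypothetical shape of a queued telemetry request; the field names are assumptions.
interface QueuedRequest {
	id: string
	url: string
	body: string
	timestamp: number
}

class BoundedFifoQueue {
	private items: QueuedRequest[] = []
	private readonly maxSize: number

	constructor(maxSize: number = 100) {
		this.maxSize = maxSize
	}

	// Append a request. When the queue is already full, evict the oldest entry first
	// and return it so the caller can log or count the drop.
	enqueue(request: QueuedRequest): QueuedRequest | undefined {
		let evicted: QueuedRequest | undefined
		if (this.items.length >= this.maxSize) {
			evicted = this.items.shift()
		}
		this.items.push(request)
		return evicted
	}

	// Remove a request once it has been delivered successfully.
	remove(id: string): void {
		this.items = this.items.filter((item) => item.id !== id)
	}

	// Copy of the queue in oldest-first order, suitable for persisting or retrying.
	snapshot(): QueuedRequest[] {
		return [...this.items]
	}
}
```

Returning the evicted entry is a design choice of this sketch; it keeps the eviction observable without coupling the queue to a logger.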

Implementation Details

  • RetryQueue: Manages failed requests with persistence and automatic retry logic
  • CloudTelemetryClient: Integrates retry queue to catch network errors
  • CloudService: Initializes retry queue with auth header provider

Testing

  • All existing tests pass (229 tests)
  • Added 8 new tests for RetryQueue functionality
  • TypeScript compilation successful

Context

This feature was migrated from RooCodeInc/Roo-Code-Cloud#744 to ensure telemetry data isn't lost during network failures or temporary server issues.

The retry queue activates when telemetry events fail due to network errors. Events are retried until successful or evicted by newer events.


Important

Introduces RetryQueue for retrying failed telemetry events in CloudTelemetryClient, with persistence and auth state handling.

  • Behavior:
    • Introduces RetryQueue to handle failed telemetry events in CloudTelemetryClient.
    • Retries events every 60 seconds with fresh auth tokens.
    • Supports FIFO eviction when queue reaches 100 events.
    • Persists queue across VS Code restarts.
    • Handles network errors, server errors (5xx), and rate limiting (429).
    • Does not retry on client errors (4xx) except 429.
  • Integration:
    • CloudService initializes RetryQueue with auth header provider.
    • CloudTelemetryClient uses RetryQueue to queue failed requests.
  • Testing:
    • Adds 8 new tests for RetryQueue functionality.
    • Ensures queue handles auth state changes, network errors, and rate limits.
  • Misc:
    • Exports RetryQueue and related types in index.ts.

This description was created by Ellipsis for f4242ca.

- Implement RetryQueue class with workspace-scoped persistence
- Queue failed telemetry events for automatic retry
- Retry events every 60 seconds with fresh auth tokens
- FIFO eviction when queue reaches 100 events
- Persist queue across VS Code restarts

This ensures telemetry data isn't lost during network failures or temporary server issues.
Migrated from RooCodeInc/Roo-Code-Cloud#744
@dosubot dosubot bot added the size:XL (This PR changes 500-999 lines, ignoring generated files.) and enhancement (New feature or request) labels Sep 2, 2025
@daniel-lxs daniel-lxs moved this from Triage to PR [Needs Prelim Review] in Roo Code Roadmap Sep 2, 2025
@roomote roomote bot left a comment

Thank you for implementing the telemetry retry queue! This is a valuable feature for network resilience. I've reviewed the changes and have some feedback:

Critical Issues (Must Fix):

  1. Unusual retry order logic - The retryAll() method in RetryQueue.ts processes the newest request first, then switches to oldest-first for remaining requests. This could lead to unpredictable retry behavior. Consider using consistent FIFO order.

  2. Missing error detection for retry-worthy failures - The retry queue only activates for TypeError with "fetch failed" message in TelemetryClient.ts. This misses other network errors like timeouts, DNS failures, or connection resets.

Important Suggestions (Should Consider):

  1. Potential memory leak with FormData - The backfillMessages method creates FormData with potentially large message arrays, but the retry mechanism might not handle multipart/form-data correctly when serialized.

  2. Missing retry limit enforcement - The maxRetries config is defined but never used. Failed requests could retry indefinitely.

  3. Race condition in retryAll() - The isProcessing flag prevents concurrent retry attempts, but multiple timer triggers could be lost.

Minor Improvements:

  1. Missing tests for retry logic - Tests verify queueing and persistence but don't test the actual retryAll() method.

  2. Hardcoded timeout value - The 30-second timeout in retryRequest should be configurable.

  3. No exponential backoff - Consider implementing exponential backoff to reduce server load during outages.
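For reference, the exponential backoff suggested in the last item could take roughly the following shape. The function name and the default base and cap values are assumptions made for this sketch; nothing in the PR specifies them.

```typescript
// Delay before the next retry: baseDelayMs * 2^attempt, capped at maxDelayMs.
// The defaults (1 second base, 60 second cap) are illustrative assumptions.
function backoffDelayMs(attempt: number, baseDelayMs = 1000, maxDelayMs = 60000): number {
	// Treat negative or fractional attempt counts defensively.
	const exponent = Math.max(0, Math.floor(attempt))
	return Math.min(maxDelayMs, baseDelayMs * Math.pow(2, exponent))
}
```

With these defaults, attempts 0, 1, 2, 3 wait 1 s, 2 s, 4 s, and 8 s, and the delay stops growing once it reaches the cap.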

- Fix retry order to use consistent FIFO processing
- Add retry limit enforcement with max retries check
- Add configurable request timeout (default 30s)
- Add comprehensive tests for retryAll() method
- Add request-max-retries-exceeded event
- Fix timeout test to avoid timing issues
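A simplified picture of what consistent FIFO processing with a retry limit can look like is sketched below. It is not the PR's `retryAll()` implementation: `PendingRequest`, `retryAllOnce`, and the synchronous `send` callback are hypothetical stand-ins, and the real method is presumably asynchronous.

```typescript
// Hypothetical queued item; only the retry counter matters for this sketch.
interface PendingRequest {
	id: string
	retryCount: number
}

interface RetryPassResult {
	remaining: PendingRequest[] // still queued, in original (oldest-first) order
	dropped: PendingRequest[] // gave up after reaching maxRetries
}

// One retry pass over the queue in strict oldest-first order.
// `send` returns true when delivery succeeded (synchronous here for simplicity).
function retryAllOnce(
	queue: PendingRequest[],
	send: (request: PendingRequest) => boolean,
	maxRetries: number,
): RetryPassResult {
	const remaining: PendingRequest[] = []
	const dropped: PendingRequest[] = []
	for (const request of queue) {
		if (send(request)) {
			continue // delivered, so it leaves the queue
		}
		const next = { ...request, retryCount: request.retryCount + 1 }
		if (next.retryCount >= maxRetries) {
			dropped.push(next) // a caller could emit a "max retries exceeded" event here
		} else {
			remaining.push(next)
		}
	}
	return { remaining, dropped }
}
```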
@daniel-lxs daniel-lxs moved this from PR [Needs Prelim Review] to PR [Needs Review] in Roo Code Roadmap Sep 9, 2025
- Handle HTTP error status codes (500s, 401/403, 429) as failures that trigger retry
- Remove queuing of backfill operations since they're user-initiated
- Fix race condition in concurrent retry processing with isProcessing flag
- Add specialized retry logic for 429 with Retry-After header support
- Clean up unnecessary comments
- Add comprehensive tests for new status code handling
- Add temporary debug logs with emojis for testing
@daniel-lxs daniel-lxs moved this from PR [Needs Review] to PR [Changes Requested] in Roo Code Roadmap Sep 10, 2025
- Remove unused X-Organization-Id header from auth header provider
- Simplify enqueue() API by removing operation parameter
- Fix error retry logic: only retry 5xx, 429, and network failures
- Stop retrying 4xx client errors (400, 401, 403, 404, 422)
- Implement queue-wide pause for 429 rate limiting
- Add auth state management integration:
  - Pause queue when not in active-session
  - Clear queue on logout or user change
  - Preserve queue when same user logs back in
- Remove debug comments
- Fix ESLint no-case-declarations error with proper block scope
- Update tests for all new behaviors
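The retry rule described in this commit (retry 5xx, 429, and network failures; never retry other 4xx responses) reduces to a small predicate. The sketch below uses hypothetical names (`SendOutcome`, `shouldRetry`) and is not taken from the PR's code.

```typescript
// Outcome of a send attempt: an HTTP status code, or "network-error" when no
// response arrived at all. This type is an assumption made for the sketch.
type SendOutcome = number | "network-error"

function shouldRetry(outcome: SendOutcome): boolean {
	if (outcome === "network-error") {
		return true // the request never reached the server, or the response was lost
	}
	if (outcome === 429) {
		return true // rate limited: retry later, after the queue-wide pause
	}
	// Server errors are retried; everything else (2xx, 3xx, other 4xx) is not.
	return outcome >= 500 && outcome <= 599
}
```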
@dosubot dosubot bot added the size:XXL (This PR changes 1000+ lines, ignoring generated files.) label and removed the size:XL (This PR changes 500-999 lines, ignoring generated files.) label Sep 11, 2025
* - Clear queue when user logs out or logs in as different user
* - Resume queue when returning to active-session with same user
*/
private handleAuthStateChangeForRetryQueue(data: AuthStateChangedPayload): void {

handleAuthStateChangeForRetryQueue cleanly manages pausing, clearing, and resuming the retry queue on different auth states. Consider refactoring the duplicate resume() calls (lines 417–424) for clarity.
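One way to end up with a single `resume()` call is to handle the clearing cases first and let every path that reaches the active session fall through to one resume. The sketch below illustrates that structure only. Apart from `active-session`, the state names are assumptions, as are the `RetryQueueControls` interface and the controller class; the real handler receives an `AuthStateChangedPayload` whose shape is not shown in this conversation.

```typescript
// Only "active-session" is named in this PR; the other states are assumptions.
type AuthState = "active-session" | "attempting-session" | "logged-out"

// Minimal hypothetical view of the retry queue used by the handler.
interface RetryQueueControls {
	pause(): void
	resume(): void
	clear(): void
}

class AuthAwareQueueController {
	private lastUserId: string | undefined
	private readonly queue: RetryQueueControls

	constructor(queue: RetryQueueControls) {
		this.queue = queue
	}

	handle(state: AuthState, userId?: string): void {
		if (state === "logged-out") {
			// Explicit logout: queued events must not be sent for a later user.
			this.queue.clear()
			this.queue.pause()
			this.lastUserId = undefined
			return
		}
		if (state !== "active-session") {
			// No usable session yet: keep the events but stop sending.
			this.queue.pause()
			return
		}
		// Active session: drop events that belong to a different user.
		if (this.lastUserId !== undefined && userId !== this.lastUserId) {
			this.queue.clear()
		}
		this.lastUserId = userId
		this.queue.resume() // the only resume() call
	}
}
```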

// Check if we got a 429 rate limiting response
if (response && response.status === 429) {
	const retryAfter = response.headers.get("Retry-After")
	if (retryAfter) {

In the 429 rate limiting branch, the code sets a global pause using the Retry-After header if present. Consider applying a default pause delay if the header is missing to avoid immediate reprocessing.
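A fallback of this kind could look like the sketch below. The function name and the 60-second default are assumptions chosen for illustration, not values from the PR. The sketch also accepts both forms that HTTP allows for `Retry-After`: a number of seconds or an HTTP date.

```typescript
// Hypothetical default pause used when Retry-After is missing or unparseable.
const DEFAULT_RATE_LIMIT_PAUSE_MS = 60000

// Convert a Retry-After header value into a pause duration in milliseconds.
function rateLimitPauseMs(retryAfter: string | null, nowMs: number = Date.now()): number {
	if (retryAfter === null || retryAfter.trim() === "") {
		return DEFAULT_RATE_LIMIT_PAUSE_MS
	}
	const value = retryAfter.trim()
	// delay-seconds form, e.g. "120"
	if (/^\d+$/.test(value)) {
		return Number(value) * 1000
	}
	// HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
	const dateMs = Date.parse(value)
	if (!Number.isNaN(dateMs)) {
		return Math.max(0, dateMs - nowMs) // a date in the past means no further wait
	}
	return DEFAULT_RATE_LIMIT_PAUSE_MS
}
```

Passing `nowMs` explicitly keeps the function deterministic in tests; production callers can rely on the `Date.now()` default.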

@daniel-lxs daniel-lxs moved this from PR [Changes Requested] to PR [Needs Review] in Roo Code Roadmap Sep 11, 2025

Labels

  • enhancement (New feature or request)
  • lgtm (This PR has been approved by a maintainer)
  • PR - Needs Review
  • size:XXL (This PR changes 1000+ lines, ignoring generated files.)