feat(cloud): Add telemetry retry queue for network resilience #7597

daniel-lxs · 2025-09-02T18:17:30Z

Summary

Implements a retry queue for cloud telemetry events to prevent data loss during network failures.

Changes

Add RetryQueue class with workspace-scoped persistence
Queue failed telemetry events for automatic retry
Retry events every 60 seconds with fresh auth tokens
FIFO eviction when queue reaches 100 events
Persist queue across VS Code restarts
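
As an illustration of the FIFO eviction described above, here is a minimal sketch of a bounded queue that drops its oldest entries once the cap is exceeded. The names `BoundedRetryQueue` and `QueuedRequest` and the field layout are assumptions made for this example, not the PR's actual `RetryQueue` API. Persistence and the retry timer are left out.

```typescript
// Hypothetical sketch: names and shapes are assumptions, not the PR's code.
interface QueuedRequest {
	id: string
	url: string
	body: string
	timestamp: number
}

class BoundedRetryQueue {
	private items: QueuedRequest[] = []
	private readonly maxSize: number

	// The default of 100 mirrors the limit stated in the PR description.
	constructor(maxSize: number = 100) {
		this.maxSize = maxSize
	}

	// Append a failed request, then evict oldest entries while over the cap.
	// Returns the evicted requests so a caller could log or count them.
	enqueue(request: QueuedRequest): QueuedRequest[] {
		this.items.push(request)
		const evicted: QueuedRequest[] = []
		while (this.items.length > this.maxSize) {
			evicted.push(this.items.shift()!)
		}
		return evicted
	}

	// Oldest-first copy of the queue contents.
	snapshot(): QueuedRequest[] {
		return [...this.items]
	}
}
```

Returning the evicted entries is a design choice of this sketch; it keeps eviction observable without coupling the queue to a logger.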

Implementation Details

RetryQueue: Manages failed requests with persistence and automatic retry logic
CloudTelemetryClient: Integrates retry queue to catch network errors
CloudService: Initializes retry queue with auth header provider

Testing

All existing tests pass (229 tests)
Added 8 new tests for RetryQueue functionality
TypeScript compilation successful

Context

This feature was migrated from RooCodeInc/Roo-Code-Cloud#744 to ensure telemetry data isn't lost during network failures or temporary server issues.

The retry queue activates when telemetry events fail due to network errors. Events are retried until successful or evicted by newer events.

Important

Introduces RetryQueue for retrying failed telemetry events in CloudTelemetryClient, with persistence and auth state handling.

Behavior:
- Introduces RetryQueue to handle failed telemetry events in CloudTelemetryClient.
- Retries events every 60 seconds with fresh auth tokens.
- Supports FIFO eviction when queue reaches 100 events.
- Persists queue across VS Code restarts.
- Handles network errors, server errors (5xx), and rate limiting (429).
- Does not retry on client errors (4xx) except 429.
Integration:
- CloudService initializes RetryQueue with auth header provider.
- CloudTelemetryClient uses RetryQueue to queue failed requests.
Testing:
- Adds 8 new tests for RetryQueue functionality.
- Ensures queue handles auth state changes, network errors, and rate limits.
Misc:
- Exports RetryQueue and related types in index.ts.
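
The retry rules in the Behavior list (retry on network errors, 5xx, and 429; do not retry other 4xx) can be sketched as a small classifier. The `SendOutcome` type and the function name `shouldQueueForRetry` are assumptions made for this example; the PR's own code may structure this differently.

```typescript
// Hypothetical sketch: the outcome type and function name are assumptions.
type SendOutcome =
	| { kind: "network-error" }
	| { kind: "http"; status: number }

// Decide whether a failed telemetry send should be queued for retry.
function shouldQueueForRetry(outcome: SendOutcome): boolean {
	// Network-level failures never reached the server, so they are retried.
	if (outcome.kind === "network-error") return true

	const status = outcome.status
	// 429 is the one client error that is retried (rate limiting).
	if (status === 429) return true
	// Server errors (5xx) are treated as temporary.
	return status >= 500 && status <= 599
}
```

Under these rules, 400, 401, 403, 404, and 422 responses are dropped rather than queued, which matches the later commit message in this thread.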

This description was created for f4242ca. You can customize this summary. It will automatically update as commits are pushed.

- Implement RetryQueue class with workspace-scoped persistence
- Queue failed telemetry events for automatic retry
- Retry events every 60 seconds with fresh auth tokens
- FIFO eviction when queue reaches 100 events
- Persist queue across VS Code restarts

This ensures telemetry data isn't lost during network failures or temporary server issues.

Migrated from RooCodeInc/Roo-Code-Cloud#744

roomote

Thank you for implementing the telemetry retry queue! This is a valuable feature for network resilience. I've reviewed the changes and have some feedback:

Critical Issues (Must Fix):

Unusual retry order logic - The retryAll() method in RetryQueue.ts processes the newest request first, then switches to oldest-first for remaining requests. This could lead to unpredictable retry behavior. Consider using consistent FIFO order.
Missing error detection for retry-worthy failures - The retry queue only activates for TypeError with "fetch failed" message in TelemetryClient.ts. This misses other network errors like timeouts, DNS failures, or connection resets.

Important Suggestions (Should Consider):

Potential memory leak with FormData - The backfillMessages method creates FormData with potentially large message arrays, but the retry mechanism might not handle multipart/form-data correctly when serialized.
Missing retry limit enforcement - The maxRetries config is defined but never used. Failed requests could retry indefinitely.
Race condition in retryAll() - The isProcessing flag prevents concurrent retry attempts, but multiple timer triggers could be lost.

Minor Improvements:

Missing tests for retry logic - Tests verify queueing and persistence but don't test the actual retryAll() method.
Hardcoded timeout value - The 30-second timeout in retryRequest should be configurable.
No exponential backoff - Consider implementing exponential backoff to reduce server load during outages.
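
For reference, the exponential backoff suggested in the last item could be sketched as follows. Nothing in this thread shows the suggestion was adopted; the function name `backoffDelayMs` and the one-hour cap are assumptions, and the 60-second base is taken from the retry interval in the PR description.

```typescript
// Hypothetical sketch: function name and the one-hour cap are assumptions.
function backoffDelayMs(
	attempt: number, // number of retries already made, starting at 0
	baseMs: number = 60_000, // 60 s, the retry interval from the PR description
	maxMs: number = 3_600_000, // assumed upper bound of one hour
): number {
	// Guard against negative or fractional attempt counts.
	const exponent = Math.max(0, Math.floor(attempt))
	// Double the delay on each attempt, but never exceed the cap.
	return Math.min(maxMs, baseMs * 2 ** exponent)
}
```

With these defaults the delays grow as 60 s, 120 s, 240 s, and so on until they reach the cap.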

- Fix retry order to use consistent FIFO processing
- Add retry limit enforcement with max retries check
- Add configurable request timeout (default 30s)
- Add comprehensive tests for retryAll() method
- Add request-max-retries-exceeded event
- Fix timeout test to avoid timing issues

- Handle HTTP error status codes (500s, 401/403, 429) as failures that trigger retry
- Remove queuing of backfill operations since they're user-initiated
- Fix race condition in concurrent retry processing with isProcessing flag
- Add specialized retry logic for 429 with Retry-After header support
- Clean up unnecessary comments
- Add comprehensive tests for new status code handling
- Add temporary debug logs with emojis for testing

- Remove unused X-Organization-Id header from auth header provider
- Simplify enqueue() API by removing operation parameter
- Fix error retry logic: only retry 5xx, 429, and network failures
- Stop retrying 4xx client errors (400, 401, 403, 404, 422)
- Implement queue-wide pause for 429 rate limiting
- Add auth state management integration:
  - Pause queue when not in active-session
  - Clear queue on logout or user change
  - Preserve queue when same user logs back in
- Remove debug comments
- Fix ESLint no-case-declarations error with proper block scope
- Update tests for all new behaviors

ellipsis-dev · 2025-09-11T15:05:06Z

packages/cloud/src/CloudService.ts

+	 * - Clear queue when user logs out or logs in as different user
+	 * - Resume queue when returning to active-session with same user
+	 */
+	private handleAuthStateChangeForRetryQueue(data: AuthStateChangedPayload): void {


handleAuthStateChangeForRetryQueue cleanly manages pausing, clearing, and resuming the retry queue on different auth states. Consider refactoring the duplicate resume() calls (lines 417–424) for clarity.

ellipsis-dev · 2025-09-11T15:05:06Z

packages/cloud/src/retry-queue/RetryQueue.ts

+					// Check if we got a 429 rate limiting response
+					if (response && response.status === 429) {
+						const retryAfter = response.headers.get("Retry-After")
+						if (retryAfter) {


In the 429 rate limiting branch, the code sets a global pause using the Retry-After header if present. Consider applying a default pause delay if the header is missing to avoid immediate reprocessing.
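
One way to apply the default pause suggested above is to compute the pause from the header value and fall back when it is missing or unparseable. This is a sketch only: the function name `pauseFromRetryAfter` and the 60-second default are assumptions, not code from the PR. HTTP allows `Retry-After` to be either a number of seconds or an HTTP date, so both forms are handled.

```typescript
// Hypothetical sketch: function name and default value are assumptions.
const DEFAULT_PAUSE_MS = 60_000 // assumed fallback, matching the 60 s retry interval

// Convert a Retry-After header value into a pause duration in milliseconds.
function pauseFromRetryAfter(
	retryAfter: string | null, // result of response.headers.get("Retry-After")
	nowMs: number = Date.now(),
	defaultMs: number = DEFAULT_PAUSE_MS,
): number {
	if (retryAfter === null) return defaultMs

	const value = retryAfter.trim()
	// delay-seconds form, e.g. "120"
	if (/^\d+$/.test(value)) return Number(value) * 1000

	// HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
	const dateMs = Date.parse(value)
	if (!Number.isNaN(dateMs)) return Math.max(0, dateMs - nowMs)

	// Unparseable header: fall back rather than retrying immediately.
	return defaultMs
}
```

Passing `nowMs` explicitly keeps the function deterministic in tests.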

daniel-lxs requested review from cte, jr and mrubens as code owners September 2, 2025 18:17

github-project-automation bot added this to Roo Code Roadmap and Roo Code Roadmap Sep 2, 2025

github-project-automation bot moved this to New in Roo Code Roadmap Sep 2, 2025

github-project-automation bot moved this to Triage in Roo Code Roadmap Sep 2, 2025

dosubot bot added the size:XL (This PR changes 500-999 lines, ignoring generated files) and enhancement (New feature or request) labels Sep 2, 2025

daniel-lxs moved this from Triage to PR [Needs Prelim Review] in Roo Code Roadmap Sep 2, 2025

hannesrudolph added the PR - Needs Preliminary Review label Sep 2, 2025

roomote bot reviewed Sep 2, 2025


daniel-lxs added 2 commits September 9, 2025 11:08

fix: resolve TypeScript errors in RetryQueue tests

59393d8

daniel-lxs moved this from PR [Needs Prelim Review] to PR [Needs Review] in Roo Code Roadmap Sep 9, 2025

hannesrudolph added PR - Needs Review and removed PR - Needs Preliminary Review labels Sep 9, 2025

daniel-lxs moved this from PR [Needs Review] to PR [Changes Requested] in Roo Code Roadmap Sep 10, 2025

hannesrudolph added PR - Changes Requested and removed PR - Needs Review labels Sep 11, 2025

dosubot bot added the size:XXL (This PR changes 1000+ lines, ignoring generated files) label and removed the size:XL (This PR changes 500-999 lines, ignoring generated files) label Sep 11, 2025

ellipsis-dev bot reviewed Sep 11, 2025


daniel-lxs moved this from PR [Changes Requested] to PR [Needs Review] in Roo Code Roadmap Sep 11, 2025

hannesrudolph added PR - Needs Review and removed PR - Changes Requested labels Sep 11, 2025

jr approved these changes Sep 16, 2025
