feat: add persistent retry queue for failed telemetry events (#4940) #6567
Conversation
- Implement TelemetryQueueManager with VSCode storage persistence
- Add ConnectionMonitor for real-time connection status tracking
- Queue failed telemetry events with exponential backoff retry
- Add priority queue for error events
- Show connection status in account view UI
- Add comprehensive test coverage (43 tests)
- Include feature flag for safe rollout
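The exact retry schedule is not shown in this conversation; as a rough illustration of what "exponential backoff retry" typically means here, a minimal sketch (the constants and helper name below are hypothetical, not identifiers from this PR):

```typescript
// Hypothetical sketch of an exponential-backoff delay for queued telemetry retries.
// BASE_RETRY_DELAY_MS, MAX_RETRY_DELAY_MS, and nextRetryDelayMs are illustrative names.
const BASE_RETRY_DELAY_MS = 1_000
const MAX_RETRY_DELAY_MS = 5 * 60_000

function nextRetryDelayMs(retryCount: number): number {
	// Double the delay on each failed attempt, capped at a maximum.
	const delay = BASE_RETRY_DELAY_MS * 2 ** retryCount
	// A little jitter keeps queued clients from retrying in lockstep.
	const jitter = Math.random() * BASE_RETRY_DELAY_MS
	return Math.min(delay + jitter, MAX_RETRY_DELAY_MS)
}
```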
Pull Request Overview
This PR implements a persistent telemetry queue system to ensure no telemetry data is lost during network outages or API downtime. The system automatically queues failed telemetry events and retries them when connectivity is restored.
- Adds persistent storage for failed telemetry events with priority handling and exponential backoff
- Implements real-time connection monitoring with health checks every 30 seconds (a rough sketch of this pattern follows this list)
- Integrates queue management seamlessly into existing telemetry workflow with feature flag control
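For readers unfamiliar with the pattern, a minimal sketch of a periodic health check that notifies only on status transitions; the class name, endpoint URL, and callback shape below are assumptions for illustration, not the PR's actual ConnectionMonitor API:

```typescript
// Minimal sketch of a periodic connection health check (illustrative only; the real
// ConnectionMonitor in this PR may differ in API and endpoint).
type ConnectionListener = (isOnline: boolean) => void

class SimpleConnectionMonitor {
	private timer: NodeJS.Timeout | null = null
	private isOnline = true

	constructor(
		private readonly listener: ConnectionListener,
		private readonly intervalMs = 30_000, // health check every 30 seconds
	) {}

	public start(): void {
		this.timer = setInterval(() => void this.check(), this.intervalMs)
	}

	private async check(): Promise<void> {
		const controller = new AbortController()
		const timeoutId = setTimeout(() => controller.abort(), 5_000)
		try {
			// "https://example.invalid/health" is a placeholder endpoint.
			const response = await fetch("https://example.invalid/health", { signal: controller.signal })
			this.update(response.ok)
		} catch {
			this.update(false)
		} finally {
			clearTimeout(timeoutId)
		}
	}

	private update(isOnline: boolean): void {
		if (isOnline !== this.isOnline) {
			this.isOnline = isOnline
			this.listener(isOnline) // Notify only when the status actually changes.
		}
	}

	public dispose(): void {
		if (this.timer) {
			clearInterval(this.timer)
		}
	}
}
```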
Reviewed Changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| packages/cloud/src/TelemetryQueueManager.ts | Core queue management with persistent storage, retry logic, and batch processing |
| packages/cloud/src/ConnectionMonitor.ts | Real-time connection status monitoring with event-driven notifications |
| packages/cloud/src/TelemetryClient.ts | Enhanced telemetry client with queue integration and error handling |
| packages/cloud/src/CloudService.ts | Service integration with connection monitoring and queue processing |
| webview-ui/src/components/account/AccountView.tsx | UI component showing offline status warning to users |
| packages/types/src/telemetry.ts | Type definitions for queued telemetry events and state |
| packages/types/src/global-settings.ts | Settings schema for queue storage and feature flag |
```typescript
this.processQueueDebounceTimer = setTimeout(() => {
```
**Copilot AI** (Aug 1, 2025)
Consider using AbortController to handle cleanup of the timeout more robustly, especially since this is in an async context where the instance might be disposed before the timeout executes.
Suggested change:
```typescript
if (this.processQueueAbortController) {
	this.processQueueAbortController.abort()
}
this.processQueueAbortController = new AbortController()
const signal = this.processQueueAbortController.signal
this.processQueueDebounceTimer = setTimeout(() => {
	if (signal.aborted) {
		return
	}
```
```typescript
public async checkConnection(): Promise<boolean> {
	try {
		const controller = new AbortController()
		const timeoutId = setTimeout(() => controller.abort(), 5000) // 5 second timeout
```
**Copilot AI** (Aug 1, 2025)
The timeout value of 5000ms is hardcoded. Consider making this configurable or storing it as a class constant for better maintainability.
Suggested change:
```typescript
const timeoutId = setTimeout(() => controller.abort(), this.defaultTimeoutMs) // 5 second timeout
```
```typescript
if (storedQueue.length > this.maxQueueSize * 2) {
	this.log("[TelemetryQueueManager] Queue size exceeds safety limit, truncating to max size")
	this.queue = (storedQueue as QueuedTelemetryEvent[]).slice(-this.maxQueueSize)
```
**Copilot AI** (Aug 1, 2025)
The queue validation multiplies maxQueueSize by 2 without bounds checking. Consider adding an absolute maximum limit to prevent potential memory exhaustion attacks through corrupted storage.
Suggested change:
```typescript
const effectiveMaxSize = Math.min(this.maxQueueSize, TelemetryQueueManager.ABSOLUTE_MAX_QUEUE_SIZE)
if (storedQueue.length > effectiveMaxSize) {
	this.log("[TelemetryQueueManager] Queue size exceeds safety limit, truncating to max allowed size")
	this.queue = (storedQueue as QueuedTelemetryEvent[]).slice(-effectiveMaxSize)
```
```typescript
 */
public async addToQueue(event: TelemetryEvent, priority: "high" | "normal" = "normal"): Promise<void> {
	const queuedEvent: QueuedTelemetryEvent = {
		id: crypto.randomUUID(),
```
**Copilot AI** (Aug 1, 2025)
crypto.randomUUID() may not be available in all environments. Consider checking for availability and providing a fallback implementation.
Suggested change:
```typescript
id: generateUUID(),
```
packages/cloud/src/CloudService.ts (Outdated)
```typescript
if (isQueueEnabled) {
	// Set up connection monitoring with debouncing
	let connectionRestoredDebounceTimer: NodeJS.Timeout | null = null
```
**Copilot AI** (Aug 1, 2025)
The debounce timer is declared in a closure but not properly cleaned up on dispose. This could lead to memory leaks if the service is disposed while a timer is pending.
Thank you for your contribution! I've reviewed the changes and found several issues that need attention, including unresolved comments from the previous review.
```typescript
}
this.settingsService.dispose()
}
if (this.connectionMonitor) {
```
Critical: Memory leak - The connectionRestoredDebounceTimer declared on line 113 is not cleaned up in the dispose() method. This could cause memory leaks if the service is disposed while a timer is pending.
Consider adding cleanup in the dispose method:
Suggested change:
```typescript
if (this.connectionMonitor) {
	this.connectionMonitor.dispose();
}
// Clean up any pending debounce timer
if (connectionRestoredDebounceTimer) {
	clearTimeout(connectionRestoredDebounceTimer);
}
```
```typescript
clearTimeout(this.processQueueDebounceTimer)
}
// ...
this.processQueueDebounceTimer = setTimeout(() => {
```
The previous suggestion about using AbortController for more robust timeout handling hasn't been addressed. The current implementation could have issues if the instance is disposed before the timeout executes. Would you consider implementing the AbortController pattern for better cleanup?
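For illustration, one dispose-safe shape the debounced scheduling could take; the class and member names below are hypothetical, not the PR's actual TelemetryQueueManager code:

```typescript
// Sketch of an AbortController-guarded debounce that is safe to dispose.
// QueueScheduler and its members are illustrative names, not code from this PR.
class QueueScheduler {
	private abortController?: AbortController
	private debounceTimer?: NodeJS.Timeout

	constructor(private readonly processQueue: () => Promise<void>) {}

	public schedule(delayMs: number): void {
		// Cancel any previously scheduled run before scheduling a new one.
		this.abortController?.abort()
		this.abortController = new AbortController()
		const { signal } = this.abortController

		this.debounceTimer = setTimeout(() => {
			if (signal.aborted) {
				return // Disposed (or rescheduled) before the timer fired.
			}
			void this.processQueue()
		}, delayMs)
	}

	public dispose(): void {
		this.abortController?.abort()
		if (this.debounceTimer) {
			clearTimeout(this.debounceTimer)
		}
	}
}
```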
```typescript
public async checkConnection(): Promise<boolean> {
	try {
		const controller = new AbortController()
		const timeoutId = setTimeout(() => controller.abort(), 5000) // 5 second timeout
```
The timeout value is still hardcoded as 5000ms. Consider making this configurable by adding a class constant:
Suggested change:
```typescript
private readonly defaultTimeoutMs = 5000

const timeoutId = setTimeout(() => controller.abort(), this.defaultTimeoutMs)
```
```typescript
if (storedQueue && Array.isArray(storedQueue)) {
	// Add validation for queue size to prevent memory issues
	if (storedQueue.length > this.maxQueueSize * 2) {
```
Security concern: The queue validation multiplies maxQueueSize by 2 without bounds checking. This could lead to memory exhaustion if the stored queue is corrupted. Consider adding an absolute maximum:
Suggested change:
```typescript
private static readonly ABSOLUTE_MAX_QUEUE_SIZE = 5000;
// ...
const effectiveMaxSize = Math.min(this.maxQueueSize * 2, TelemetryQueueManager.ABSOLUTE_MAX_QUEUE_SIZE);
if (storedQueue.length > effectiveMaxSize) {
```
```typescript
 */
public async addToQueue(event: TelemetryEvent, priority: "high" | "normal" = "normal"): Promise<void> {
	const queuedEvent: QueuedTelemetryEvent = {
		id: crypto.randomUUID(),
```
crypto.randomUUID() may not be available in all environments. Consider adding a fallback:
Suggested change:
```typescript
id: typeof crypto !== 'undefined' && crypto.randomUUID ? crypto.randomUUID() : `fallback-${Date.now()}-${Math.random().toString(36).substr(2, 9)}`,
```
```typescript
}
// ...
// Process each event individually to maintain compatibility
for (const queuedEvent of events) {
```
Error handling could be improved in processBatchedEvents. Currently, if one event fails validation, it continues but if fetch fails for one event, it could affect the entire batch. Consider wrapping individual event processing in try-catch to ensure one bad event doesn't stop the entire batch.
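A sketch of the per-event isolation being asked for; the types and helper signature here are placeholders rather than the PR's actual processBatchedEvents implementation:

```typescript
// Sketch: wrap each event in its own try/catch so one failure doesn't abort the batch.
// QueuedTelemetryEvent and sendEvent are placeholder shapes for illustration.
interface QueuedTelemetryEvent {
	id: string
	retryCount: number
	// ...payload fields omitted
}

async function processEventsIndividually(
	events: QueuedTelemetryEvent[],
	sendEvent: (event: QueuedTelemetryEvent) => Promise<void>,
): Promise<QueuedTelemetryEvent[]> {
	const failed: QueuedTelemetryEvent[] = []
	for (const queuedEvent of events) {
		try {
			await sendEvent(queuedEvent)
		} catch {
			// Keep the failed event for a later retry instead of stopping the whole batch.
			failed.push({ ...queuedEvent, retryCount: queuedEvent.retryCount + 1 })
		}
	}
	return failed
}
```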
```typescript
// Check if telemetry queue is enabled
let isQueueEnabled = true
try {
	const { ContextProxy } = await import("../../../src/core/config/ContextProxy")
```
The dynamic import for ContextProxy could fail. Consider adding more robust error handling:
Suggested change:
```typescript
try {
	const { ContextProxy } = await import("../../../src/core/config/ContextProxy")
	isQueueEnabled = ContextProxy.instance.getValue("telemetryQueueEnabled") ?? true
} catch (error) {
	// Default to enabled if we can't access settings
	this.log("[CloudService] Could not access telemetryQueueEnabled setting:", error)
	isQueueEnabled = true
}
```
- Added offline warning translations for 17 additional languages
- Covers all supported locales: ca, de, es, fr, hi, id, it, ja, ko, nl, pl, pt-BR, ru, tr, vi, zh-CN, zh-TW
| "cloudBenefitMetrics": "Mètriques d'ús basades en tasques, tokens i costos", | ||
| "visitCloudWebsite": "Visita Roo Code Cloud" | ||
| "visitCloudWebsite": "Visita Roo Code Cloud", | ||
| "offlineWarning": "Ara mateix estàs sense connexió. Els esdeveniments de telemetria s'encularan i s'enviaran quan es restableixi la connexió." |
Typographical error: In the offlineWarning message, the word “s'encularan” appears to be a misspelling. It likely should be “s'encolaran” to correctly convey that telemetry events will be queued when the connection is restored.
| "offlineWarning": "Ara mateix estàs sense connexió. Els esdeveniments de telemetria s'encularan i s'enviaran quan es restableixi la connexió." | |
| "offlineWarning": "Ara mateix estàs sense connexió. Els esdeveniments de telemetria s'encolaran i s'enviaran quan es restableixi la connexió." |
- Fix memory leak by properly cleaning up connectionRestoredDebounceTimer in CloudService dispose()
- Implement AbortController pattern in TelemetryClient for robust timeout handling
- Make timeout value configurable using class constant in ConnectionMonitor
- Add security bounds checking with ABSOLUTE_MAX_QUEUE_SIZE to prevent memory exhaustion
- Add fallback for crypto.randomUUID() for environments where it's not available
- Improve error handling in processBatchedEvents() to handle individual event failures
- Add proper error logging for dynamic import failures in CloudService
- Fix typo in Catalan translation (s'encularan -> s'encolaran)
Related GitHub Issue
Closes: #4940
Roo Code Task Context (Optional)
No Roo Code task context for this PR
Description
This PR implements a persistent telemetry queue system that stores failed telemetry events locally and retries them when the connection to Roo Code Cloud is restored. This ensures no telemetry data is lost during network outages or API downtime.
Key Implementation Details:
Review Fixes Applied:
During internal review, the following critical issues were identified and fixed:
Test Procedure
Automated Tests:
Manual Testing Steps:
"telemetryQueueEnabled": true(default)Pre-Submission Checklist
Screenshots / Videos
No UI changes in this PR (only connection status indicator in account view)
Documentation Updates
Additional Notes
Performance Considerations:
Future Enhancements:
Known Limitations:
Get in Touch
@hannesrudolph
Important
Adds a persistent retry queue for telemetry events with connection monitoring and UI updates for offline status.
- `TelemetryQueueManager` for persistent storage and retry of failed telemetry events with exponential backoff.
- `ConnectionMonitor` for real-time connection status checks every 30 seconds.
- `CloudService` updated to integrate telemetry queue and connection monitoring.
- Offline warning shown in `AccountView` when offline.
- `cloudIsOnline` state added to `ExtensionState` and `AccountView`.
- `App.tsx` updated to handle `cloudIsOnline` state.
- Tests for `ConnectionMonitor` and `TelemetryQueueManager`.
- `ExtensionStateContext` tests updated to include new state properties.
- `telemetryQueueEnabled` feature flag added to `global-settings.ts`.
- `TelemetryClient` updated to use a logging service instead of `console.error`.

This description was created by
for 448a4f0. You can customize this summary. It will automatically update as commits are pushed.