Commit 2170f58

Fixes #4940: Add persistent retry queue for failed telemetry events
- Implement TelemetryRetryQueue with persistent storage using VSCode globalState
- Add ResilientTelemetryClient wrapper for automatic retry functionality
- Implement exponential backoff retry strategy with configurable limits
- Add priority handling for critical events (errors, crashes)
- Provide connection status monitoring with visual feedback
- Add VSCode settings for user control of retry behavior
- Include comprehensive test coverage for all components
- Add status bar indicator and user notifications for connection issues
- Support manual retry triggers and queue management commands
- Ensure graceful degradation when retry system fails

Features:
- Persistent queue survives extension restarts and VSCode crashes
- Configurable retry limits, delays, and queue sizes
- High priority events (errors) are processed before normal events
- Batch processing for efficient network usage
- User notifications for prolonged disconnection
- Manual queue management through commands
- Comprehensive documentation and examples
1 parent 2e2f83b commit 2170f58

13 files changed: +2729 -21 lines changed

docs/telemetry-retry-queue.md

Lines changed: 364 additions & 0 deletions
@@ -0,0 +1,364 @@
# Telemetry Retry Queue

This document describes the persistent retry queue system for failed telemetry events in Roo Code.

## Overview

The telemetry retry queue keeps telemetry events from being lost to temporary network issues, server downtime, or other connectivity problems. Events that fail to send are stored and retried until they are delivered, exceed the retry limit, or are evicted because the queue is full. The system provides the following features:

- **Persistent Storage**: Events are stored locally using VSCode's globalState API and survive extension restarts
- **Exponential Backoff**: Failed events are retried with increasing delays to avoid overwhelming the server
- **Priority Handling**: Critical events (errors, crashes) are prioritized over routine analytics
- **Connection Monitoring**: Tracks connection status and provides user feedback
- **Configurable Behavior**: Users can control retry behavior through VSCode settings

## Architecture

### Components

1. **TelemetryRetryQueue**: Core queue management with persistent storage
2. **ResilientTelemetryClient**: Wrapper that adds retry functionality to any TelemetryClient
3. **Configuration Settings**: VSCode settings for user control
4. **Status Monitoring**: Visual feedback through status bar and notifications

### Flow

```
Telemetry Event → Immediate Send Attempt → Success? → Done
                                              ↓ Failure
                                      Add to Retry Queue
                                              ↓
                                   Periodic Retry Processing
                                              ↓
                                     Exponential Backoff
                                              ↓
                                    Success or Max Retries
```

## Configuration

### VSCode Settings

Users can configure the retry behavior through the following settings:

- `roo-cline.telemetryRetryEnabled` (boolean, default: true)
  - Enable/disable the retry queue system
- `roo-cline.telemetryRetryMaxRetries` (number, default: 5, range: 0-10)
  - Maximum number of retry attempts per event
- `roo-cline.telemetryRetryBaseDelay` (number, default: 1000ms, range: 100-10000ms)
  - Base delay between retry attempts (exponential backoff)
- `roo-cline.telemetryRetryMaxDelay` (number, default: 300000ms, range: 1000-600000ms)
  - Maximum delay between retry attempts (5 minutes default)
- `roo-cline.telemetryRetryQueueSize` (number, default: 1000, range: 10-10000)
  - Maximum number of events to queue for retry
- `roo-cline.telemetryRetryNotifications` (boolean, default: true)
  - Show notifications when connection issues are detected

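The documented defaults and ranges can be applied defensively before values reach the queue. The following is a minimal sketch, not the extension's actual code: `RetrySettings`, `DEFAULTS`, `RANGES`, and `clampRetrySettings` are hypothetical names, and whether the extension clamps out-of-range values (rather than rejecting them) is an assumption.

```typescript
// Hypothetical settings shape; field names follow the programmatic config in this document.
interface RetrySettings {
	maxRetries: number
	baseDelayMs: number
	maxDelayMs: number
	maxQueueSize: number
}

// Defaults and [min, max] ranges as listed in the settings above.
const DEFAULTS: RetrySettings = { maxRetries: 5, baseDelayMs: 1000, maxDelayMs: 300000, maxQueueSize: 1000 }

const RANGES: Record<keyof RetrySettings, [number, number]> = {
	maxRetries: [0, 10],
	baseDelayMs: [100, 10000],
	maxDelayMs: [1000, 600000],
	maxQueueSize: [10, 10000],
}

// Fill in missing values with defaults and clamp the rest into the documented range.
function clampRetrySettings(input: Partial<RetrySettings>): RetrySettings {
	const result: RetrySettings = { ...DEFAULTS }
	for (const key of Object.keys(RANGES) as (keyof RetrySettings)[]) {
		const value = input[key]
		if (typeof value === "number" && Number.isFinite(value)) {
			const [min, max] = RANGES[key]
			result[key] = Math.min(Math.max(value, min), max)
		}
	}
	return result
}
```

For example, `clampRetrySettings({ maxRetries: 50 })` would yield `maxRetries: 10` and leave every other field at its default.
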
### Programmatic Configuration

```typescript
import { TelemetryRetryQueue, RetryQueueConfig } from "@roo-code/telemetry"

const config: Partial<RetryQueueConfig> = {
	maxRetries: 3,
	baseDelayMs: 2000,
	maxDelayMs: 60000,
	maxQueueSize: 500,
	batchSize: 5,
	enableNotifications: false,
}

// `context` is presumably the extension's vscode.ExtensionContext,
// whose globalState backs the persistent queue.
const retryQueue = new TelemetryRetryQueue(context, config)
```

## Usage

### Basic Usage

The retry queue is automatically integrated into the telemetry system. No additional code is required for basic functionality:

```typescript
// This automatically uses the retry queue if the send fails
TelemetryService.instance.captureTaskCreated("task-123")
```

### Advanced Usage

For custom telemetry clients, wrap them with `ResilientTelemetryClient`:

```typescript
import { ResilientTelemetryClient } from "@roo-code/telemetry"

const originalClient = new MyTelemetryClient()
const resilientClient = new ResilientTelemetryClient(originalClient, context)

// Register with telemetry service
TelemetryService.instance.register(resilientClient)
```

### Manual Queue Management

```typescript
// Get queue status
const status = await resilientClient.getQueueStatus()
console.log(`Queue size: ${status.queueSize}`)
console.log(`Connected: ${status.connectionStatus.isConnected}`)

// Manually trigger retry
await resilientClient.retryNow()

// Clear queue
await resilientClient.clearQueue()

// Update configuration
resilientClient.updateRetryConfig({ maxRetries: 10 })
```

## Priority System

Events are automatically prioritized based on their importance:

### High Priority Events

- `SCHEMA_VALIDATION_ERROR`
- `DIFF_APPLICATION_ERROR`
- `SHELL_INTEGRATION_ERROR`
- `CONSECUTIVE_MISTAKE_ERROR`

### Normal Priority Events

- All other telemetry events (task creation, completion, etc.)

High priority events are:

- Processed before normal priority events
- Retained longer when queue size limits are reached
- Given preference during batch processing

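The priority rules above can be sketched as a classification step plus an ordering step. This is an illustration only: `priorityFor`, `sortByPriority`, and `QueuedItem` are hypothetical names, the event names are treated as plain strings exactly as listed above, and oldest-first ordering within a priority level is an assumption.

```typescript
type Priority = "high" | "normal"

// Hypothetical minimal queue entry; the real entry carries more fields.
interface QueuedItem {
	id: string
	priority: Priority
	timestamp: number
}

// Event names copied from the "High Priority Events" list above.
const HIGH_PRIORITY_EVENTS = new Set([
	"SCHEMA_VALIDATION_ERROR",
	"DIFF_APPLICATION_ERROR",
	"SHELL_INTEGRATION_ERROR",
	"CONSECUTIVE_MISTAKE_ERROR",
])

// Everything not explicitly listed is normal priority.
function priorityFor(eventName: string): Priority {
	return HIGH_PRIORITY_EVENTS.has(eventName) ? "high" : "normal"
}

// High priority first; within the same priority, oldest first (assumed tie-break).
function sortByPriority(items: QueuedItem[]): QueuedItem[] {
	const rank: Record<Priority, number> = { high: 0, normal: 1 }
	return [...items].sort((a, b) => rank[a.priority] - rank[b.priority] || a.timestamp - b.timestamp)
}
```

Sorting a copy rather than the stored array keeps the persisted queue order untouched until the caller decides to save it.
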
## Storage

### Persistence

Events are stored in VSCode's `globalState` under the key `telemetryRetryQueue`. This ensures:

- Data survives extension restarts
- Data survives VSCode crashes
- Data is automatically cleaned up when the extension is uninstalled

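Loading and saving the queue through `globalState` might look like the sketch below. `MementoLike` mirrors only the `get`/`update` subset of VS Code's `Memento` interface so that the example runs without the `vscode` module; `StoredEvent`, `isStoredEvent`, `loadQueue`, and `saveQueue` are hypothetical names, and discarding malformed entries on load is an assumption about how invalid data is handled.

```typescript
// Subset of vscode.Memento (the type of context.globalState) used by this sketch.
interface MementoLike {
	get<T>(key: string, defaultValue: T): T
	update(key: string, value: unknown): PromiseLike<void>
}

// Hypothetical trimmed-down stored entry.
interface StoredEvent {
	id: string
	retryCount: number
	nextRetryAt: number
	priority: "high" | "normal"
}

const STORAGE_KEY = "telemetryRetryQueue"

// Runtime check so that corrupted or outdated entries are not trusted.
function isStoredEvent(value: unknown): value is StoredEvent {
	if (typeof value !== "object" || value === null) return false
	const v = value as Record<string, unknown>
	return (
		typeof v.id === "string" &&
		typeof v.retryCount === "number" &&
		typeof v.nextRetryAt === "number" &&
		(v.priority === "high" || v.priority === "normal")
	)
}

// Read the persisted queue, dropping anything that does not look like a queue entry.
function loadQueue(state: MementoLike): StoredEvent[] {
	const raw = state.get<unknown>(STORAGE_KEY, [])
	return Array.isArray(raw) ? raw.filter(isStoredEvent) : []
}

// Persist the whole queue under the single storage key.
function saveQueue(state: MementoLike, queue: StoredEvent[]): PromiseLike<void> {
	return state.update(STORAGE_KEY, queue)
}
```

In an extension, `context.globalState` could be passed as the `state` argument, since it provides both methods.
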
### Storage Format

```typescript
interface QueuedTelemetryEvent {
	id: string // Unique identifier
	event: TelemetryEvent // Original event data
	timestamp: number // When event was first queued
	retryCount: number // Number of retry attempts
	nextRetryAt: number // When to retry next
	priority: "high" | "normal" // Event priority
}
```

### Size Management

- Queue size is limited by `maxQueueSize` setting
- When limit is reached, oldest normal priority events are removed first
- High priority events are preserved longer
- Automatic cleanup of successfully sent events

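One way to read the size-management rules is as an eviction order: oldest normal priority events go first, and high priority events are only removed once no normal ones remain. The sketch below follows that reading; `enforceQueueLimit` and `Evictable` are hypothetical names, and evicting the oldest high priority events last is an assumption.

```typescript
// Hypothetical minimal shape needed to decide what to evict.
interface Evictable {
	id: string
	priority: "high" | "normal"
	timestamp: number
}

// Return a queue no longer than maxQueueSize, preserving the original order of survivors.
function enforceQueueLimit<T extends Evictable>(queue: T[], maxQueueSize: number): T[] {
	const excess = queue.length - maxQueueSize
	if (excess <= 0) return queue

	// Eviction order: normal before high, and oldest first within each priority.
	const evictionOrder = [...queue].sort((a, b) => {
		if (a.priority !== b.priority) return a.priority === "normal" ? -1 : 1
		return a.timestamp - b.timestamp
	})
	const evicted = new Set<T>(evictionOrder.slice(0, excess))
	return queue.filter((item) => !evicted.has(item))
}
```

Running this check each time an event is enqueued would keep the persisted array bounded by `maxQueueSize`.
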
## Retry Logic

### Exponential Backoff

Retry delays follow an exponential backoff pattern:

```
delay = min(baseDelayMs * 2^retryCount, maxDelayMs)
```

Example with default settings (baseDelayMs=1000ms, maxDelayMs=300000ms), where retryCount is 0 for the first retry:

- Retry 1: 1 second
- Retry 2: 2 seconds
- Retry 3: 4 seconds
- Retry 4: 8 seconds
- Retry 5: 16 seconds
- Further retries (only possible if the retry limit is raised above its default of 5): the delay keeps doubling (32, 64, 128, 256 seconds) and is capped at 5 minutes (maxDelayMs) from retry 10 onward

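The formula translates directly into code. In this sketch `computeRetryDelay` is a hypothetical helper name, and `retryCount` is taken to be 0 for the first retry, which is what the example delays above imply.

```typescript
// delay = min(baseDelayMs * 2^retryCount, maxDelayMs)
function computeRetryDelay(retryCount: number, baseDelayMs = 1000, maxDelayMs = 300000): number {
	return Math.min(baseDelayMs * Math.pow(2, retryCount), maxDelayMs)
}

// Delays in milliseconds for retries 1 through 10 with the default settings.
const defaultDelays = Array.from({ length: 10 }, (_, retryCount) => computeRetryDelay(retryCount))
console.log(defaultDelays.join(", "))
// prints "1000, 2000, 4000, 8000, 16000, 32000, 64000, 128000, 256000, 300000"
```

The cap only takes effect at the tenth retry, because 1000 ms * 2^8 = 256000 ms is still below the 300000 ms limit.
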
### Batch Processing

- Events are processed in batches to improve efficiency
- Default batch size: 10 events
- Batches are processed every 30 seconds
- Failed events in a batch are individually rescheduled

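Each 30-second processing pass needs to decide which events to send. A minimal sketch of that selection, assuming only events whose `nextRetryAt` has passed are eligible and that high priority events fill the batch first (`selectBatch` and `ScheduledEvent` are hypothetical names):

```typescript
// Hypothetical minimal shape needed for batch selection.
interface ScheduledEvent {
	id: string
	priority: "high" | "normal"
	timestamp: number
	nextRetryAt: number
}

// Pick up to batchSize due events: high priority first, then oldest first.
function selectBatch(queue: ScheduledEvent[], now: number, batchSize = 10): ScheduledEvent[] {
	return queue
		.filter((event) => event.nextRetryAt <= now)
		.sort((a, b) => {
			if (a.priority !== b.priority) return a.priority === "high" ? -1 : 1
			return a.timestamp - b.timestamp
		})
		.slice(0, batchSize)
}
```

Events left out of a batch simply stay in the queue and become candidates again on the next pass.
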
### Failure Handling

- Temporary failures (network errors): Event is rescheduled for retry
- Permanent failures (authentication errors): Event may be dropped
- Max retries exceeded: Event is removed from queue
- Invalid events: Event is dropped immediately

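The four cases above can be expressed as one decision function that runs after a retry attempt fails. This is a sketch under stated assumptions: `afterFailedRetry`, `FailureKind`, and `FailureOutcome` are hypothetical names, permanent failures are always dropped here even though the list above says only that they may be, and how the real client distinguishes failure kinds is not covered by this document.

```typescript
type FailureKind = "temporary" | "permanent" | "invalid"

type FailureOutcome =
	| { action: "reschedule"; retryCount: number; nextRetryAt: number }
	| { action: "drop"; reason: string }

interface BackoffConfig {
	maxRetries: number
	baseDelayMs: number
	maxDelayMs: number
}

// retryCount is the number of retries made before the attempt that just failed.
function afterFailedRetry(
	kind: FailureKind,
	retryCount: number,
	now: number,
	config: BackoffConfig,
): FailureOutcome {
	if (kind === "invalid") return { action: "drop", reason: "invalid event" }
	// Assumption: permanent failures are dropped unconditionally in this sketch.
	if (kind === "permanent") return { action: "drop", reason: "permanent failure" }

	const attempts = retryCount + 1
	if (attempts >= config.maxRetries) return { action: "drop", reason: "max retries exceeded" }

	// Temporary failure: schedule the next attempt with exponential backoff.
	const delay = Math.min(config.baseDelayMs * Math.pow(2, attempts), config.maxDelayMs)
	return { action: "reschedule", retryCount: attempts, nextRetryAt: now + delay }
}
```

With the default configuration, a first retry that fails at time `t` would be rescheduled for `t + 2000` ms, and the fifth failed retry would remove the event.
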
## User Interface

### Status Bar

When events are queued, a status bar item appears showing:

- Queue size
- Connection status (connected/disconnected)

Clicking the item opens the queue details.

### Notifications

When enabled, users receive notifications for:

- Prolonged disconnection (>5 minutes)
- Large queue buildup

Each notification offers the option to manually trigger a retry or to disable notifications.

### Commands

The following commands are available:

- `roo-code.telemetry.showQueue`: Display queue status and management options
- `roo-code.telemetry.retryNow`: Manually trigger retry processing
- `roo-code.telemetry.clearQueue`: Clear all queued events

## Monitoring

### Connection Status

The system tracks:

- `isConnected`: Current connection state
- `lastSuccessfulSend`: Timestamp of last successful telemetry send
- `consecutiveFailures`: Number of consecutive send failures

Connection is considered lost after 3 consecutive failures.

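The three tracked fields and the 3-failure threshold can be sketched as a small state transition. The field names come from the list above; `recordSuccess`, `recordFailure`, and `FAILURE_THRESHOLD` are hypothetical names, and resetting the failure counter on any success is an assumption.

```typescript
interface ConnectionStatus {
	isConnected: boolean
	lastSuccessfulSend: number
	consecutiveFailures: number
}

// Connection is considered lost after this many consecutive failures.
const FAILURE_THRESHOLD = 3

// A successful send restores the connection and resets the failure counter.
function recordSuccess(now: number): ConnectionStatus {
	return { isConnected: true, lastSuccessfulSend: now, consecutiveFailures: 0 }
}

// A failed send increments the counter and flips isConnected at the threshold.
function recordFailure(status: ConnectionStatus): ConnectionStatus {
	const consecutiveFailures = status.consecutiveFailures + 1
	return { ...status, consecutiveFailures, isConnected: consecutiveFailures < FAILURE_THRESHOLD }
}
```

Keeping the status immutable makes it straightforward to compare the previous and next state when deciding whether to update the status bar or show a notification.
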
### Metrics

Internal metrics tracked:

- Queue size over time
- Retry success/failure rates
- Average retry delays
- Event priority distribution

## Error Handling

### Graceful Degradation

- If retry queue initialization fails, telemetry continues without retry
- Storage errors are logged but don't prevent telemetry operation
- Invalid queue data is automatically cleaned up

### Error Logging

Errors are logged with appropriate levels:

- Warnings: Temporary failures, retry attempts
- Errors: Persistent failures, configuration issues
- Info: Successful operations, queue status changes

## Testing

### Unit Tests

Comprehensive test coverage includes:

- Queue operations (enqueue, dequeue, prioritization)
- Retry logic (exponential backoff, max retries)
- Storage persistence
- Configuration handling
- Error scenarios

### Integration Tests

- End-to-end telemetry flow with retry
- VSCode extension integration
- Configuration changes
- Network failure simulation

## Performance Considerations

### Memory Usage

- Queue size is limited to prevent unbounded growth
- Events are stored efficiently with minimal metadata
- Automatic cleanup of processed events

### CPU Usage

- Retry processing runs on a 30-second interval
- Batch processing minimizes overhead
- Exponential backoff reduces server load

### Network Usage

- Failed events are not retried immediately
- Batch processing reduces connection overhead
- Exponential backoff prevents server overload

## Security

### Data Protection

- Telemetry events may contain sensitive information
- Events are stored locally only
- No additional network exposure beyond normal telemetry

### Privacy

- Retry queue respects user telemetry preferences
- Queue is cleared when telemetry is disabled
- No additional data collection beyond original events

## Troubleshooting

### Common Issues

1. **Queue not working**: Check `telemetryRetryEnabled` setting
2. **Too many notifications**: Disable `telemetryRetryNotifications`
3. **Queue growing too large**: Reduce `telemetryRetryQueueSize`
4. **Slow retry processing**: Reduce `telemetryRetryBaseDelay`

### Debugging

Enable debug logging by setting the telemetry client debug flag:

```typescript
const client = new PostHogTelemetryClient(true) // Enable debug
```

### Queue Inspection

Use the command palette:

1. Open Command Palette (Ctrl/Cmd + Shift + P)
2. Run "Roo Code: Show Telemetry Queue"
3. View queue status and management options

## Migration

### Existing Installations

The retry queue is automatically enabled for existing installations with default settings. No user action is required.

### Upgrading

When upgrading from versions without retry queue:

- Existing telemetry behavior is preserved
- Retry queue is enabled with default settings
- Users can disable via settings if desired

## Future Enhancements

Potential future improvements:

- Configurable retry strategies (linear, custom)
- Queue analytics and reporting
- Network condition detection
- Intelligent batching based on connection quality
- Event compression for large queues

packages/cloud/src/CloudService.ts

Lines changed: 3 additions & 0 deletions
@@ -50,6 +50,9 @@ export class CloudService {
 
 			this.telemetryClient = new TelemetryClient(this.authService, this.settingsService)
 
+			// Initialize retry queue for cloud telemetry client
+			this.telemetryClient.initializeRetryQueue(this.context)
+
 			this.shareService = new ShareService(this.authService, this.settingsService, this.log)
 
 			try {
