
Conversation

Collaborator

@hannesrudolph hannesrudolph commented Jul 30, 2025

Summary

Implements a persistent retry queue for failed telemetry events to prevent data loss during network outages or connectivity issues. The queue persists events across VSCode restarts and retries with exponential backoff.

Fixes #4940

Changes

New Features

  • Added TelemetryQueue class with persistent storage using VSCode's globalState (see the sketch after this list)
  • Implemented exponential backoff retry logic (1s initial, 5min max)
  • Added queue size limits (1000 events max) with FIFO eviction
  • Integrated queue with both PostHog and Cloud telemetry clients
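
A minimal sketch of how these pieces could fit together, for illustration only. QueuedTelemetryEvent and addEvent(event, clientType) are names that appear in this PR; the globalState key, field set, and method bodies below are assumptions rather than the actual implementation:

    import type { ExtensionContext } from "vscode"

    interface QueuedTelemetryEvent {
        event: unknown // original telemetry payload
        clientType: "posthog" | "cloud"
        retryCount: number
        lastAttempt: number
        nextAttempt: number
    }

    // Sketch: globalState persistence, FIFO eviction, and exponential backoff scheduling.
    class TelemetryQueueSketch {
        private static readonly STORAGE_KEY = "telemetryRetryQueue" // assumed key
        private static readonly MAX_QUEUE_SIZE = 1000
        private static readonly INITIAL_BACKOFF_MS = 1_000 // 1 second
        private static readonly MAX_BACKOFF_MS = 300_000 // 5 minutes

        constructor(private readonly context: ExtensionContext) {}

        public async addEvent(event: unknown, clientType: "posthog" | "cloud"): Promise<void> {
            const queue = this.load()
            queue.push({ event, clientType, retryCount: 0, lastAttempt: 0, nextAttempt: Date.now() })
            // FIFO eviction: drop the oldest events once the cap is exceeded.
            while (queue.length > TelemetryQueueSketch.MAX_QUEUE_SIZE) {
                queue.shift()
            }
            await this.context.globalState.update(TelemetryQueueSketch.STORAGE_KEY, queue)
        }

        // After a failed retry, push the next attempt out with exponential backoff.
        public rescheduleAfterFailure(item: QueuedTelemetryEvent, now = Date.now()): void {
            item.retryCount++
            item.lastAttempt = now
            item.nextAttempt = now + this.backoffMs(item.retryCount)
        }

        // 1s, 2s, 4s, ... capped at 5 minutes.
        private backoffMs(retryCount: number): number {
            return Math.min(
                TelemetryQueueSketch.INITIAL_BACKOFF_MS * Math.pow(2, retryCount),
                TelemetryQueueSketch.MAX_BACKOFF_MS,
            )
        }

        private load(): QueuedTelemetryEvent[] {
            return this.context.globalState.get<QueuedTelemetryEvent[]>(TelemetryQueueSketch.STORAGE_KEY, [])
        }
    }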

Modified Files

  • packages/telemetry/src/TelemetryQueue.ts: Core queue implementation
  • packages/telemetry/src/BaseTelemetryClient.ts: Added queue integration and abstract captureWithRetry method (illustrated in the sketch after this list)
  • packages/telemetry/src/PostHogTelemetryClient.ts: Implemented retry logic and queue integration
  • packages/cloud/src/TelemetryClient.ts: Implemented retry logic and queue integration
  • packages/telemetry/src/TelemetryService.ts: Added queue initialization and management
  • src/extension.ts: Initialize queue on activation, flush on deactivation
  • packages/telemetry/src/index.ts: Export TelemetryQueue
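
A rough sketch of the client-side wiring listed above. setQueue and the abstract captureWithRetry come from this PR; the signatures and fallback logic shown here are assumptions:

    // Sketch: how a client might fall back to the queue when a send fails.
    interface RetryQueue {
        addEvent(event: unknown, clientType: "posthog" | "cloud"): Promise<void>
    }

    abstract class BaseTelemetryClientSketch {
        private queue?: RetryQueue

        public setQueue(queue: RetryQueue): void {
            this.queue = queue
        }

        // Subclasses (PostHog, Cloud) implement the actual network send and
        // resolve to false (or throw) when the send does not go through.
        protected abstract captureWithRetry(event: unknown): Promise<boolean>

        public async capture(event: unknown, clientType: "posthog" | "cloud"): Promise<void> {
            const sent = await this.captureWithRetry(event).catch(() => false)
            if (!sent) {
                // Instead of dropping the event, hand it to the retry queue.
                await this.queue?.addEvent(event, clientType)
            }
        }
    }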

Tests

  • Added comprehensive test coverage (99 tests total)
  • TelemetryQueue.test.ts: 20 tests for queue functionality
  • PostHogTelemetryClient.test.ts: Updated with queue integration tests
  • TelemetryService.test.ts: 23 tests including queue coordination
  • TelemetryClient.test.ts: Updated with queue integration tests

Testing

Automated Tests

All tests passing:

✓ TelemetryQueue: 20 tests
✓ PostHogTelemetryClient: 21 tests  
✓ TelemetryService: 23 tests
✓ Cloud TelemetryClient: 35 tests
Total: 99 tests passed
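
For illustration only (this is not the PR's test code), a test against the queue sketch shown earlier, assuming a vitest/jest-style API, a stubbed globalState, and the assumed "telemetryRetryQueue" key, might exercise the FIFO limit like this:

    import { expect, it } from "vitest" // assumed test runner

    // Sketch: FIFO eviction once the queue exceeds its size cap.
    it("drops the oldest events once the 1000-event cap is exceeded", async () => {
        const store = new Map<string, unknown>()
        const context = {
            globalState: {
                get: (key: string, fallback: unknown) => store.get(key) ?? fallback,
                update: async (key: string, value: unknown) => void store.set(key, value),
            },
        } as any

        const queue = new TelemetryQueueSketch(context)
        for (let i = 0; i < 1001; i++) {
            await queue.addEvent({ event: "test_event", properties: { i } }, "posthog")
        }

        const persisted = store.get("telemetryRetryQueue") as Array<{ event: { properties: { i: number } } }>
        expect(persisted).toHaveLength(1000)
        expect(persisted[0].event.properties.i).toBe(1) // event 0 was evicted first
    })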

Manual Testing Checklist

  • Test network failure recovery

    1. Disconnect network
    2. Trigger telemetry events
    3. Reconnect network
    4. Verify events are sent
  • Test extension restart with queued events

    1. Queue some events (disconnect network)
    2. Restart VSCode
    3. Verify queued events persist and retry
  • Test queue size limits

    1. Generate >1000 failed events
    2. Verify oldest events are dropped (FIFO)
  • Test graceful shutdown (see the timeout sketch after this checklist)

    1. Queue some events
    2. Close VSCode
    3. Verify shutdown completes within 5 seconds
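
The 5-second bound in the graceful-shutdown item above can be enforced by racing the flush against a timer. A minimal sketch, assuming some flush function is exposed (the actual shutdown API in this PR may differ):

    // Sketch: bound queue flushing at deactivation to roughly 5 seconds.
    async function shutdownWithTimeout(flush: () => Promise<void>, timeoutMs = 5_000): Promise<void> {
        let timer: NodeJS.Timeout | undefined
        const timeout = new Promise<void>((resolve) => {
            timer = setTimeout(resolve, timeoutMs)
        })
        try {
            // Whichever settles first wins, so a slow flush never blocks deactivation.
            await Promise.race([flush(), timeout])
        } finally {
            if (timer) {
                clearTimeout(timer)
            }
        }
    }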

Review Notes

Issues Fixed During Review

  1. Console Logging: Removed all console.warn/error statements from telemetry code
  2. Private API Access: Removed access to PostHog's private _events property
  3. Error Handling: Added proper error handling for queue initialization and shutdown
  4. Test Updates: Updated tests to match new behavior

Design Decisions

  • RefreshTimer Duplication: The timer/backoff logic was intentionally duplicated from packages/cloud/src/RefreshTimer.ts to avoid circular dependencies between packages. This could be refactored to a shared utility package in the future.

Success Criteria Met

✅ Events queued when network fails
✅ Retry with exponential backoff
✅ Queue persists across restarts
✅ Queue has size limits (1000 events)
✅ Graceful shutdown with timeout
✅ User visibility via getQueueStatus
✅ Minimal performance impact
✅ Comprehensive test coverage
✅ No breaking changes
✅ Proper error handling

Screenshots/Recordings

N/A - Backend telemetry feature

Related Issues

Checklist

  • Code follows project style guidelines
  • Self-review completed
  • Tests added/updated
  • No breaking changes
  • Documentation updated (if needed)
  • Addressed review feedback (internal review completed)

Important

Introduces a persistent retry queue for telemetry events with exponential backoff, integrating with existing telemetry clients and ensuring data persistence across VSCode restarts.

  • Behavior:
    • Adds TelemetryQueue class for persistent retry of telemetry events using VSCode's globalState.
    • Implements exponential backoff for retries (1s initial, 5min max).
    • Limits queue size to 1000 events with FIFO eviction.
    • Integrates with PostHogTelemetryClient and CloudTelemetryClient.
  • Files Modified:
    • TelemetryQueue.ts: Core queue implementation.
    • BaseTelemetryClient.ts, PostHogTelemetryClient.ts, TelemetryClient.ts: Add queue integration and retry logic.
    • TelemetryService.ts: Manages queue initialization and shutdown.
    • extension.ts: Initializes queue on activation, flushes on deactivation.
  • Tests:
    • Adds 99 tests across TelemetryQueue.test.ts, PostHogTelemetryClient.test.ts, TelemetryService.test.ts, and TelemetryClient.test.ts.
    • Tests cover queue functionality, integration, and error handling.

This description was created by Ellipsis for ea718de.

- Implement TelemetryQueue class with VSCode globalState persistence
- Add exponential backoff retry logic (1s to 5min)
- Queue failed events from PostHog and Cloud telemetry clients
- Implement queue size limits (1000 events max, ~1MB disk usage)
- Add graceful shutdown with 5-second timeout
- Ensure queue persists across extension restarts
- Add comprehensive test coverage (99 tests passing)
Copilot AI review requested due to automatic review settings July 30, 2025 23:25
@hannesrudolph hannesrudolph requested review from cte, jr and mrubens as code owners July 30, 2025 23:25
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. enhancement New feature or request labels Jul 30, 2025
Contributor

Copilot AI left a comment

Pull Request Overview

This PR implements a persistent retry queue for failed telemetry events to prevent data loss during network outages or connectivity issues. The queue persists events across VSCode restarts and implements exponential backoff for retries.

Key changes:

  • Added TelemetryQueue class with persistent storage and exponential backoff retry logic
  • Modified telemetry clients to integrate with the queue for failed events
  • Added comprehensive test coverage for queue functionality and client integration

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Summary per file:

  • src/extension.ts: Initializes telemetry queue on activation and flushes on deactivation
  • packages/telemetry/src/TelemetryQueue.ts: Core queue implementation with persistent storage and retry logic
  • packages/telemetry/src/TelemetryService.ts: Added queue initialization, management, and shutdown methods
  • packages/telemetry/src/BaseTelemetryClient.ts: Added abstract captureWithRetry method and queue integration
  • packages/telemetry/src/PostHogTelemetryClient.ts: Implemented retry logic and queue integration
  • packages/cloud/src/TelemetryClient.ts: Implemented retry logic and queue integration
  • packages/telemetry/src/index.ts: Exported TelemetryQueue class
  • Test files: Added comprehensive test coverage for queue and client integration

}

// Attempt to send using the client's retry method
return await client["captureWithRetry"](event.event)

Copilot AI Jul 30, 2025

Using bracket notation to access a protected method is a code smell. Consider making captureWithRetry public or providing a proper public method for retry operations.

Comment on lines +258 to +261
if (event.clientType === "posthog" && c.constructor.name === "PostHogTelemetryClient") {
return true
}
if (event.clientType === "cloud" && c.constructor.name === "TelemetryClient") {

Copilot AI Jul 30, 2025

Using constructor.name for type checking is fragile and can break with minification. Consider using instanceof checks or adding a type property to the clients.

Suggested change:

- if (event.clientType === "posthog" && c.constructor.name === "PostHogTelemetryClient") {
-     return true
- }
- if (event.clientType === "cloud" && c.constructor.name === "TelemetryClient") {
+ if (event.clientType === "posthog" && c.type === "posthog") {
+     return true
+ }
+ if (event.clientType === "cloud" && c.type === "cloud") {

Comment on lines +258 to +261
if (event.clientType === "posthog" && c.constructor.name === "PostHogTelemetryClient") {
return true
}
if (event.clientType === "cloud" && c.constructor.name === "TelemetryClient") {

Copilot AI Jul 30, 2025

Using constructor.name for type checking is fragile and can break with minification. Consider using instanceof checks or adding a type property to the clients.

Suggested change:

- if (event.clientType === "posthog" && c.constructor.name === "PostHogTelemetryClient") {
-     return true
- }
- if (event.clientType === "cloud" && c.constructor.name === "TelemetryClient") {
+ if (event.clientType === "posthog" && c instanceof PostHogTelemetryClient) {
+     return true
+ }
+ if (event.clientType === "cloud" && c instanceof TelemetryClient) {

// Distribute queue to all clients
this.clients.forEach((client) => {
if (client instanceof BaseTelemetryClient) {
client.setQueue(this.queue!)

Copilot AI Jul 30, 2025

Using the non-null assertion operator (!) can be dangerous. Consider checking if queue exists before using it or restructuring the code to ensure queue is always defined at this point.

Suggested change:

- client.setQueue(this.queue!)
+ if (this.queue) {
+     client.setQueue(this.queue);
+ } else {
+     console.error("Queue is not initialized. Unable to set queue for client.");
+ }

event.retryCount++
event.lastAttempt = now
// Calculate next attempt with exponential backoff
const backoffMs = Math.min(1000 * Math.pow(2, event.retryCount), 300000) // Max 5 minutes
Contributor

Consider replacing the literal values (1000 and 300000) in the exponential backoff calculation with the defined constants (TelemetryQueue.INITIAL_BACKOFF_MS and TelemetryQueue.MAX_BACKOFF_MS) to ensure consistency and easier maintenance.

// Set up retry callback
this.queue.setRetryCallback(async (event: QueuedTelemetryEvent) => {
// Find the appropriate client for this event
const client = this.clients.find((c) => {
Contributor

Relying on c.constructor.name to determine client type is brittle. Consider using an explicit identifier (or instanceof checks) to distinguish between PostHog and Cloud telemetry clients.

}

// Attempt to send using the client's retry method
return await client["captureWithRetry"](event.event)
Contributor

Accessing the protected method captureWithRetry using bracket notation (client["captureWithRetry"]) can be fragile. Consider exposing a dedicated public retry method or refactoring to avoid bypassing access modifiers.

Contributor

@roomote roomote bot left a comment

Thank you for implementing the persistent retry queue for telemetry events! This is a solid implementation that addresses the requirements from issue #4940. I've reviewed the changes and found several areas that could be improved to strengthen code quality and prevent potential issues.


// Start the timer if not already running
if (!this.isRunning) {
this.start()
Contributor

Is this intentional? The start() method is called without checking if processing is already in progress. Could we add a guard to prevent multiple timers from running simultaneously? This could lead to race conditions where multiple executeCallback() calls overlap.
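
One possible shape for such a guard, sketched here under the assumption that processing happens in a single tick method (not taken from the PR):

    // Sketch: prevent overlapping queue-processing runs from concurrent timer fires.
    class QueueProcessorSketch {
        private processing = false

        constructor(private readonly processQueue: () => Promise<void>) {}

        public async tick(): Promise<void> {
            if (this.processing) {
                return // a previous run is still in flight; skip this tick
            }
            this.processing = true
            try {
                await this.processQueue()
            } finally {
                this.processing = false
            }
        }
    }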

}

// Attempt to send using the client's retry method
return await client["captureWithRetry"](event.event)
Contributor

Could we improve the type safety here? Direct access to the private method captureWithRetry using bracket notation bypasses TypeScript's protection. Consider making this method protected or adding a public retry interface to BaseTelemetryClient.
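
One way to keep captureWithRetry protected while still giving the queue a typed entry point is a small public wrapper on the base class; a hedged sketch (names assumed):

    // Sketch: a public retry entry point, so callers never need client["captureWithRetry"].
    abstract class RetryableTelemetryClientSketch {
        protected abstract captureWithRetry(event: unknown): Promise<boolean>

        // The queue's retry callback can call this without bypassing access modifiers.
        public retryCapture(event: unknown): Promise<boolean> {
            return this.captureWithRetry(event)
        }
    }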

event.retryCount++
event.lastAttempt = now
// Calculate next attempt with exponential backoff
const backoffMs = Math.min(1000 * Math.pow(2, event.retryCount), 300000) // Max 5 minutes
Contributor

Is this backoff calculation intentionally different from the one on line 241? We have two different exponential backoff formulas: this one uses event.retryCount directly, while the other uses attemptCount. Could we standardize these to avoid confusion?

event.retryCount++
event.lastAttempt = now
// Calculate next attempt with exponential backoff
const backoffMs = Math.min(1000 * Math.pow(2, event.retryCount), 300000) // Max 5 minutes
Contributor

Could we extract this backoff calculation into a helper method? The same formula Math.min(1000 * Math.pow(2, event.retryCount), 300000) is repeated on line 141. This would improve maintainability and reduce the chance of inconsistencies.
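
A sketch of what that helper might look like, using the constants already defined on the class (the method name and placement are assumptions):

    // Sketch: one home for the backoff formula that is currently duplicated.
    class TelemetryQueueBackoffSketch {
        private static readonly INITIAL_BACKOFF_MS = 1_000 // 1 second
        private static readonly MAX_BACKOFF_MS = 300_000 // 5 minutes

        public static backoffMs(retryCount: number): number {
            return Math.min(
                TelemetryQueueBackoffSketch.INITIAL_BACKOFF_MS * Math.pow(2, retryCount),
                TelemetryQueueBackoffSketch.MAX_BACKOFF_MS,
            )
        }
    }

    // Both call sites would then reduce to:
    // event.nextAttempt = now + TelemetryQueueBackoffSketch.backoffMs(event.retryCount)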

* @param event The telemetry event that failed to send
* @param clientType The client type that should process this event
*/
public async addEvent(event: TelemetryEvent, clientType: "posthog" | "cloud"): Promise<void> {
Contributor

Could we add validation for the event parameter? There's no check that event is not null/undefined before processing, which could cause runtime errors if called incorrectly.

this.queue = new TelemetryQueue(context)

// Set up retry callback
this.queue.setRetryCallback(async (event: QueuedTelemetryEvent) => {
Contributor

Is this closure pattern intentional? The retry callback creates a closure over the entire service instance, which could prevent garbage collection. Could we consider using a WeakRef or extracting this to a separate method to reduce memory retention?

private static readonly MAX_QUEUE_SIZE = 1000
private static readonly MAX_RETRY_COUNT = 10
private static readonly SUCCESS_INTERVAL = 30000 // 30 seconds
private static readonly INITIAL_BACKOFF_MS = 1000 // 1 second
Contributor

Could we add documentation for these constants? Values like 1000 and 300000 would benefit from comments explaining they represent 1 second and 5 minutes respectively. Consider making them configurable if different retry strategies might be needed in the future.
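
For illustration, the snippet above with JSDoc added (the wording, and the inferred purposes of MAX_RETRY_COUNT and SUCCESS_INTERVAL, are assumptions to be checked against the implementation):

    // Sketch: the same constants, documented in place (these lines sit inside the TelemetryQueue class).
    /** Maximum number of queued events before the oldest are evicted (FIFO). */
    private static readonly MAX_QUEUE_SIZE = 1000
    /** Stop retrying an event after this many attempts (inferred). */
    private static readonly MAX_RETRY_COUNT = 10
    /** 30 seconds; timer interval used after successful sends (inferred). */
    private static readonly SUCCESS_INTERVAL = 30000
    /** First retry delay: 1 second, doubling per failure up to MAX_BACKOFF_MS (5 minutes). */
    private static readonly INITIAL_BACKOFF_MS = 1000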

/**
* Starts the queue processing timer
*/
public start(): void {
Contributor

Could we add JSDoc documentation for this public method? It would help other developers understand when and why to call start(), especially since it's called automatically in addEvent().

@hannesrudolph hannesrudolph added the Issue/PR - Triage New issue. Needs quick review to confirm validity and assign labels. label Jul 30, 2025
@hannesrudolph hannesrudolph moved this from Triage to PR [Draft / In Progress] in Roo Code Roadmap Jul 31, 2025
@hannesrudolph hannesrudolph added PR - Draft / In Progress and removed Issue/PR - Triage New issue. Needs quick review to confirm validity and assign labels. labels Jul 31, 2025
@github-project-automation github-project-automation bot moved this from New to Done in Roo Code Roadmap Aug 1, 2025
@github-project-automation github-project-automation bot moved this from PR [Draft / In Progress] to Done in Roo Code Roadmap Aug 1, 2025
@hannesrudolph hannesrudolph deleted the fix/issue-4940-telemetry-retry-queue branch August 1, 2025 18:14

Labels

enhancement New feature or request PR - Draft / In Progress size:XXL This PR changes 1000+ lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

Add persistent retry queue for failed telemetry events to Roo Code Cloud
