
feat(browser): proof of concept — IndexedDB offline dead letter queue#3314

Draft
marquesds wants to merge 2 commits into PostHog:main from
marquesds:feat/offline-dlq


marquesds commented Apr 1, 2026

Problem

Users of offline-first or intermittently-connected applications permanently lose analytics events when the in-memory retry queue exhausts its 10 attempts while the tab is still open. For field/mission apps — where users operate offline for extended periods — this means most captured events never reach PostHog.

Reported in #1583.

Changes

This is a proof of concept. The goal is to validate the approach, get early feedback on the design, and identify any concerns before investing in hardening, staged rollout, and Phase 2/3 enhancements.

This PR introduces an IndexedDB-backed dead letter queue (DLQ) that catches events after the existing RetryQueue gives up, persists them to disk, and re-sends them on the next page load or when the browser comes back online.

Architecture

flowchart TD
    A["posthog.capture()"] --> B["RequestQueue\n(batches events)"]
    B --> C["RetryQueue\n(up to 10 retries with\nexponential backoff)"]

    C -->|"✅ 200 OK"| D["Event delivered\nto PostHog"]
    C -->|"4xx client error"| E["Dropped\n(not retriable)"]
    C -->|"❌ 10 retries exhausted\n(5xx / network error)"| F{{"enable_offline_dlq?"}}

    F -->|false| G["🗑️ Event lost\n(current behavior)"]
    F -->|true| H["_onFlushFailure()"]

    H --> I[("IndexedDB\nOfflineDlq\n(posthog_dlq)")]

    J["🌐 Browser 'online' event"] --> K["_drainDlq()"]
    L["📄 Page load\n(3s delay)"] --> K

    K --> M{"is_capturing()?\n(consent check)"}
    M -->|"no (opted out)"| N["🗑️ Clear DLQ"]
    M -->|"yes"| O["evictExpired()\nenforceMaxEntries()"]
    O --> P["Read all → batch → send\nvia _send_retriable_request()"]
    P -->|"✅ success"| Q["Delete sent events\nfrom IndexedDB"]
    P -->|"❌ failure"| R["Stop drain\n(retry next trigger)"]

    style I fill:#f9e6a0,stroke:#d4a017
    style D fill:#d4edda,stroke:#28a745
    style G fill:#f8d7da,stroke:#dc3545
    style N fill:#f8d7da,stroke:#dc3545

What's included

New file: packages/browser/src/dlq.ts, containing the OfflineDlq class (~280 lines)

  • IndexedDB store (posthog_dlq database, events object store, uuid keyPath)
  • open(), write(), readAll(), delete(), evictExpired(), enforceMaxEntries(), clear(), close()
  • versionchange listener for graceful close when another tab upgrades the schema
  • Single-retry re-open when the DB connection is invalidated

Config options (added to PostHogConfig in @posthog/types)

  • enable_offline_dlq — opt-in toggle, defaults to false (zero risk for existing users)
  • dlq_max_age_hours (default 24), dlq_max_entries (default 1000)
  • dlq_drain_on_online (default true), dlq_drain_on_load (default true)
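
To make the defaults concrete, here is a minimal sketch of how these options could be resolved. The `OfflineDlqOptions` type and the `resolveDlqOptions` helper are hypothetical names used only for illustration, not code from this PR; the option names and default values are the ones listed above.

```typescript
// Hypothetical shape for the DLQ options. Option names and defaults
// come from the list above; the type and helper are assumptions.
interface OfflineDlqOptions {
    enable_offline_dlq: boolean
    dlq_max_age_hours: number
    dlq_max_entries: number
    dlq_drain_on_online: boolean
    dlq_drain_on_load: boolean
}

const DLQ_DEFAULTS: OfflineDlqOptions = {
    enable_offline_dlq: false, // opt-in only
    dlq_max_age_hours: 24,
    dlq_max_entries: 1000,
    dlq_drain_on_online: true,
    dlq_drain_on_load: true,
}

// Hypothetical helper: fill in defaults for anything the caller omitted.
function resolveDlqOptions(overrides: Partial<OfflineDlqOptions> = {}): OfflineDlqOptions {
    return { ...DLQ_DEFAULTS, ...overrides }
}

// A field app that tolerates three days offline might opt in like this.
const resolved = resolveDlqOptions({ enable_offline_dlq: true, dlq_max_age_hours: 72 })
```

Options the caller does not set keep their defaults, so `resolved.dlq_max_entries` stays at 1000 here.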

SDK integration (in posthog-core.ts and retry-queue.ts)

  • DLQ initializes after _retryQueue in _init(), gated by _shouldEnableDlq()
  • _onFlushFailure() called from RetryQueue when retries exhaust — writes failed events to DLQ
  • _drainDlq() with concurrent-drain guard, consent re-check, Web Locks (progressive), batched send
  • opt_out_capturing() clears the DLQ to respect privacy
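
For reviewers who want the drain flow at a glance, here is a simplified sketch of the concurrent-drain guard, consent re-check, and batched send. Every name in it (`DlqDrainer`, `DlqStore`, `batchSize`, and so on) is a hypothetical stand-in rather than the PR's actual API, and it leaves out eviction and Web Locks to stay short.

```typescript
// Hypothetical types; the real OfflineDlq interface may differ.
interface DlqEntry { uuid: string; data: unknown; stored_at: number }

interface DlqStore {
    readAll(): Promise<DlqEntry[]>
    delete(uuids: string[]): Promise<void>
    clear(): Promise<void>
}

class DlqDrainer {
    private draining = false
    drainsStarted = 0 // exposed only so the sketch can be observed
    private store: DlqStore
    private isCapturing: () => boolean
    private send: (batch: DlqEntry[]) => Promise<boolean>
    private batchSize: number

    constructor(
        store: DlqStore,
        isCapturing: () => boolean,
        send: (batch: DlqEntry[]) => Promise<boolean>,
        batchSize = 50 // assumed value, not from the PR
    ) {
        this.store = store
        this.isCapturing = isCapturing
        this.send = send
        this.batchSize = batchSize
    }

    async drain(): Promise<void> {
        // Concurrent-drain guard: a trigger that arrives mid-drain is a no-op.
        if (this.draining) return
        this.draining = true
        this.drainsStarted++
        try {
            // Consent re-check: after an opt-out, clear instead of sending.
            if (!this.isCapturing()) {
                await this.store.clear()
                return
            }
            const entries = await this.store.readAll()
            for (let i = 0; i < entries.length; i += this.batchSize) {
                const batch = entries.slice(i, i + this.batchSize)
                const ok = await this.send(batch)
                if (!ok) return // stop; the next trigger retries what is left
                await this.store.delete(batch.map((e) => e.uuid))
            }
        } finally {
            this.draining = false
        }
    }
}

// In-memory store used only to exercise the sketch.
const memory: DlqEntry[] = [{ uuid: 'a', data: {}, stored_at: 0 }]
const memoryStore: DlqStore = {
    readAll: async () => memory.slice(),
    delete: async (uuids) => {
        for (const id of uuids) {
            const i = memory.findIndex((e) => e.uuid === id)
            if (i >= 0) memory.splice(i, 1)
        }
    },
    clear: async () => { memory.length = 0 },
}

const drainer = new DlqDrainer(memoryStore, () => true, async () => true)
void drainer.drain()
void drainer.drain() // ignored: the first drain is still in flight
```

Events are deleted only after their batch is acknowledged, so a failed send leaves them in place for the next `online` or page-load trigger.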

Tests

  • 19 unit tests for OfflineDlq (using fake-indexeddb)
  • Existing retry-queue.test.ts updated and passing

What's NOT included (future phases)

  • Optimistic journal mode (write to IDB before flush for tab-close durability)
  • Remote config / feature flag gating for server-side rollout control
  • $$dlq_summary internal metric event
  • DLQ writes on unload (IDB is async, unreliable during teardown — sendBeacon remains the safety net)

Key design decisions

  • enable_offline_dlq: false by default — fully implemented but opt-in only until validated
  • Hooks at retry exhaustion, not first failure — the existing retry queue handles transient failures well; DLQ catches what survives 10 retries
  • No DLQ on unload — IDB transactions may not commit before page teardown; sendBeacon is the correct mechanism
  • store.put() for idempotent writes — natural dedup by UUID, simpler than store.add() + ConstraintError handling
  • Static import (not dynamic import()) — required for Rollup IIFE output compatibility
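
To illustrate the `store.put()` decision together with the two eviction methods, here is a minimal in-memory sketch. A `Map` keyed by UUID stands in for the IndexedDB object store; the function bodies and the oldest-first trimming policy are assumptions for illustration, not the PR's implementation.

```typescript
interface StoredEvent { uuid: string; data: unknown; stored_at: number }

// Stand-in for the object store. Map.set mirrors store.put():
// writing an existing key overwrites it instead of raising an error.
const dlqStore = new Map<string, StoredEvent>()

function put(event: StoredEvent): void {
    dlqStore.set(event.uuid, event)
}

// Remove entries older than maxAgeHours relative to `now` (ms since epoch).
function evictExpired(maxAgeHours: number, now: number): number {
    const cutoff = now - maxAgeHours * 60 * 60 * 1000
    let removed = 0
    dlqStore.forEach((event, uuid) => {
        if (event.stored_at < cutoff) {
            dlqStore.delete(uuid)
            removed++
        }
    })
    return removed
}

// Keep at most maxEntries, dropping the oldest first (assumed policy).
function enforceMaxEntries(maxEntries: number): number {
    const excess = dlqStore.size - maxEntries
    if (excess <= 0) return 0
    const oldestFirst = Array.from(dlqStore.values()).sort((a, b) => a.stored_at - b.stored_at)
    for (const event of oldestFirst.slice(0, excess)) dlqStore.delete(event.uuid)
    return excess
}

const HOUR = 60 * 60 * 1000
put({ uuid: 'a', data: { n: 1 }, stored_at: 0 })
put({ uuid: 'a', data: { n: 1 }, stored_at: 0 }) // same uuid: still one entry
put({ uuid: 'b', data: { n: 2 }, stored_at: 30 * HOUR })
put({ uuid: 'c', data: { n: 3 }, stored_at: 31 * HOUR })

const sizeAfterWrites = dlqStore.size
const expired = evictExpired(24, 32 * HOUR) // 'a' is older than 24h at this point
const trimmed = enforceMaxEntries(1) // drops 'b', the older of the remaining two
```

The upsert is what makes duplicate writes harmless, but it also means the stored timestamp is replaced on every write of the same UUID.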

Release info Sub-libraries affected

Libraries affected

  • All of them
  • posthog-js (web)
  • posthog-js-lite (web lite)
  • posthog-node
  • posthog-react-native
  • @posthog/react
  • @posthog/ai
  • @posthog/convex
  • @posthog/next
  • @posthog/nextjs-config
  • @posthog/nuxt
  • @posthog/rollup-plugin
  • @posthog/webpack-plugin
  • @posthog/types

Checklist

  • Tests for new code
  • Accounted for the impact of any changes across different platforms
  • Accounted for backwards compatibility of any changes (no breaking changes!)
  • Took care not to unnecessarily increase the bundle size

If releasing new changes

  • Ran pnpm changeset to generate a changeset file
  • Added the "release" label to the PR to indicate we're publishing new versions for the affected packages


vercel bot commented Apr 1, 2026

@marquesds is attempting to deploy a commit to the PostHog Team on Vercel.

A member of the Team first needs to authorize it.

@marquesds marquesds marked this pull request as draft April 1, 2026 11:13

greptile-apps bot commented Apr 1, 2026

Prompt To Fix All With AI
This is a comment left during a code review.
Path: packages/browser/src/posthog-core.ts
Line: 1180-1196

Comment:
**`stored_at` reset bypasses `dlq_max_age_hours` TTL**

When `_onFlushFailure` writes an already-in-DLQ event back (after a drain attempt exhausts its own retries), it sets `stored_at: Date.now()`. Because `write()` uses `store.put()` (upsert by UUID), the existing entry is silently replaced with a fresh timestamp. As a result, `evictExpired()` will never evict the event — its age keeps resetting with every failed drain, making `dlq_max_age_hours` effectively meaningless for events that fail repeatedly.

The original `stored_at` should be preserved when re-writing an event that is already in the DLQ. One option is to pass the original `stored_at` through the request options (e.g., as a custom field on the queued request) so `_onFlushFailure` can use it instead of `Date.now()`. Another option is to read the existing entry from the DLQ before writing and keep the oldest `stored_at`.

```typescript
// instead of always using now:
events.push({ uuid, data: item, stored_at: item._dlq_stored_at ?? now })
```
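
The second option (read the existing entry and keep the oldest `stored_at`) could look roughly like the sketch below. `DlqRecord` and `mergeForRewrite` are hypothetical names, and the lookup of the existing entry is assumed to happen elsewhere before the upsert.

```typescript
interface DlqRecord { uuid: string; data: unknown; stored_at: number }

// Hypothetical merge step applied before the upsert: keep the earliest
// stored_at so repeated failed drains cannot extend an event's lifetime.
function mergeForRewrite(existing: DlqRecord | undefined, incoming: DlqRecord): DlqRecord {
    if (!existing) return incoming
    return { ...incoming, stored_at: Math.min(existing.stored_at, incoming.stored_at) }
}

// First write at t=1000, then a re-write after a failed drain at t=5000.
const firstWrite = mergeForRewrite(undefined, { uuid: 'e1', data: {}, stored_at: 1000 })
const rewritten = mergeForRewrite(firstWrite, { uuid: 'e1', data: {}, stored_at: 5000 })
```

With this merge in place, `rewritten.stored_at` remains the original timestamp, so TTL eviction still applies to events that fail repeatedly.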

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: packages/browser/src/posthog-core.ts
Line: 1086-1095

Comment:
**`online` event listener is never removed**

`addEventListener(window, 'online', callback)` registers an anonymous arrow function that has no handle, so it can never be deregistered. If `_initDlq()` were ever called more than once (e.g., after a `reset()` + re-`init()`), multiple identical listeners would accumulate, each scheduling a drain on every connectivity change.

Consider storing the listener reference so it can be removed if needed, or guard `_initDlq` with a flag that prevents re-registration:

```typescript
const onlineHandler = () => setTimeout(() => this._drainDlq(), 1000)
addEventListener(window, 'online', onlineHandler)
// store this._dlqOnlineHandler = onlineHandler for future removeEventListener
```
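
A slightly fuller sketch of the same idea, combining the stored handler with an idempotent attach, is shown below. `DlqOnlineHook` and `OnlineTarget` are hypothetical names, the one-second delay from the snippet above is omitted for brevity, and a small fake target replaces `window` so the example runs outside a browser.

```typescript
// Minimal subset of EventTarget that the sketch needs.
interface OnlineTarget {
    addEventListener(type: string, listener: () => void): void
    removeEventListener(type: string, listener: () => void): void
}

class DlqOnlineHook {
    private handler: (() => void) | null = null
    private target: OnlineTarget
    private onOnline: () => void

    constructor(target: OnlineTarget, onOnline: () => void) {
        this.target = target
        this.onOnline = onOnline
    }

    // Idempotent: calling attach() again does not add a second listener.
    attach(): void {
        if (this.handler) return
        this.handler = () => this.onOnline()
        this.target.addEventListener('online', this.handler)
    }

    // The stored reference makes removal possible.
    detach(): void {
        if (!this.handler) return
        this.target.removeEventListener('online', this.handler)
        this.handler = null
    }
}

// Fake target that records listeners, standing in for `window`.
const listeners: Array<() => void> = []
const fakeWindow: OnlineTarget = {
    addEventListener: (_type, listener) => { listeners.push(listener) },
    removeEventListener: (_type, listener) => {
        const i = listeners.indexOf(listener)
        if (i >= 0) listeners.splice(i, 1)
    },
}

let drains = 0
const hook = new DlqOnlineHook(fakeWindow, () => { drains++ })
hook.attach()
hook.attach() // second init: no extra listener
listeners.forEach((listener) => listener()) // simulate one 'online' event
const drainsAfterOnline = drains
hook.detach()
const listenersAfterDetach = listeners.length
```

One connectivity change then schedules one drain, regardless of how many times initialization runs.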

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: packages/browser/src/__tests__/dlq.test.ts
Line: 108-115

Comment:
**Test name contradicts the implementation**

The test description "deduplicates via ConstraintError" implies a `ConstraintError` thrown by `store.add()`, but the implementation uses `store.put()`, which silently overwrites duplicate keys without any error. The test correctly verifies the deduplication behaviour, but the name will mislead future readers into thinking there is `add()`-based error handling.

```suggestion
        it('deduplicates via put upsert', async () => {
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "chore: add changeset for offline DLQ fea..."
