chore(scripts/migration): resumable Mixpanel raw-event export#7197

Open

mdmohsin7 wants to merge 2 commits into main from rex/mixpanel-export-script

Conversation

@mdmohsin7
Member

Summary

  • Adds scripts/migration/mixpanel_export.sh — streams data.mixpanel.com/api/2.0/export one UTC day at a time into gzipped JSONL chunks, paced under the 60-queries/hour rate limit. Designed for unattended overnight runs that can be safely interrupted and resumed.
  • Adds scripts/migration/README.md — operational guide: env vars, output layout, resume semantics, failure-handling matrix, run command (tmux), post-run verification.

Operational tooling for the Mixpanel → PostHog migration follow-on. PostHog's managed migration takes service-account creds plus a date range; this script gives us a recoverable local archive of the same data and feeds the same JSONL/S3 bundle PostHog's importer expects. Not imported by any service.

Why a script and not just curl

Manager directive: the export will run for ~14h unattended for the full Mar 2024 → today window (~793 daily chunks at the documented rate-limit floor). Safe resume on any failure mode (Ctrl+C, kill -9, OOM, VM reboot, network drop, 5xx) is the primary requirement.

Resume semantics drive off disk state, not the manifest:

  • Day "done" iff its events-YYYY-MM-DD.jsonl.gz exists and either is 0 bytes (legitimately empty day sentinel) or passes gzip -t.
  • Atomic writes: stream into .tmp, rename only after successful gzip + integrity. Crash mid-stream → .tmp is discarded, day re-fetched.
  • Manifest is informational. Even if the script crashes between rename and manifest append, the next run sees the file on disk and skips correctly.
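A minimal sketch of the disk-state check described above, assuming the helper name `is_day_done` (the name appears in the bot review's flowchart; the script's actual implementation may differ in detail):

```shell
# A day is "done" iff its chunk file exists and is either the 0-byte
# empty-day sentinel or a gzip archive that passes an integrity check.
is_day_done() {
  local f="$1"
  [[ -f "$f" ]] || return 1     # never fetched (a crash leaves only a .tmp)
  [[ -s "$f" ]] || return 0     # 0 bytes: legitimately empty day
  gzip -t "$f" 2>/dev/null      # non-empty: must be intact gzip
}
```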

Failure handling

| Class | Behavior |
| --- | --- |
| 200 non-empty | save chunk, manifest entry, advance |
| 200 empty | 0-byte sentinel + `empty=true` manifest entry |
| 429 | exp backoff, retry same day, does not consume retry budget |
| 5xx / timeout | exp backoff (60→120→240→480→960s), max 5 attempts; failure logged to `failures.jsonl` and auto-retried on next run |
| 401 / 403 | hard abort (creds bad) |
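The 5xx/timeout row doubles a 60 s base across five attempts; a runnable sketch of that schedule (with `echo` standing in for the real sleep):

```shell
# Prints the documented 5xx/timeout backoff schedule: 60, 120, 240, 480, 960s.
print_backoff_schedule() {
  local backoff=60 attempt
  for attempt in 1 2 3 4 5; do
    echo "attempt ${attempt}: back off ${backoff}s"
    backoff=$(( backoff * 2 ))
  done
}
print_backoff_schedule
```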

Test plan

  • Syntax check: `bash -n scripts/migration/mixpanel_export.sh` (passes locally).
  • Dry-run shape: `MP_SERVICE_USER=x MP_SERVICE_SECRET=y MP_PROJECT_ID=0 MP_OUT=/tmp/mp MP_END=$(date -u -d 'yesterday' +%F) MP_START=$MP_END bash scripts/migration/mixpanel_export.sh`; expect an auth-probe failure (401) and a clean exit with the lock released.
  • Once ops drops the real service-account creds: kick off in tmux on this VM, monitor the `manifest.jsonl` line count, and verify resume by Ctrl+C and re-launch mid-run.
  • Post-run: `jq -s 'map(.lines // 0) | add'` on the manifest should approach the Mixpanel `$any_event` total for the same window (small drift is OK from late-arriving events).
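The manifest-sum check from the last bullet can be tried against a hand-written manifest. The `lines` and `empty` field names follow the failure-handling table above, but treat the exact schema as an assumption:

```shell
# Empty-day entries carry no "lines" field; jq's `// 0` counts them as zero.
tmp=$(mktemp -d)
cat > "$tmp/manifest.jsonl" <<'EOF'
{"day":"2024-03-01","lines":120}
{"day":"2024-03-02","empty":true}
{"day":"2024-03-03","lines":80}
EOF
jq -s 'map(.lines // 0) | add' "$tmp/manifest.jsonl"   # -> 200
```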

Out of scope

  • The existing /scripts/ gitignore rule blocks files under scripts/. Force-added with -f to match how the existing tracked scripts in scripts/ got there. Not changing the gitignore in this PR — that's a separate cleanup.

mdmohsin7 and others added 2 commits May 6, 2026 14:50
Streams data.mixpanel.com/api/2.0/export one UTC day at a time, pacing
under the 60 queries/hour rate limit, into gzipped JSONL chunks for
handoff to PostHog managed migration (or cold archive of the historical
Mixpanel record before decommission).

Designed for unattended overnight runs:

- flock guard against concurrent invocations
- atomic per-day write (.tmp -> rename only after successful gzip)
- resume drives off disk state, not the manifest, so any crash mode
  (Ctrl+C, kill -9, OOM, VM reboot) leaves at most one in-flight day to
  redo
- 5-attempt exponential backoff on 5xx/timeout, infinite backoff on 429
  (does not consume retry budget), hard abort on 401/403
- 0-byte sentinel distinguishes legitimately empty days from
  unfetched ones
- optional GCS mirror per chunk via gsutil

Auth needs a Mixpanel service account (Org Settings -> Service
Accounts) with Consumer role on the target project. README documents
the full operational flow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents env vars, output layout, resume semantics, failure-handling
matrix, run command (tmux), and post-run verification queries
(coverage gaps, manifest totals vs Mixpanel insights).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@greptile-apps
Contributor

greptile-apps Bot commented May 6, 2026

Greptile Summary

Adds a new scripts/migration/mixpanel_export.sh script and companion README.md for a one-shot, resumable Mixpanel → JSONL/GCS archive export intended to feed PostHog's managed migration importer. The resume logic (disk state as source of truth, atomic `.tmp` → `mv` writes, `gzip -t` integrity checks) is well designed, but there is a defect in the 429 rate-limit handler, and the 0-byte empty-day sentinels conflict with the `.jsonl.gz` extension used for GCS uploads and with the README's verification steps.

  • Unbounded 429 backoff: attempts is always decremented on a 429 so the inner retry loop never exits normally; backoff doubles without a cap, reaching ~17 h per sleep after just 10 consecutive 429s — this would stall the ~14 h overnight run entirely.
  • 0-byte sentinel GCS uploads: empty days produce a 0-byte file (not a valid gzip archive) that is then gsutil cp'd to GCS under a .jsonl.gz name; any gzip -d/zcat call by PostHog's importer or the README's own verification commands will fail on these files.

Confidence Score: 3/5

The script is safe to merge as a non-production migration tool, but should not be run against real credentials until the 429 backoff cap is added — a sustained rate-limit burst would freeze the entire unattended export without any recovery path.

The 429 handler decrements attempts on every rate-limit response while the backoff doubles without a ceiling, meaning a brief but sustained rate-limiting window turns the 14-hour overnight run into an indefinite hang. This is a real defect on the primary execution path. The 0-byte GCS upload issue is a secondary concern that surfaces only when MP_GCS_BUCKET is set and PostHog's importer encounters empty-day files.

scripts/migration/mixpanel_export.sh — specifically the 429 branch in fetch_one (lines 188–194) and the GCS upload guard (lines 221–226).

Important Files Changed

| Filename | Overview |
| --- | --- |
| `scripts/migration/mixpanel_export.sh` | New resumable Mixpanel export script with solid atomic-write/resume semantics, but has an unbounded exponential backoff on 429 responses (no cap, attempts always decremented) that can stall an overnight run indefinitely, and uploads 0-byte empty-day sentinels to GCS as `.jsonl.gz` files that will fail decompression. |
| `scripts/migration/README.md` | Operational guide for the export script; well-structured with env-var reference, resume semantics, and failure-handling matrix, but the post-run verification `zcat` examples will fail for 0-byte empty-day sentinel files without a guard. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([Start]) --> B{Pre-flight checks\ndisk space + auth probe}
    B -->|fail| Z1([exit 1/2])
    B -->|pass| C[Build todo list\nskip is_day_done days]
    C --> D{todo empty?}
    D -->|yes| Z2([exit 0])
    D -->|no| E[fetch_one d]

    E --> F{curl response}
    F -->|200 empty| G[Write 0-byte sentinel\nappend manifest]
    F -->|200 non-empty| H[gzip raw to gz.tmp\nmv to final file\nappend manifest]
    F -->|429| I[sleep backoff\nbackoff x2 no cap\nattempts-- no limit]
    F -->|401/403| Z3([exit 2])
    F -->|5xx/timeout| J[sleep backoff\nbackoff x2\nattempts++]

    I --> F
    J --> K{attempts >= 5?}
    K -->|no| F
    K -->|yes| L[log GIVEUP\nwrite failures.jsonl]

    G --> M{MP_GCS_BUCKET set?}
    H --> M
    L --> N[failed++]
    M -->|yes| O[gsutil cp 0-byte files also uploaded]
    M -->|no| P[pace sleep]
    O --> P
    N --> P

    P --> Q{more days?}
    Q -->|yes| E
    Q -->|no| R[Print summary]
    R --> S{failed > 0?}
    S -->|yes| Z4([exit 1])
    S -->|no| Z5([exit 0])
```

Comments Outside Diff (3)

  1. scripts/migration/mixpanel_export.sh, line 188-194 (link)

    P1 The backoff variable doubles without a ceiling on every 429, and because attempts is always decremented the loop can spin indefinitely. After 10 consecutive 429s the accumulated sleep is already ~17 hours, completely stalling the intended overnight run. The README says "exp backoff (60→120→240→480→960s)", but nothing in the code enforces that ceiling for the 429 path. Adding `backoff=$(( backoff > 900 ? backoff : backoff * 2 ))` stops the doubling at 960 s (16 min).
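    The suggested cap can be sanity-checked in isolation; under the expression above, the wait stops growing once it exceeds 900 s (a minimal sketch, not the script's actual retry loop):

    ```shell
    # Simulate 10 consecutive 429s under the proposed capped doubling:
    # 60 -> 120 -> 240 -> 480 -> 960, then it holds at 960s.
    backoff=60
    for i in 1 2 3 4 5 6 7 8 9 10; do
      backoff=$(( backoff > 900 ? backoff : backoff * 2 ))
    done
    echo "${backoff}"
    ```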

  2. scripts/migration/mixpanel_export.sh, line 221-226 (link)

    P2 0-byte sentinel files uploaded to GCS as .jsonl.gz

    When fetch_one returns success for an empty day, `$f` is a 0-byte file — not a valid gzip archive. The GCS upload path checks only `[[ -f "$f" ]]`, so it uploads these 0-byte sentinels with a `.jsonl.gz` extension. Any downstream consumer calling `gzip -d` or `zcat` on them (including PostHog's importer and the README's own verification examples) will fail with "not in gzip format". Consider skipping the upload for 0-byte files: `[[ -s "$f" ]]` instead of `[[ -f "$f" ]]`.
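    A throwaway demonstration of the suggested guard (`echo` stands in for the real `gsutil cp`; file names are illustrative):

    ```shell
    # [[ -s ]] is true only for non-empty files, so the 0-byte
    # empty-day sentinel is skipped while real chunks still upload.
    tmp=$(mktemp -d)
    touch "$tmp/events-2024-03-02.jsonl.gz"                      # sentinel
    printf '{"event":"x"}\n' | gzip > "$tmp/events-2024-03-01.jsonl.gz"
    for f in "$tmp"/events-*.jsonl.gz; do
      [[ -s "$f" ]] && echo "would upload $(basename "$f")"
    done
    ```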

  3. scripts/migration/README.md, line 137-138 (link)

    P2 zcat verification fails for 0-byte empty-day sentinels

    The spot-check commands run zcat unconditionally. For any legitimately empty day its file is a 0-byte non-gzip sentinel; zcat will exit non-zero and print "not in gzip format". This will silently bias event counts in the verification math (wc -l will be 0 from an error exit). A note to skip or handle 0-byte files (e.g., [[ -s "$f" ]] && zcat "$f" | wc -l) would prevent confusing output during post-run verification.

