chore(scripts/migration): resumable Mixpanel raw-event export#7197
chore(scripts/migration): resumable Mixpanel raw-event export#7197
Conversation
Streams data.mixpanel.com/api/2.0/export one UTC day at a time, pacing under the 60 queries/hour rate limit, into gzipped JSONL chunks for handoff to PostHog managed migration (or cold archive of the historical Mixpanel record before decommission). Designed for unattended overnight runs: - flock guard against concurrent invocations - atomic per-day write (.tmp -> rename only after successful gzip) - resume drives off disk state, not the manifest, so any crash mode (Ctrl+C, kill -9, OOM, VM reboot) leaves at most one in-flight day to redo - 5-attempt exponential backoff on 5xx/timeout, infinite backoff on 429 (does not consume retry budget), hard abort on 401/403 - 0-byte sentinel distinguishes legitimately empty days from unfetched ones - optional GCS mirror per chunk via gsutil Auth needs a Mixpanel service account (Org Settings -> Service Accounts) with Consumer role on the target project. README documents the full operational flow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents env vars, output layout, resume semantics, failure-handling matrix, run command (tmux), and post-run verification queries (coverage gaps, manifest totals vs Mixpanel insights). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Greptile SummaryAdds a new
Confidence Score: 3/5The script is safe to merge as a non-production migration tool, but should not be run against real credentials until the 429 backoff cap is added — a sustained rate-limit burst would freeze the entire unattended export without any recovery path. The 429 handler resets scripts/migration/mixpanel_export.sh — specifically the 429 branch in Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A([Start]) --> B{Pre-flight checks\ndisk space + auth probe}
B -->|fail| Z1([exit 1/2])
B -->|pass| C[Build todo list\nskip is_day_done days]
C --> D{todo empty?}
D -->|yes| Z2([exit 0])
D -->|no| E[fetch_one d]
E --> F{curl response}
F -->|200 empty| G[Write 0-byte sentinel\nappend manifest]
F -->|200 non-empty| H[gzip raw to gz.tmp\nmv to final file\nappend manifest]
F -->|429| I[sleep backoff\nbackoff x2 no cap\nattempts-- no limit]
F -->|401/403| Z3([exit 2])
F -->|5xx/timeout| J[sleep backoff\nbackoff x2\nattempts++]
I --> F
J --> K{attempts >= 5?}
K -->|no| F
K -->|yes| L[log GIVEUP\nwrite failures.jsonl]
G --> M{MP_GCS_BUCKET set?}
H --> M
L --> N[failed++]
M -->|yes| O[gsutil cp 0-byte files also uploaded]
M -->|no| P[pace sleep]
O --> P
N --> P
P --> Q{more days?}
Q -->|yes| E
Q -->|no| R[Print summary]
R --> S{failed > 0?}
S -->|yes| Z4([exit 1])
S -->|no| Z5([exit 0])
|
Summary
scripts/migration/mixpanel_export.sh— streamsdata.mixpanel.com/api/2.0/exportone UTC day at a time into gzipped JSONL chunks, paced under the 60-queries/hour rate limit. Designed for unattended overnight runs that can be safely interrupted and resumed.scripts/migration/README.md— operational guide: env vars, output layout, resume semantics, failure-handling matrix, run command (tmux), post-run verification.Operational tooling for the Mixpanel → PostHog migration follow-on (PostHog managed migration takes service-account creds + a date range; this gives us a recoverable local archive of the same data — and feeds the same JSONL/S3 bundle PostHog's importer expects). Not imported by any service.
Why a script and not just
curlManager directive: the export will run for ~14h unattended for the full Mar 2024 → today window (~793 daily chunks at the documented rate-limit floor). Safe resume on any failure mode (Ctrl+C, kill -9, OOM, VM reboot, network drop, 5xx) is the primary requirement.
Resume semantics drive off disk state, not the manifest:
events-YYYY-MM-DD.jsonl.gzexists and either is 0 bytes (legitimately empty day sentinel) or passesgzip -t..tmp, rename only after successful gzip + integrity. Crash mid-stream →.tmpis discarded, day re-fetched.Failure handling
200non-empty200emptyempty=truemanifest entry4295xx/ timeoutfailures.jsonland auto-retried on next run401/403Test plan
bash -n scripts/migration/mixpanel_export.sh(passes locally).MP_SERVICE_USER=x MP_SERVICE_SECRET=y MP_PROJECT_ID=0 MP_OUT=/tmp/mp MP_END=$(date -u -d 'yesterday' +%F) MP_START=$MP_END bash scripts/migration/mixpanel_export.sh— expect auth-probe failure (401) and clean exit, lock released.manifest.jsonlline count, verify resume by Ctrl+C and re-launch mid-run.jq -s 'map(.lines // 0) | add'on manifest should approach Mixpanel$any_eventtotal for the same window (small drift OK from late-arriving events).Out of scope
/scripts/gitignore rule blocks files underscripts/. Force-added with-fto match how the existing tracked scripts inscripts/got there. Not changing the gitignore in this PR — that's a separate cleanup.