Approach: MVP-first → publish to GitHub → continue to v1.0
Stack: Node.js 20+, TypeScript, Vitest, commander, tsup
Cadence: Weekly sprints, large tasks (major deliverables)
Builder: Claude Code
Specs: Detailed requirements (performance budgets, log lifecycle, retention policies, architecture) live in `flight-prd.md`. This plan references the PRD rather than duplicating it.
Ship a working MCP flight recorder to GitHub. No progressive disclosure, no TUI, no replay. Prove the core value: transparent proxy + structured logging + CLI inspection.
Task 1: Initialize project and CI ✅ Done
- ✅ TypeScript project with `tsup` build, `vitest` test runner, `commander` CLI
- ✅ `tsconfig.json`, `package.json` (name: `flight-proxy`, bin: `flight`), `.gitignore`, ESLint
- ✅ GitHub Actions CI: lint + type-check + test on push/PR (`ci.yml`)
- ✅ Release workflow: on tag push → build → npm publish (`release.yml`)
- ✅ First test passing (76 tests across 16 test files)
Task 2: STDIO proxy core ✅ Done
- ✅ Bidirectional STDIO pipe: `process.stdin → [intercept] → upstream.stdin` and reverse
- ✅ Spawn upstream process from CLI args: `flight proxy --cmd <command> -- <args>`
- ✅ Streaming NDJSON parser via `readline.createInterface`
- ✅ Handle partial messages, chunked streams, upstream crashes (`close`/`error` events), and clean exit (SIGTERM/SIGINT)
- ✅ Backpressure: pass-through via Node stream implicit backpressure
- ✅ Unit tests for JSON-RPC parser (10 tests), integration test with mock MCP server (7 tests)
- ✅ BONUS: Auto-retry for read-only tools (500ms delay, single attempt, permanent error exclusion)
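The partial-message and chunked-stream handling listed above can be illustrated with a small line buffer. This is a minimal sketch, not the project's parser: the plan states that parsing goes through `readline.createInterface`, and the hand-rolled `NdjsonBuffer` class below (a hypothetical name) is swapped in only to make the chunk-boundary logic explicit.

```typescript
// Hypothetical helper: accumulates raw chunks and emits one result per
// complete newline-terminated line. Not the project's actual parser.
type ParseResult = { ok: true; message: unknown } | { ok: false; raw: string };

class NdjsonBuffer {
  private pending = "";

  // Feed a chunk of any size; returns results for every line completed so far.
  push(chunk: string): ParseResult[] {
    this.pending += chunk;
    const lines = this.pending.split("\n");
    // The last element is an incomplete line ("" if the chunk ended on "\n").
    this.pending = lines.pop() ?? "";
    const results: ParseResult[] = [];
    for (const line of lines) {
      const trimmed = line.trim();
      if (trimmed === "") continue; // skip blank lines
      try {
        results.push({ ok: true, message: JSON.parse(trimmed) });
      } catch {
        // Malformed line: report it instead of throwing so the proxy keeps running.
        results.push({ ok: false, raw: line });
      }
    }
    return results;
  }
}
```

A message split across two chunks yields nothing on the first `push` and one parsed message on the second; a truncated JSON line comes back as `{ ok: false }` rather than raising.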
Task 3: Structured logging system ✅ Done
- ✅ Async `.jsonl` writer with dedicated write queue (max depth 1,000; drop with warning on overflow, never block proxy)
- ✅ Full log schema: `session_id`, `call_id`, `timestamp`, `latency_ms`, `direction`, `method`, `tool_name`, `payload`, `error`, `hallucination_hint`, `pd_active`, `schema_tokens_saved`
- ✅ Flush every 100ms or on session close (sync flush in signal handlers)
- ✅ Per-session log files: `~/.flight/logs/<session_id>.jsonl`
- ✅ Startup disk check: if <100MB free, warn and disable logging
- ✅ Per-session 50MB log size cap
- ✅ Secret redaction: strip configured env var values and regex patterns before writing
- ✅ `hallucination_hint` detection: flag when client sends follow-up after upstream error without retrying (30s window)
- ✅ BONUS: Loop detection (5 identical calls within 60s → alert)
- ✅ BONUS: Alert system (`alerts.jsonl` + stderr callback)
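The write-queue behaviour described above (bounded depth, drop on overflow, batch flush) might look roughly like the following. It is a sketch under assumptions: the class name `BoundedLogQueue`, the injected `Sink` callback, and the drop counter are illustrative, and file I/O is left out so the queueing logic stands alone.

```typescript
// Sketch of a bounded, non-blocking log queue. The sink is injected so the
// example needs no file I/O; a real writer would append to a .jsonl file.
type Sink = (lines: string[]) => void;

class BoundedLogQueue {
  private queue: string[] = [];
  private dropped = 0;
  private readonly sink: Sink;
  private readonly maxDepth: number;

  constructor(sink: Sink, maxDepth = 1000) {
    this.sink = sink;
    this.maxDepth = maxDepth;
  }

  // Never blocks: when the queue is full the entry is dropped and counted.
  enqueue(entry: object): boolean {
    if (this.queue.length >= this.maxDepth) {
      this.dropped += 1;
      return false;
    }
    this.queue.push(JSON.stringify(entry));
    return true;
  }

  // Hand all queued lines to the sink as one batch and reset the queue.
  flush(): number {
    if (this.queue.length === 0) return 0;
    const batch = this.queue;
    this.queue = [];
    this.sink(batch);
    return batch.length;
  }

  // Number of entries dropped so far; a caller could warn when this grows.
  get droppedCount(): number {
    return this.dropped;
  }
}
```

In this sketch the 100ms cadence would come from the caller, for example `setInterval(() => queue.flush(), 100)`, with one extra `flush()` on session close.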
Sprint 1 exit criteria: ✅ `flight proxy --cmd echo -- "hello"` works end-to-end, logs written, tests pass.
Task 4: flight init claude command ✅ Done
- ✅ Discover existing `claude_desktop_config.json` from platform-specific path (macOS, Linux, Windows)
- ✅ For each `mcpServers` entry, wrap with Flight proxy (idempotency-checked)
- ✅ Write to `~/.flight/claude_desktop_config_snippet.json` by default
- ✅ `--apply` flag: overwrite in place with `.bak` backup
- ✅ No existing config: generate example snippet with placeholder entry
- ✅ BONUS: `flight init claude-code` with `--scope user|project` and `claude mcp add-json` command generation
- ✅ BONUS: `flight setup` combining `initClaudeCode --apply` + `installHooks` with `--remove` to undo
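The idempotent wrapping of `mcpServers` entries can be sketched as a pure function over the config object. The wrapped shape below follows the `flight proxy --cmd <command> -- <args>` form from Task 2; the function names and the rule "an entry is already wrapped when it launches `flight proxy`" are assumptions made for illustration, not the project's actual check.

```typescript
interface McpServerEntry {
  command: string;
  args?: string[];
  env?: Record<string, string>;
}

interface ClaudeDesktopConfig {
  mcpServers?: Record<string, McpServerEntry>;
}

// Assumption: an entry counts as wrapped when it already launches `flight proxy`.
function isWrapped(entry: McpServerEntry): boolean {
  return entry.command === "flight" && entry.args?.[0] === "proxy";
}

// Rewrite one entry so the original command runs behind the proxy.
function wrapEntry(entry: McpServerEntry): McpServerEntry {
  if (isWrapped(entry)) return entry; // idempotent: leave wrapped entries alone
  return {
    ...entry,
    command: "flight",
    args: ["proxy", "--cmd", entry.command, "--", ...(entry.args ?? [])],
  };
}

// Return a new config with every server wrapped; the input is not mutated.
function wrapConfig(config: ClaudeDesktopConfig): ClaudeDesktopConfig {
  const wrapped: Record<string, McpServerEntry> = {};
  for (const [name, entry] of Object.entries(config.mcpServers ?? {})) {
    wrapped[name] = wrapEntry(entry);
  }
  return { ...config, mcpServers: wrapped };
}
```

Running `wrapConfig` on its own output returns an equivalent config, which is the property an idempotency check needs so that repeated `init` runs do not nest proxies.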
Task 5: Log inspection CLI ✅ Done
- ✅ `flight log list`: table of sessions (ID, date, call count, error count)
- ✅ `flight log tail [--session <id>]`: live stream with `fs.watch` and byte-offset reads
- ✅ `flight log view <session-id>`: timeline with header summary
- ✅ `flight log filter --tool <name> | --errors | --hallucinations`
- ✅ `flight log inspect <call-id>`: pretty-print full request/response (prefix match search)
- ✅ BONUS: `flight log alerts`: severity-colored alert listing
- ✅ BONUS: `flight log summary`: session summary with ASCII timeline
- ✅ BONUS: `flight log gc`: garbage collect by session count / byte limit
- ✅ BONUS: `flight log prune`: prune by date or keep count
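The filter and prefix-match lookups above reduce to simple predicates over parsed `.jsonl` entries. The sketch below uses a subset of the log schema from Task 3; the field types, the function names, and the choice to combine filters with AND are assumptions.

```typescript
// Subset of the log schema; the field types here are assumptions.
interface LogEntry {
  call_id: string;
  tool_name?: string;
  error?: unknown;
  hallucination_hint?: boolean;
}

interface FilterOptions {
  tool?: string;
  errors?: boolean;
  hallucinations?: boolean;
}

// Parse a .jsonl document, ignoring blank lines.
function parseJsonl(text: string): LogEntry[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as LogEntry);
}

// Keep entries that satisfy every requested filter (combined with AND).
function filterEntries(entries: LogEntry[], opts: FilterOptions): LogEntry[] {
  return entries.filter((e) => {
    if (opts.tool !== undefined && e.tool_name !== opts.tool) return false;
    if (opts.errors && (e.error === undefined || e.error === null)) return false;
    if (opts.hallucinations && e.hallucination_hint !== true) return false;
    return true;
  });
}

// Prefix search as used by an inspect-style lookup.
function findByCallIdPrefix(entries: LogEntry[], prefix: string): LogEntry[] {
  return entries.filter((e) => e.call_id.startsWith(prefix));
}
```

A prefix that matches more than one call returns all of them, leaving it to the caller to ask for a longer prefix.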
Sprint 2 exit criteria: ✅ Can install, run `flight init claude`, proxy a real MCP server, and inspect the session with CLI commands.
Task 6: Integration testing + benchmarks ✅ Done
- ✅ E2E test with mock MCP server (7 integration tests covering round-trip, error capture, retry, hallucination)
- ✅ Benchmark two load profiles: 100 small calls + 10 large calls (vitest), 1000 small + 50 large (standalone bench)
- ✅ Secret redaction tests (env var + regex patterns)
- ✅ Loop detection tests
- ✅ Lifecycle tests (compress, GC, prune)
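The redaction behaviour exercised by the secret redaction tests (env var values plus regex patterns) could be sketched as below. The `redact` signature and the `[REDACTED]` placeholder are hypothetical; the plan only states that configured env var values and regex patterns are stripped before writing.

```typescript
// Replace literal secret values and regex matches with a placeholder.
// The function name, signature, and placeholder text are illustrative assumptions.
function redact(
  text: string,
  secretValues: string[],
  patterns: RegExp[],
  placeholder = "[REDACTED]"
): string {
  let out = text;
  // Longest values first, so a secret containing a shorter secret is fully removed.
  const values = [...secretValues].filter((v) => v !== "").sort((a, b) => b.length - a.length);
  for (const value of values) {
    out = out.split(value).join(placeholder); // literal match, no regex escaping needed
  }
  for (const pattern of patterns) {
    // Force the global flag so every occurrence is replaced.
    const flags = pattern.flags.includes("g") ? pattern.flags : pattern.flags + "g";
    out = out.replace(new RegExp(pattern.source, flags), () => placeholder);
  }
  return out;
}
```

One limitation of this sketch: it matches raw text, so a secret whose characters get escaped during JSON serialization would need to be redacted before serialization instead.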
Task 7: README + publish ✅ Done
- ✅ README with CI badge, system diagram, install/quickstart, feature overview, limitations
- ✅ `LICENSE` (MIT)
- ✅ CI + release workflows
Sprint 3 exit criteria: ✅ Published on GitHub. `npm install -g flight-proxy && flight init claude` works.
- ✅ Package is live on GitHub
- ✅ `flight proxy --cmd echo -- "hello"` produces logs, CLI inspection works
- ⬜ Publish to npm (pending: `npm publish` with tag)
- ⬜ At least 1 real captured session demonstrates hallucination detection
- ⬜ Feedback from 1-3 early users confirms flight recorder value
Build a hybrid simulation framework in test/simulate/ that generates realistic MCP session data for PD validation, stress testing, and research corpus building.
Task 8a: Mock MCP server ecosystem ✅ Done
- ✅ Filesystem server (`mock-fs-server.ts`), 10 tools: `read_file`, `write_file`, `list_directory`, `search_files`, `create_directory`, `delete_file`, `move_file`, `get_file_info`, `read_multiple_files`, `file_exists`
- ✅ Git server (`mock-git-server.ts`), 10 tools: `git_status`, `git_diff`, `git_log`, `git_commit`, `git_branch_list`, `git_checkout`, `git_add`, `git_stash`, `git_blame`, `git_show`
- ✅ Web/API server (`mock-web-server.ts`), 10 tools: `fetch_url`, `search_web`, `parse_html`, `http_request`, `download_file`, `check_url_status`, `extract_links`, `screenshot_url`, `api_request`, `websocket_connect`
- ✅ All servers expose 10 tools each via standard MCP `tools/list` with full `inputSchema`
- ✅ Configurable error injection via `MOCK_ERROR_RATE` env var + `MOCK_LATENCY_MS` for artificial delay
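Error injection driven by `MOCK_ERROR_RATE` and `MOCK_LATENCY_MS` might be read and applied as in this sketch. It assumes the rate is a fraction between 0 and 1 (matching the runner's `--error-rate <0-1>` flag) and that invalid values fall back to 0; the function names are hypothetical, and the random source is injectable so the behaviour is testable.

```typescript
interface MockBehavior {
  errorRate: number; // assumed to be a fraction in [0, 1]
  latencyMs: number; // artificial delay in milliseconds
}

// Read the two env vars, clamping to sane ranges and defaulting to 0.
function readMockBehavior(env: Record<string, string | undefined>): MockBehavior {
  const rate = Number(env.MOCK_ERROR_RATE ?? "0");
  const latency = Number(env.MOCK_LATENCY_MS ?? "0");
  return {
    errorRate: Number.isFinite(rate) ? Math.min(1, Math.max(0, rate)) : 0,
    latencyMs: Number.isFinite(latency) ? Math.max(0, latency) : 0,
  };
}

// Decide per call whether to return an injected error.
function shouldInjectError(
  behavior: MockBehavior,
  random: () => number = Math.random
): boolean {
  return random() < behavior.errorRate;
}
```

A mock server would typically call `readMockBehavior(process.env)` once at startup and `shouldInjectError` on each `tools/call`.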
Task 8b: Synthetic client + scenario runner ✅ Done
- ✅ 5 scenario modules in `scenarios.ts`: file-edit (8 steps), debug (10 steps), git-workflow (8 steps), error-recovery (10 steps), multi-tool (12 steps)
- ✅ Scenario runner (`runner.ts`): `--pd`/`--no-pd`, `--sessions <n>`, `--scenario <name>`/`--all`, `--error-rate <0-1>`, `--quiet`
- ✅ Each run produces real `.jsonl` session logs in temp directories
- ✅ Formatted summary table with per-scenario success/failure counts
Task 8c: PD validation via simulation ✅ Done
- ✅ `validate-pd.ts` runs all scenarios with PD on vs off
- ✅ Compares success rates, error counts, token savings, `pd_active` status
- ✅ GO decision: PD success rate identical to passthrough (93.8% both), 1748 schema tokens saved
- ✅ Automated go/no-go threshold: warns if PD drops >20% vs passthrough
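The automated go/no-go threshold can be expressed as a small comparison of two success rates. This sketch assumes the ">20%" drop is measured in percentage points; the plan does not say whether it is absolute or relative, and the type and function names are illustrative.

```typescript
interface RunStats {
  calls: number;
  successes: number;
}

// Success rate as a percentage (0 when there were no calls).
function successRate(stats: RunStats): number {
  return stats.calls === 0 ? 0 : (stats.successes * 100) / stats.calls;
}

// Compare PD against passthrough. Assumption: the threshold is in percentage points.
function pdDecision(
  passthrough: RunStats,
  pd: RunStats,
  maxDropPoints = 20
): { decision: "GO" | "NO-GO"; dropPoints: number } {
  const dropPoints = successRate(passthrough) - successRate(pd);
  return { decision: dropPoints > maxDropPoints ? "NO-GO" : "GO", dropPoints };
}
```

Identical success rates in both modes, as reported above, give a drop of 0 and therefore a GO under this rule.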
Task 8d: Stress + corpus generation ✅ Done
- ✅ `stress.ts` runs 5 sessions × 5 scenarios = 25 sessions, 240 calls with 10% error rate
- ✅ Verified: 88.8% success rate under error injection, 316.6 KB logs, all 25 sessions produced log files
- ✅ Log lifecycle verified under sustained load
Future: Claude API-driven validation (post-v1.0)
- Add `--api` mode to scenario runner that sends real prompts via Anthropic SDK
- Claude makes real tool-use decisions through Flight proxy → ground-truth PD validation
- Requires `ANTHROPIC_API_KEY`, costs API credits
Sprint 4 exit criteria: ✅ Simulation framework generates realistic multi-session data. PD GO decision made. Stress test passes.
Task 9: PD core (single-server) ✅ Done (implemented ahead of schedule)
- ✅ Schema cache: in-memory via `tools/list` at proxy startup
- ✅ Intercept `tools/list`: replace with `discover_tools` + `execute_tool` meta-tools
- ✅ `discover_tools` handler: case-insensitive keyword match, name matches first
- ✅ `execute_tool` routing: translate to `tools/call`, unwrap response, unknown tool → JSON-RPC error
- ✅ Forward upstream errors with original message preserved
- ✅ Config: `--pd` flag on `flight proxy`
- ✅ `pd_active` and `schema_tokens_saved` correctly populated in log entries when PD is active
- ❌ Disk-based schema cache (parameter accepted, not implemented; schemas are in-memory only)
- ✅ MCP protocol version logging (logged from initialize response, stderr warning for visibility)
- ✅ Fallback: schema interception failure → silently drop to passthrough with warning + log entry
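The two meta-tool handlers above can be sketched over an in-memory schema cache. This is an illustration under assumptions, not the project's code: it treats the query as a single substring, also searches descriptions after names, and uses JSON-RPC code `-32602` (invalid params) for unknown tools. The `tools/call` parameter shape (`name`, `arguments`) follows the MCP specification.

```typescript
interface ToolSchema {
  name: string;
  description?: string;
  inputSchema?: object;
}

type RpcId = number | string;
interface JsonRpcRequest { jsonrpc: "2.0"; id: RpcId; method: string; params?: unknown }
interface JsonRpcError { jsonrpc: "2.0"; id: RpcId; error: { code: number; message: string } }

// Case-insensitive keyword match; name matches are listed before tools
// that only match in their description.
function discoverTools(cache: ToolSchema[], query: string): ToolSchema[] {
  const q = query.trim().toLowerCase();
  if (q === "") return [];
  const nameHits: ToolSchema[] = [];
  const descriptionHits: ToolSchema[] = [];
  for (const tool of cache) {
    if (tool.name.toLowerCase().includes(q)) nameHits.push(tool);
    else if ((tool.description ?? "").toLowerCase().includes(q)) descriptionHits.push(tool);
  }
  return [...nameHits, ...descriptionHits];
}

// Translate an execute_tool call into an upstream tools/call request,
// or return a JSON-RPC error when the tool is not in the cache.
function routeExecuteTool(
  cache: ToolSchema[],
  id: RpcId,
  toolName: string,
  args: Record<string, unknown>
): JsonRpcRequest | JsonRpcError {
  if (!cache.some((t) => t.name === toolName)) {
    // Assumption: -32602 (invalid params) is used for unknown tools.
    return { jsonrpc: "2.0", id, error: { code: -32602, message: `Unknown tool: ${toolName}` } };
  }
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name: toolName, arguments: args } };
}
```

Keeping both handlers as pure functions over the cache means they can be tested without spawning an upstream server.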
Task 10: Token metrics + stats command ✅ Done
- ✅ `estimateTokenSavings()` computes original vs reduced token count (JSON length / 4)
- ✅ `schema_tokens_saved` logged per `tools/list` interception when PD is active
- ✅ `flight stats <session-id>`: per-session token savings, tool breakdown
- ✅ `flight stats` (no args): aggregate stats across recent sessions
- ✅ Test: 1 tool vs 10+ tools; verified ~6x at 10 tools, ~22x at 30 tools, ~37x at 50 tools (5 tests in `pd-token-reduction.test.ts`)
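The "JSON length / 4" heuristic can be written out in a few lines. The sketch below is not the project's `estimateTokenSavings()`; the names `estimateTokens` and `estimateSavingsSketch`, the return shape, and rounding up are assumptions.

```typescript
// Rough token estimate: serialized JSON length divided by 4.
// Rounding up is an assumption; the plan only states "JSON length / 4".
function estimateTokens(value: unknown): number {
  return Math.ceil(JSON.stringify(value).length / 4);
}

interface SavingsEstimate {
  original: number; // tokens for the full upstream tool list
  reduced: number;  // tokens for the meta-tool list sent to the client
  saved: number;    // original - reduced, never negative
  ratio: number;    // original / reduced (0 when reduced is 0)
}

// Hypothetical helper comparing the full tool list against the reduced one.
function estimateSavingsSketch(originalTools: unknown[], reducedTools: unknown[]): SavingsEstimate {
  const original = estimateTokens(originalTools);
  const reduced = estimateTokens(reducedTools);
  return {
    original,
    reduced,
    saved: Math.max(0, original - reduced),
    ratio: reduced === 0 ? 0 : original / reduced,
  };
}
```

Because the reduced list has a fixed size (two meta-tools) while the original grows with the number of upstream tools, the ratio rises with tool count, which is consistent with the trend in the test results above.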
Sprint 5 exit criteria: ✅ PD mode works with logging, fallback, and version tracking. Real Claude Desktop validation (Sprint 4) still needed.
Task 11: Export and replay ✅ Done
- ✅ `flight export <session-id> --format csv`: flat CSV with proper escaping, filtering options
- ✅ `flight export <session-id> --format jsonl`: cleaned JSONL with `--include-payload` option
- ✅ `flight replay <call-id> --cmd <server>`: re-execute request against upstream (initializes MCP, sends recorded payload)
- ✅ `flight replay <call-id> --dry-run`: show what would execute without side effects
- ✅ Aggregate stats covered by `flight stats` (no-args mode)
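"Proper escaping" for the CSV export most likely means RFC 4180 quoting: fields containing commas, quotes, or line breaks are wrapped in double quotes, and embedded quotes are doubled. A minimal sketch follows, with hypothetical function names and a column list chosen by the caller.

```typescript
// Quote a single field per RFC 4180. Objects are serialized as JSON first.
function csvField(value: unknown): string {
  let s: string;
  if (value === null || value === undefined) s = "";
  else if (typeof value === "object") s = JSON.stringify(value);
  else s = String(value);
  // Quote only when needed; embedded double quotes are doubled.
  return /[",\r\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

// Build a CSV document with a header row and CRLF line endings.
function toCsv(rows: Record<string, unknown>[], columns: string[]): string {
  const header = columns.map(csvField).join(",");
  const body = rows.map((row) => columns.map((c) => csvField(row[c])).join(","));
  return [header, ...body].join("\r\n") + "\r\n";
}
```

Serializing nested values such as `payload` to JSON inside one quoted field keeps the export flat, at the cost of requiring a second parse step on the consumer side.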
Sprint 6 exit criteria: ✅ Can export a session to CSV, replay a call, and view aggregate stats.
Task 12: Safety hardening ✅ Done
- ✅ Configurable redaction patterns: env vars list + regex patterns
- ✅ Loop detection: 5 identical calls within 60s → alert
- ✅ Log lifecycle: compress sessions >1 day, GC by count/bytes, `flight log gc` and `flight log prune`
- ✅ Graceful upstream crash recovery, clear error messages, signal handling
- ✅ Fuzz test suite (11 tests: empty input, whitespace, truncated JSON, binary garbage, long lines, interleaved garbage, chunked messages, null bytes, arrays, rapid succession, nested special chars)
- ✅ Disk-full condition tests (6 tests: startup disk check, per-session 50MB cap, graceful degradation, alert independence, statfs error handling)
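The loop detection rule from this task (5 identical calls within 60s) fits a sliding-window counter. In the sketch below, "identical" is assumed to mean the same tool name and the same JSON-serialized arguments; the `LoopDetector` class is hypothetical, and the clock is passed in so the logic is deterministic.

```typescript
// Sliding-window loop detector. Assumption: two calls are "identical" when the
// tool name and JSON-serialized arguments match (object key order matters).
class LoopDetector {
  private readonly seen = new Map<string, number[]>();
  private readonly threshold: number;
  private readonly windowMs: number;

  constructor(threshold = 5, windowMs = 60000) {
    this.threshold = threshold;
    this.windowMs = windowMs;
  }

  // Record a call at time nowMs; returns true when the number of identical
  // calls inside the window has reached the threshold.
  record(toolName: string, args: unknown, nowMs: number): boolean {
    const key = `${toolName}:${JSON.stringify(args)}`;
    const recent = (this.seen.get(key) ?? []).filter((t) => nowMs - t < this.windowMs);
    recent.push(nowMs);
    this.seen.set(key, recent);
    return recent.length >= this.threshold;
  }
}
```

Timestamps older than the window are discarded on every call, so slow repetition (for example one call every 20 seconds) never reaches the threshold.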
Task 13: TUI dashboard ❌ Not Started
- ❌ All TUI features deferred (CLI is sufficient per cut line)
Sprint 7 exit criteria: ✅ Core hardening complete. Fuzz tests + disk-full tests pass. TUI deferred (CLI sufficient).
Task 14: Documentation ✅ Done
- ✅ `README.md` exists with CI badge, install, quickstart
- ✅ `ARCHITECTURE.md`: system diagram, JSON-RPC flow, PD algorithm, log schema reference, design decisions
- ✅ `RESEARCH.md`: data collection guide, export formats, PD research, hallucination/loop detection limitations
- ✅ `CONTRIBUTING.md`: dev setup, project structure, testing guidelines, PR process
- ✅ `FAQ.md`: 10 common questions with concise answers
Task 15: Release ❌ Not Started
- ✅ CI release workflow ready (tag → build → npm publish)
- ❌ Version bump to 1.0.0
- ❌ GitHub release with changelog
- ❌ Publish to npm
- ❌ Sample anonymized session datasets (generated via `test/simulate/`)
Sprint 8 exit criteria: 🔧 Documentation complete. Release pending version bump + npm publish.
If time runs short, cut in this order (bottom first):
- TUI (Task 13) — CLI is sufficient ← ALREADY CUT
- Replay (part of Task 11) — export is more valuable ← DONE
- Loop detection (part of Task 12) — nice-to-have safety feature ← DONE EARLY
- Log compression (part of Task 12) — manual cleanup is fine for early users ← DONE EARLY
- Never cut: proxy core, logging, CLI inspection, PD, README ← ALL DONE
- ✅ `npm install -g flight-proxy` works (build + bin configured)
- ✅ <5ms added latency (benchmarks show 4762 calls/sec for small messages)
- ✅ 100% JSON-RPC fidelity (integration tests verify round-trip)
- ✅ `flight init claude` sets up in <2 minutes
- ✅ Published on GitHub with README
- ✅ 5-37x token reduction with PD (experimental, off by default) — verified with synthetic schemas, pending real AI validation
- ✅ Research-grade log export (CSV/JSONL)
- ✅ Hardened for production use — core hardening + fuzz tests + disk-full tests complete
- ✅ Complete documentation (ARCHITECTURE.md, RESEARCH.md, CONTRIBUTING.md, FAQ.md)
- ⬜ 200+ GitHub stars target within 3 months
- v1.1: Multi-server PD, `flight analyze` command, Parquet export, Claude API-driven simulation mode
- v1.2: Experiment mode (`--condition` labels for A/B), anomaly detection, continuous simulation in CI
- v2.0: HTTP transport, non-MCP framework support, plugin system, potential Rust rewrite
If Claude Code ships a formal extension/plugin API, Flight could migrate from the current proxy + hooks architecture to a native extension. This would enable:
- Deeper UI integration (inline panels, status indicators)
- Direct access to Claude's context window for smarter PD
- No MCP config wrapping needed at all
For now, the proxy + Claude Code hooks approach (`flight setup`) is the pragmatic choice that works with the current Claude Code architecture. The zero-config setup via hooks (`SessionStart`/`SessionEnd`) combined with MCP config wrapping provides the best balance of integration depth and stability.