Add --log-buffer for periodic buffer sync during scans #287
tbroadley wants to merge 7 commits into meridianlabs-ai:main from
Conversation
Flush intermediate scan results to the destination directory every N transcripts, providing crash-resilient resume and intermediate visibility.

- RecorderBuffer accepts synced_ids to skip already-synced transcripts
- FileRecorder.flush() compacts buffer parquets to scan dir
- FileRecorder.resume() reads synced transcript IDs from existing parquets
- --log-buffer / SCOUT_SCAN_LOG_BUFFER available on scan and scan resume CLI
- scanner_table() runs on a worker thread during flush to avoid blocking

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
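To make the moving parts concrete, here is a minimal sketch of the buffering scheme this commit describes. All names, fields, and signatures below are illustrative guesses, not the PR's actual code; the real RecorderBuffer and FileRecorder live in the scout codebase and differ in detail.

```python
# Minimal sketch of the buffering scheme: skip transcripts whose IDs were
# already synced in a previous run, and signal a flush every N new records.
from dataclasses import dataclass, field


@dataclass
class BufferSketch:
    log_buffer: int                                     # flush every N transcripts
    synced_ids: set[str] = field(default_factory=set)   # seeded from parquets on resume
    pending: list[dict] = field(default_factory=list)

    def is_recorded(self, transcript_id: str) -> bool:
        # Transcripts flushed by a previous run are skipped when resuming.
        return transcript_id in self.synced_ids

    def record(self, transcript_id: str, row: dict) -> bool:
        """Buffer one result; return True when a flush is due."""
        if self.is_recorded(transcript_id):
            return False
        self.pending.append({"transcript_id": transcript_id, **row})
        return len(self.pending) >= self.log_buffer

    def drain(self) -> list[dict]:
        """Hand pending rows to the recorder's flush step and mark them synced."""
        rows, self.pending = self.pending, []
        self.synced_ids.update(r["transcript_id"] for r in rows)
        return rows
```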
scanner_table() compaction now runs on the event loop thread instead of a worker thread. This prevents races with concurrent record() calls that write new per-transcript parquets to the buffer directory. The remaining await points (fs.write_file) can still allow summary drift, but this is benign for resumption since _read_synced_ids reads from parquets.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
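The property this commit leans on is asyncio's cooperative scheduling: code between two awaits runs atomically with respect to other tasks on the same loop. A toy illustration of my reading of the commit, not the PR's code:

```python
# Toy illustration: compact() has no await between snapshotting and clearing
# the buffer, so no concurrent record() task can interleave. The equivalent
# work on a worker thread would have no such guarantee.
import asyncio

buffer_files: list[str] = []

async def record(i: int) -> None:
    buffer_files.append(f"transcript-{i}.parquet")  # new per-transcript parquet

async def compact() -> list[str]:
    compacted = list(buffer_files)  # snapshot...
    buffer_files.clear()            # ...and clear, with no await in between
    return compacted

async def main() -> None:
    await asyncio.gather(*(record(i) for i in range(5)))
    done = await compact()
    print(f"{len(done)} files compacted, {len(buffer_files)} pending")

asyncio.run(main())
```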
Hey @tbroadley, I looked over your PR to get it review-ready. Sadly I can't push to it, and I'm a bit too lazy to fork and open a PR against your PR, but this is the only change I would make:

Race condition and self-inconsistent state analysis

Thought about with CC whether flush can leave self-inconsistent state.

The three potential inconsistencies
Analysis

Race 1 (inline eliminates this): losing a few transcripts per flush is acceptable; the real value of the buffer is crash-resilient resume, which is unaffected.

Races 2/3 (happen regardless, due to await points): more concerning if downstream services can't reconcile discrepancies, but the window is tiny (~1-10 ms during await points).

All races: fixed on resume.

Decision

Keep inline execution.
Note: this is my first time looking at this codebase.
Test Report: --log-buffer Feature (PR #287)

Feature Under Test

The --log-buffer N CLI option enables periodic flushing of scan results to the destination directory every N transcripts, providing crash-resilient resume for long-running scans.

Test Environment
Test Procedure
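(The procedure steps are collapsed in this view. Judging from the feature description, the shape of the run would be: start scout scan with --log-buffer N set, or via the SCOUT_SCAN_LOG_BUFFER environment variable, interrupt the scan partway, then run scout scan resume against the same destination and verify that already-flushed transcripts are not re-processed. The exact commands are not recoverable here.)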
Findings

[findings table not recoverable from this view]

Conclusion

The --log-buffer feature works as designed. Crash-resilient resume from S3 storage is functional: interrupted scans can be resumed without re-processing already-flushed transcripts.
I also did this with more data, and it correctly flushed (does "flushed" have the right connotation here?) multiple parquet files.
- Save log_buffer to ScanOptions so it persists in _scan.json and is automatically restored on resume (addresses tbroadley review)
- Change --log-buffer from type=int to click.IntRange(min=1) to reject zero/negative values at CLI level (addresses QuantumLove suggestion)
- Update docstrings per review suggestions
- Regenerate openapi.json and generated.ts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
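For context, this is roughly what the click.IntRange change looks like. The option name and environment variable come from this PR; the surrounding command is illustrative, not the PR's actual CLI module.

```python
# Illustrative sketch: click.IntRange(min=1) makes click itself reject
# --log-buffer 0 and negative values before the scan code ever runs.
import click


@click.command()
@click.option(
    "--log-buffer",
    type=click.IntRange(min=1),
    default=None,
    envvar="SCOUT_SCAN_LOG_BUFFER",
    help="Flush intermediate results to the scan directory every N transcripts.",
)
def scan(log_buffer: int | None) -> None:
    click.echo(f"log_buffer={log_buffer}")
```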
When resuming a scan on a different machine (or after disk loss), the local buffer is empty. Without seeding, scanner_table() compaction would overwrite remote parquets with only newly-scanned results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
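A sketch of what the seeding step might look like. The helper name _read_synced_ids comes from the commits above; the use of pyarrow, the local-directory glob, and the transcript_id column name are assumptions for illustration.

```python
# Hypothetical sketch of seeding synced_ids on resume: read the transcript IDs
# already present in the destination parquets so that compaction cannot
# overwrite remote results with only newly-scanned ones.
from pathlib import Path

import pyarrow.parquet as pq


def _read_synced_ids(scan_dir: Path) -> set[str]:
    synced: set[str] = set()
    for path in scan_dir.glob("*.parquet"):
        table = pq.read_table(path, columns=["transcript_id"])
        synced.update(table.column("transcript_id").to_pylist())
    return synced
```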
TODO: think about races between flush and other threads that are actively writing scan results to disk, and maybe writing to _summary.json and _errors.jsonl.

Rafael left some notes about this below.

Summary
- --log-buffer N option to scout scan and scout scan resume that flushes intermediate results to the scan directory every N transcripts
- scanner_table() compaction on a worker thread during flush to avoid blocking the event loop (a later commit above moves compaction inline onto the event loop thread to avoid races with concurrent record() calls)

Test plan
- Unit tests (test_recorder_buffer.py): test_is_recorded_with_synced_ids, test_flush_writes_parquets_and_summary, test_read_synced_ids_from_parquets, test_resume_skips_synced_transcripts
- ruff check and mypy pass on all modified files

🤖 Generated with Claude Code