reliability: implement checkpoint and resume for interrupted runs #7

@more-shubham

Description

If processing crashes partway through a run (for example at image 800 of 1246), the next run restarts from zero and redoes all completed work. For large datasets this wastes significant compute time and is a reliability gap in production.

Scope:

  • Create internal/checkpoint/checkpoint.go managing a checkpoint.json in the output directory
  • Track completed file paths with their output hash
  • On engine startup, load existing checkpoint and skip already-processed files
  • Atomic checkpoint writes (write to temp file, rename) to prevent corruption on crash
  • Add -no-resume flag to force full reprocessing

checkpoint.json shape:

{
  "version": 1,
  "started_at": "2026-02-18T10:00:00Z",
  "completed": ["images/photo1.jpg", "images/photo2.png"],
  "total_processed": 800
}
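
One way to model this shape in Go is sketched below. The type and function names (`Checkpoint`, `parseCheckpoint`, `completedSet`) are assumptions for illustration. The shape above lists only paths, so the per-file output hash mentioned in the scope is left out here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Checkpoint mirrors the checkpoint.json shape shown in this issue.
type Checkpoint struct {
	Version        int       `json:"version"`
	StartedAt      time.Time `json:"started_at"` // RFC 3339 timestamp
	Completed      []string  `json:"completed"`
	TotalProcessed int       `json:"total_processed"`
}

// parseCheckpoint (hypothetical name) decodes raw JSON and rejects
// versions this sketch does not understand.
func parseCheckpoint(data []byte) (*Checkpoint, error) {
	var cp Checkpoint
	if err := json.Unmarshal(data, &cp); err != nil {
		return nil, fmt.Errorf("decode checkpoint: %w", err)
	}
	if cp.Version != 1 {
		return nil, fmt.Errorf("unsupported checkpoint version %d", cp.Version)
	}
	return &cp, nil
}

// completedSet builds a lookup so the engine can test each input path
// without scanning the Completed slice every time.
func (cp *Checkpoint) completedSet() map[string]struct{} {
	set := make(map[string]struct{}, len(cp.Completed))
	for _, p := range cp.Completed {
		set[p] = struct{}{}
	}
	return set
}
```

Rejecting unknown versions lets a later format change fail loudly instead of silently skipping the wrong files.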

Acceptance Criteria:

  • Interrupted run resumes from last checkpoint on restart
  • checkpoint.json is never left in a corrupted state
  • -no-resume flag bypasses checkpoint entirely
  • Checkpoint file excluded from output metrics

Metadata

Assignees

No one assigned

    Labels

    enhancement: New feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
