Commit 21aacfd

Merge branch 'feat/dedup' into 'main'
feat: add per-step state tracking and video date metadata to YT digest

See merge request nwpie/vibe/ai-claude-loop!4
2 parents b7a6605 + 68a643c commit 21aacfd

File tree

3 files changed: +66 additions, -15 deletions
.claude/commands/ai-news-digest-yt.md

Lines changed: 51 additions & 9 deletions
@@ -2,6 +2,14 @@ You are an AI news digest agent specializing in YouTube content from @AIDailyBri

Today's date is {{date}}.

## Step 0: Load State & Identify Incomplete Work

Read `.state/last-digest-yt.json` if it exists. Parse the `video_status` dict (treat as empty `{}` if missing — backward compatible with old state files).

For each entry in `video_status` where `completed` is `false` (or missing), note which steps are already `true` — these will be **skipped** when processing that video later. Incomplete videos are NOT in `posted_video_ids`, so `fetch_recent_videos.py` will re-fetch them automatically.

After each per-video step completes (Steps 2–6), **immediately** save state with that step marked `true` in `video_status`. This ensures progress is preserved if the pipeline crashes mid-run.
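The Step 0 load-and-resume logic can be sketched in Python (a hypothetical helper, not part of this commit; field names follow the schema written in Step 7):

```python
import json
from pathlib import Path

STATE_PATH = Path(".state/last-digest-yt.json")

def load_state() -> dict:
    """Load prior digest state; tolerate a missing file and old schemas."""
    state = json.loads(STATE_PATH.read_text()) if STATE_PATH.exists() else {}
    # Backward compatible: old state files have no video_status dict.
    state.setdefault("video_status", {})
    state.setdefault("posted_video_ids", [])
    return state

def steps_to_skip(state: dict, video_id: str) -> set[str]:
    """Steps already marked true for an incomplete video."""
    entry = state["video_status"].get(video_id, {})
    if entry.get("completed"):
        return set()  # fully completed videos were filtered out by the fetch script
    return {name for name, done in entry.get("steps", {}).items() if done}
```

A video absent from `video_status` yields an empty skip set, so it runs every step from scratch.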

## Step 1: Fetch Recent Videos

Run the fetch script to get videos from the last 24 hours:
@@ -34,6 +42,8 @@ python scripts/yt/get_transcript.py VIDEO_ID

Capture the stdout output as the transcript text. If a video's transcript fails, log a warning and skip that video — continue with others.

After each successful transcript extraction, update `video_status[VIDEO_ID].steps.transcript = true` and save state. If Step 0 shows `transcript: true` for a video, skip transcript extraction and reuse the existing transcript file from `digest-yt/{{date}}/`.

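The recurring "mark step true, then save state" pattern used here and in Steps 2.5–6 might look like this (sketch only; `mark_step` and `save_state` are illustrative names, not committed code):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def mark_step(state: dict, video_id: str, step: str, title: str = "") -> dict:
    """Mark one pipeline step done for a video; caller saves state immediately after."""
    entry = state.setdefault("video_status", {}).setdefault(
        video_id, {"title": title, "steps": {}, "completed": False}
    )
    entry["steps"][step] = True
    entry["last_updated"] = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return state

def save_state(state: dict, path: str = ".state/last-digest-yt.json") -> None:
    """Persist the full state JSON so a crash mid-run loses at most one step."""
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_text(json.dumps(state, indent=2))
```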
## Step 2.5: Download Thumbnails

For each video, download the YouTube thumbnail to the digest directory:
@@ -49,6 +59,8 @@ https://i.ytimg.com/vi/VIDEO_ID/hqdefault.jpg

If a thumbnail download fails, continue without it — the summary will just lack an image.

After each successful thumbnail download, update `video_status[VIDEO_ID].steps.thumbnail = true` and save state. Skip if already `true` from Step 0.

## Step 3: Summarize (Claude does this)

For each video with a successful transcript, YOU (Claude) will:
@@ -58,15 +70,16 @@ For each video with a successful transcript, YOU (Claude) will:
- **English summary**: 2-3 sentences covering the key points
- **繁體中文摘要**: 2-3 sentences in Traditional Chinese covering the same points

-3. Save each summary as markdown to `digest-yt/{{date}}/VIDEO_ID.md` with this format:
+3. Save each summary as markdown to `digest-yt/{{date}}/VIDEO_ID.md` with this format. Convert `upload_date` from YYYYMMDD → YYYY-MM-DD for display. Only include the **Last Modified** line if `modified_date` is present and different from `upload_date`:

```markdown
# Video Title

![Video Title](VIDEO_ID_thumb.jpg)

**Source**: [AI Daily Brief](https://youtube.com/watch?v=VIDEO_ID)
-**Date**: {{date}}
+**Published**: 2026-03-12
+**Last Modified**: 2026-03-13

## English Summary

@@ -79,32 +92,36 @@ For each video with a successful transcript, YOU (Claude) will:

4. Also create two combined digest files. **Each video section MUST include its thumbnail image** (use relative path). If the thumbnail file doesn't exist, omit the image line for that video.

-**`digest-yt/{{date}}/summary_en.md`** — All English summaries combined:
+**`digest-yt/{{date}}/summary_en.md`** — All English summaries combined. Include `*Published: YYYY-MM-DD*` (and `| Modified: YYYY-MM-DD` only when `modified_date` is present) below each video heading:
```markdown
# AI Daily Brief - YouTube Digest {{date}}

## Video Title 1
![Video Title 1](VIDEO_ID_thumb.jpg)
*Published: 2026-03-12 | Modified: 2026-03-13*

2-3 sentence English summary...

## Video Title 2
![Video Title 2](VIDEO_ID_thumb.jpg)
*Published: 2026-03-12*

2-3 sentence English summary...
```

-**`digest-yt/{{date}}/summary_zh-tw.md`** — All zh-TW summaries combined:
+**`digest-yt/{{date}}/summary_zh-tw.md`** — All zh-TW summaries combined, same date format:
```markdown
# AI Daily Brief - YouTube 摘要 {{date}}

## Video Title 1
![Video Title 1](VIDEO_ID_thumb.jpg)
*Published: 2026-03-12 | Modified: 2026-03-13*

繁體中文摘要...

## Video Title 2
![Video Title 2](VIDEO_ID_thumb.jpg)
*Published: 2026-03-12*

繁體中文摘要...
```
@@ -114,6 +131,8 @@ Create the `digest-yt/{{date}}/` directory first:
mkdir -p "digest-yt/{{date}}"
```

After writing summaries for each video, update `video_status[VIDEO_ID].steps.summary = true` and save state. Skip summary generation for videos where `summary: true` from Step 0 — reuse existing markdown files.

## Step 4: Generate HTML and PDF

For each language (en, zh-tw), build HTML then PDF:
@@ -130,6 +149,8 @@ python scripts/yt/build_pdf.py "digest-yt/{{date}}/summary_zh-tw.html" -o "diges

If PDF generation fails, note this and continue — you'll post without PDF links.

After successful HTML+PDF generation, update `video_status[VIDEO_ID].steps.html = true` and `video_status[VIDEO_ID].steps.pdf = true` for all videos, then save state. Skip if already `true` from Step 0.

## Step 5: Upload PDFs to B2

Upload each PDF to Backblaze B2:
@@ -141,6 +162,8 @@ python scripts/yt/upload_b2.py "digest-yt/{{date}}/summary_zh-tw_$(date +%Y%m%d)

Capture the download URLs from stdout. If upload fails, continue without links.

After successful uploads, update `video_status[VIDEO_ID].steps.b2_upload = true` for all videos, then save state. Skip if already `true` from Step 0.

## Step 6: Post to Slack

Build a Slack mrkdwn message and post it. Use this exact format:
@@ -184,9 +207,11 @@ slack_send "$MSG"

If Slack fails, retry once. If it fails again, save to `.state/failed-digest-yt-{{date}}.md`.

-## Step 7: Save State
+After successful Slack post, update `video_status[VIDEO_ID].steps.slack_post = true` and `video_status[VIDEO_ID].completed = true` for each video, then save state.
+
+## Step 7: Final State Save

-Write `.state/last-digest-yt.json`:
+Write `.state/last-digest-yt.json` with the full schema. Only add a video to `posted_video_ids` when its `completed` flag is `true` (all steps including slack_post succeeded):

```json
{
@@ -195,17 +220,33 @@ Write `.state/last-digest-yt.json`:
  "posted_urls": [
    "https://youtube.com/watch?v=id1",
    "https://youtube.com/watch?v=id2"
  ],
  "video_status": {
    "id1": {
      "title": "Video Title Here",
      "steps": {
        "transcript": true,
        "thumbnail": true,
        "summary": true,
        "html": true,
        "pdf": true,
        "b2_upload": true,
        "slack_post": true
      },
      "completed": true,
      "last_updated": "YYYY-MM-DDTHH:MM:SSZ"
    }
  }
}
```
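By the schema above, `completed` is simply the conjunction of the seven step booleans; a minimal sketch (illustrative, not committed code):

```python
REQUIRED_STEPS = ("transcript", "thumbnail", "summary", "html", "pdf", "b2_upload", "slack_post")

def is_completed(entry: dict) -> bool:
    """A video counts as completed only when every pipeline step is marked true."""
    steps = entry.get("steps", {})
    return all(steps.get(s) is True for s in REQUIRED_STEPS)
```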

-Merge with existing state (keep max 30 video IDs from current + previous runs).
+Merge with existing state. Keep max 30 entries in both `posted_video_ids` and `video_status` (trim oldest together). Backward compatible: if existing state lacks `video_status`, treat as empty `{}`.

```bash
mkdir -p .state
```

-**Always save state**, even on partial failure.
+**Always save state**, even on partial failure. Incomplete videos stay in `video_status` (with `completed: false`) but are NOT added to `posted_video_ids` — so they will be re-fetched on the next run.

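A sketch of the merge-and-trim rule (assumptions: dict insertion order reflects age, and incomplete entries survive trimming so they can be retried; this is not the committed implementation):

```python
def merge_state(old: dict, new: dict, cap: int = 30) -> dict:
    """Merge one run's results into prior state, capping history at `cap` entries."""
    # Deduplicate while preserving order: oldest IDs first, newest last.
    ids = list(dict.fromkeys(old.get("posted_video_ids", []) + new.get("posted_video_ids", [])))
    status = {**old.get("video_status", {}), **new.get("video_status", {})}
    ids = ids[-cap:]  # keep only the newest `cap` posted IDs
    # Trim the oldest *completed* status entries down to the cap;
    # never drop incomplete ones, since they drive resume-on-retry.
    excess = len(status) - cap
    if excess > 0:
        for vid in [k for k, v in status.items() if v.get("completed")][:excess]:
            del status[vid]
    return {"posted_video_ids": ids, "video_status": status}
```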
## Error Handling Summary

@@ -215,3 +256,4 @@ mkdir -p .state
- B2 upload fails → post without download links, files stay local in `digest-yt/`
- Slack fails → retry once, then save to `.state/failed-digest-yt-{{date}}.md`
- State always saved even on partial failure
- **Resume on retry**: incomplete videos stay in `video_status` but not in `posted_video_ids` → re-fetched next run → completed steps skipped via `video_status.steps` booleans

CLAUDE.md

Lines changed: 4 additions & 1 deletion
@@ -63,6 +63,9 @@ digest-yt/YYYY-MM-DD/ — Local output (gitignored)
## State Tracking

- `.state/last-digest-news.json` tracks posted URLs for deduplication (news digest)
-- `.state/last-digest-yt.json` tracks posted video IDs for deduplication (YT digest)
+- `.state/last-digest-yt.json` tracks posted video IDs + per-video step completion for deduplication and resume (YT digest)
+  - `posted_video_ids`: videos fully completed (all steps including slack_post)
+  - `video_status`: per-video step booleans (transcript, thumbnail, summary, html, pdf, b2_upload, slack_post) — enables resume on partial failure
+  - Incomplete videos stay in `video_status` but NOT in `posted_video_ids` → re-fetched on retry
- Keep max 30 entries (current + previous digest)
- State directory is gitignored

scripts/yt/fetch_recent_videos.py

Lines changed: 11 additions & 5 deletions
@@ -6,7 +6,7 @@
python scripts/yt/fetch_recent_videos.py [--channel URL] [--hours 24] [--state PATH]

Outputs JSON array to stdout:
-[{"id": "abc123", "title": "Video Title", "upload_date": "20260311"}, ...]
+[{"id": "abc123", "title": "Video Title", "upload_date": "20260311", "modified_date": "20260312"}, ...]
"""

import argparse
@@ -27,7 +27,7 @@ def fetch_channel_videos(channel_url: str, max_items: int = 10) -> list[dict]:
    cmd = [
        "yt-dlp",
        "--playlist-items", f"1:{max_items}",
-        "--print", "%(id)s\t%(title)s\t%(upload_date)s\t%(thumbnail)s",
+        "--print", "%(id)s\t%(title)s\t%(upload_date)s\t%(modified_date)s\t%(thumbnail)s",
        "--skip-download",
        f"{channel_url}/videos",
    ]
@@ -38,14 +38,20 @@ def fetch_channel_videos(channel_url: str, max_items: int = 10) -> list[dict]:

    videos = []
    for line in result.stdout.strip().splitlines():
-        parts = line.split("\t", 3)
+        parts = line.split("\t", 4)
        if len(parts) >= 2:
            vid_id = parts[0]
            upload = parts[2] if len(parts) >= 3 else "NA"
            modified = parts[3] if len(parts) >= 4 else "NA"
            # Treat modified_date as null if same as upload_date or unavailable
            if modified in ("NA", "", upload):
                modified = None
            vid = {
                "id": vid_id,
                "title": parts[1],
-                "upload_date": parts[2] if len(parts) >= 3 else "NA",
-                "thumbnail": parts[3] if len(parts) >= 4 and parts[3] != "NA"
+                "upload_date": upload,
+                "modified_date": modified,
+                "thumbnail": parts[4] if len(parts) >= 5 and parts[4] != "NA"
                    else f"https://i.ytimg.com/vi/{vid_id}/hqdefault.jpg",
            }
            videos.append(vid)
