Commit cd1fb9f

ringgerclaude committed:

Update README for default diarization and new summarization step

- Diarization is now on by default (`--no-diarize` to skip)
- Document new summarization step with separate LLM backend options
- Add summarization examples, output files, pipeline stage, and cost estimates

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

1 parent 26c1b93

1 file changed: +43 −12 lines


README.md

Lines changed: 43 additions & 12 deletions
@@ -21,7 +21,8 @@ The approach applies principles from [textual criticism](https://en.wikipedia.or
 - **Checkpoint resumption**: Long operations save checkpoints and resume after interruption — merge chunks, diarization segmentation, and embedding extraction all checkpoint independently
 - **Cost estimation**: Shows estimated API costs before running (`--dry-run` for estimation only)
 - **Local-first LLM**: Uses Ollama by default for free, local operation — no API key needed
-- **Speaker diarization**: Identifies who is speaking using pyannote.audio, with automatic or manual speaker naming — LLM speaker identification uses video metadata (title, description) for correct name spellings
+- **Speaker diarization**: On by default — identifies who is speaking using pyannote.audio, with automatic or manual speaker naming — LLM speaker identification uses video metadata (title, description) for correct name spellings
+- **Transcript summarization**: Generates a structured summary (overview, key points, speakers, notable quotes) using an independently configurable LLM — can use a different model/backend than the adjudication LLM
 - **Timestamped logging**: All pipeline output prefixed with `[HH:MM:SS]` wall-clock timestamps for log correlation during long runs
 - **Whisper-only mode**: `--no-llm` to skip all LLM features and run Whisper only
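The checkpoint-resumption feature in the list above can be illustrated with a small sketch: each unit of work writes a checkpoint file when it finishes, and a restarted run loads existing checkpoints instead of redoing the work. This is an illustrative sketch of the idea only, not the project's actual code; the directory layout and `chunk.upper()` stand-in are assumptions.

```python
import json
from pathlib import Path

def process_chunks(chunks, ckpt_dir):
    """Process chunks, skipping any whose checkpoint file already exists,
    so an interrupted run resumes where it left off."""
    ckpt = Path(ckpt_dir)
    ckpt.mkdir(parents=True, exist_ok=True)
    results = []
    for i, chunk in enumerate(chunks):
        f = ckpt / f"chunk_{i:04d}.json"
        if f.exists():
            # Checkpoint found: reuse the saved result.
            results.append(json.loads(f.read_text()))
            continue
        result = chunk.upper()  # stand-in for the real per-chunk work
        f.write_text(json.dumps(result))
        results.append(result)
    return results
```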

@@ -85,16 +86,20 @@ transcribe-critic --podcast "https://podcasts.apple.com/us/podcast/..."
 
 ### Speaker Diarization
 
+Diarization is on by default. It requires `pyannote.audio` and a HuggingFace token:
+
 ```bash
-# Identify who is speaking (requires pyannote.audio and HF_TOKEN)
 pip install transcribe-critic[diarize]
 export HF_TOKEN="hf_..." # HuggingFace token with pyannote model access
 
 # Auto-detect speaker names from introductions
-transcribe-critic --diarize --num-speakers 2 --podcast "https://..."
+transcribe-critic --num-speakers 2 --podcast "https://..."
 
 # Manual speaker names (in order of first appearance)
-transcribe-critic --diarize --speaker-names "Ross Douthat,Dario Amodei" --podcast "https://..."
+transcribe-critic --speaker-names "Ross Douthat,Dario Amodei" --podcast "https://..."
+
+# Disable diarization
+transcribe-critic --no-diarize "https://..."
 ```
 
 ### Speech-Only (No Slides)
@@ -150,13 +155,36 @@ transcribe-critic "https://youtube.com/watch?v=..." --steps transcribe,merge -o
 transcribe-critic "https://youtube.com/watch?v=..." -v
 ```
 
+### Summarization
+
+Summarization runs by default after markdown generation, producing a `summary.md`. It uses the diarized transcript when available (for speaker-aware summaries), falling back to the merged or Whisper transcript.
+
+```bash
+# Default: summarize with the same LLM as adjudication (local Ollama)
+transcribe-critic "https://youtube.com/watch?v=..."
+
+# Use a different model for summaries (e.g., Opus for summaries, Sonnet for adjudication)
+transcribe-critic "https://youtube.com/watch?v=..." --api \
+  --summary-model claude-opus-4-20250514
+
+# Use Claude API for summaries even when adjudication uses local Ollama
+transcribe-critic "https://youtube.com/watch?v=..." \
+  --summary-api --summary-model claude-sonnet-4-20250514
+
+# Re-run just the summarization step
+transcribe-critic "https://youtube.com/watch?v=..." --steps summarize -o ./my_transcript
+
+# Disable summarization
+transcribe-critic "https://youtube.com/watch?v=..." --no-summarize
+```
+
 ## Output Files
 
 ```
 output_dir/
 ├── metadata.json # Source URL, title, duration, etc.
 ├── audio.mp3 # Downloaded audio
-├── audio.wav # Converted for diarization (if --diarize)
+├── audio.wav # Converted for diarization (default; skipped with --no-diarize)
 ├── video.mp4 # Downloaded video (if slides enabled)
 ├── captions.en.vtt # YouTube captions (if available)
 ├── whisper_small.txt # Whisper small transcript
@@ -166,11 +194,12 @@ output_dir/
 ├── whisper_distil-large-v3.txt # Whisper distil-large-v3 transcript
 ├── whisper_distil-large-v3.json # Whisper distil-large-v3 with timestamps
 ├── whisper_merged.txt # Merged from multiple Whisper models via adjudication
-├── diarization.json # Speaker segments (if --diarize)
-├── diarization_segmentation.npy # Cached segmentation (if --diarize)
-├── diarization_embeddings.npy # Cached embeddings (if --diarize)
-├── diarized.txt # Speaker-labeled transcript (if --diarize)
+├── diarization.json # Speaker segments (default; skipped with --no-diarize)
+├── diarization_segmentation.npy # Cached segmentation (default; skipped with --no-diarize)
+├── diarization_embeddings.npy # Cached embeddings (default; skipped with --no-diarize)
+├── diarized.txt # Speaker-labeled transcript (default; skipped with --no-diarize)
 ├── transcript_merged.txt # Critical text (merged from all sources)
+├── summary.md # Transcript summary (structured Markdown)
 ├── analysis.md # Source survival analysis
 ├── transcript.md # Final markdown output
 ├── merge_chunks/ # Per-chunk checkpoints (resumable)
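The summarization fallback described in the README text (diarized transcript first, then the merged critical text, then a Whisper transcript) could be sketched as follows. The file names come from the Output Files listing above; the helper function itself is hypothetical, not the project's actual implementation.

```python
from pathlib import Path
from typing import Optional

# Preference order per the README: speaker-labeled transcript first,
# then the merged critical text, then the ensemble Whisper merge.
CANDIDATES = ["diarized.txt", "transcript_merged.txt", "whisper_merged.txt"]

def pick_summary_source(output_dir: str) -> Optional[Path]:
    """Return the most-preferred transcript file that exists, or None."""
    for name in CANDIDATES:
        path = Path(output_dir) / name
        if path.exists():
            return path
    return None
```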
@@ -193,14 +222,15 @@ Optional stages are skipped based on flags. Stage numbers are fixed regardless o
 | [1] Download media | `download` | yt-dlp | No |
 | [2] Transcribe audio | `transcribe` | mlx-whisper | No |
 | [2b] Whisper ensemble | `ensemble` | LLM + wdiff | Yes (on by default with 2+ models; default: 3 models) |
-| [2c] Speaker diarization | `diarize` | pyannote.audio | Yes (`--diarize`) |
+| [2c] Speaker diarization | `diarize` | pyannote.audio | Yes (on by default; `--no-diarize` to skip) |
 | [3] Extract slides | `slides` | ffmpeg | Yes (skipped with `--no-slides` / `--podcast`) |
 | [4] Analyze slides with vision | `slides` | LLM + vision | Yes (`--analyze-slides`) |
 | [4b] Merge transcript sources | `merge` | LLM + wdiff | Yes (on by default; `--no-merge` to skip) |
 | [5] Generate markdown | `markdown` | Python | No |
+| [5b] Summarize transcript | `summarize` | LLM | Yes (on by default; `--no-summarize` to skip) |
 | [6] Source survival analysis | `analysis` | wdiff | No |
 
-Use `--steps <step1>,<step2>,...` to run only specific stages. Existing outputs from skipped stages are loaded automatically. This is useful for re-running just the ensemble or merge after fixing a bug, without re-downloading or re-transcribing.
+Use `--steps <step1>,<step2>,...` to run only specific stages. Existing outputs from skipped stages are loaded automatically. This is useful for re-running just the ensemble, merge, or summarize after fixing a bug, without re-downloading or re-transcribing.
 
 ## How It Works
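The `--steps` behavior described in the hunk above could be parsed along these lines. The step names are taken from the pipeline-stages table; the parser itself is an illustrative sketch, not the project's actual CLI code.

```python
# Step names from the pipeline-stages table (hypothetical validation helper).
KNOWN_STEPS = {"download", "transcribe", "ensemble", "diarize", "slides",
               "merge", "markdown", "summarize", "analysis"}

def parse_steps(arg: str) -> list:
    """Split a --steps value like 'transcribe,merge' and validate names."""
    steps = [s.strip() for s in arg.split(",") if s.strip()]
    unknown = sorted(set(steps) - KNOWN_STEPS)
    if unknown:
        raise ValueError(f"unknown steps: {', '.join(unknown)}")
    return steps
```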

@@ -262,7 +292,7 @@ This targeted diff resolution avoids the problems of full-text rewriting (chunk-
 
 ### Speaker Diarization
 
-When `--diarize` is enabled, the pipeline identifies who is speaking at each point in the audio by combining two independent signals:
+By default, the pipeline identifies who is speaking at each point in the audio by combining two independent signals:
 
 1. **pyannote.audio** runs a neural segmentation model over the audio in sliding ~5-second windows, producing frame-level speaker activity probabilities. A global clustering step stitches local predictions across the full recording into consistent speaker labels (SPEAKER_00, SPEAKER_01, etc.). The model handles overlapping speech natively and operates purely on the audio signal — no linguistic content is used.
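One way to combine diarization output with a transcript, as a minimal sketch: give each transcript segment the speaker whose diarization turns overlap it most. The data shapes (tuples of start/end seconds) are assumptions for illustration, not the pipeline's actual structures.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the overlap between two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def label_segment(segment, turns):
    """Assign a transcript segment (start, end) the diarization speaker
    whose turns overlap it most; turns are (start, end, speaker) tuples."""
    totals = {}
    for start, end, speaker in turns:
        totals[speaker] = totals.get(speaker, 0.0) + overlap(
            segment[0], segment[1], start, end)
    if not totals or max(totals.values()) == 0.0:
        return None  # no speech activity overlaps this segment
    return max(totals, key=totals.get)
```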

@@ -301,6 +331,7 @@ ESTIMATED API COSTS
 | Whisper ensemble | $0.05–$0.15 | $0.50–$1.00 |
 | Source merging (2 sources) | $0.10–$0.30 | $0.50–$1.00 |
 | Source merging (3 sources) | $0.15–$0.40 | $1.00–$2.00 |
+| Summarization | $0.01–$0.05 | $0.05–$0.15 |
 | Slide analysis | $0.50–$2.00 | N/A |
 | Local Ollama (default) | **Free** | **Free** |
 | `--no-llm` | **Free** | **Free** |
