Update README for default diarization and new summarization step
- Diarization is now on by default (--no-diarize to skip)
- Document new summarization step with separate LLM backend options
- Add summarization examples, output files, pipeline stage, and cost estimates
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
README.md
- **Checkpoint resumption**: Long operations save checkpoints and resume after interruption — merge chunks, diarization segmentation, and embedding extraction all checkpoint independently
- **Cost estimation**: Shows estimated API costs before running (`--dry-run` for estimation only)
- **Local-first LLM**: Uses Ollama by default for free, local operation — no API key needed
- **Speaker diarization**: On by default — identifies who is speaking using pyannote.audio, with automatic or manual speaker naming — LLM speaker identification uses video metadata (title, description) for correct name spellings
- **Transcript summarization**: Generates a structured summary (overview, key points, speakers, notable quotes) using an independently configurable LLM — can use a different model/backend than the adjudication LLM
- **Timestamped logging**: All pipeline output prefixed with `[HH:MM:SS]` wall-clock timestamps for log correlation during long runs
- **Whisper-only mode**: `--no-llm` to skip all LLM features and run Whisper only
Summarization runs by default after markdown generation, producing a `summary.md`. It uses the diarized transcript when available (for speaker-aware summaries), falling back to the merged or Whisper transcript.
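The fallback order above can be sketched in Python as follows. The file names and function name here are illustrative, not the pipeline's actual output paths:

```python
from pathlib import Path

def pick_transcript(workdir: Path) -> Path:
    # Prefer the diarized transcript (speaker-aware), then the merged
    # transcript, then the raw Whisper output.
    for name in ("transcript_diarized.md",
                 "transcript_merged.md",
                 "transcript_whisper.md"):
        candidate = workdir / name
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f"no transcript found in {workdir}")
```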

```bash
# Default: summarize with the same LLM as adjudication (local Ollama)
```

| [4b] Merge transcript sources | `merge` | LLM + wdiff | Yes (on by default; `--no-merge` to skip) |
| [5] Generate markdown | `markdown` | Python | No |
| [5b] Summarize transcript | `summarize` | LLM | Yes (on by default; `--no-summarize` to skip) |
| [6] Source survival analysis | `analysis` | wdiff | No |

Use `--steps <step1>,<step2>,...` to run only specific stages. Existing outputs from skipped stages are loaded automatically. This is useful for re-running just the ensemble, merge, or summarize after fixing a bug, without re-downloading or re-transcribing.

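The stage-selection behavior can be sketched as below. The full stage list and function name are assumptions for illustration; only the stage names shown in the table above come from the pipeline itself:

```python
PIPELINE_STAGES = ["download", "transcribe", "ensemble", "merge",
                   "markdown", "summarize", "analysis"]

def select_stages(steps_arg=None, all_stages=PIPELINE_STAGES):
    # No --steps given: run every stage.
    if not steps_arg:
        return list(all_stages)
    requested = {s.strip() for s in steps_arg.split(",")}
    unknown = requested - set(all_stages)
    if unknown:
        raise ValueError(f"unknown steps: {', '.join(sorted(unknown))}")
    # Preserve pipeline order regardless of the order given on the CLI.
    return [s for s in all_stages if s in requested]
```

Note that the result keeps pipeline order, so `--steps summarize,merge` still runs merge before summarize.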
## How It Works
### Speaker Diarization

By default, the pipeline identifies who is speaking at each point in the audio by combining two independent signals:

1. **pyannote.audio** runs a neural segmentation model over the audio in sliding ~5-second windows, producing frame-level speaker activity probabilities. A global clustering step stitches local predictions across the full recording into consistent speaker labels (SPEAKER_00, SPEAKER_01, etc.). The model handles overlapping speech natively and operates purely on the audio signal — no linguistic content is used.
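The global clustering idea can be illustrated with a toy greedy cosine-similarity clusterer — a simplified stand-in for pyannote.audio's actual clustering, assuming per-window speaker embeddings as input; the threshold and logic are illustrative:

```python
import numpy as np

def cluster_speakers(embeddings, threshold=0.7):
    # Greedily assign each window embedding to the most similar existing
    # cluster, or open a new cluster when nothing is similar enough.
    centroids, labels = [], []
    for e in embeddings:
        e = e / np.linalg.norm(e)
        sims = [float(c @ e) for c in centroids]
        if sims and max(sims) >= threshold:
            labels.append(int(np.argmax(sims)))
        else:
            centroids.append(e)
            labels.append(len(centroids) - 1)
    # Emit labels in the SPEAKER_00 / SPEAKER_01 convention.
    return [f"SPEAKER_{i:02d}" for i in labels]
```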