
Commit 8e82653

ringger and claude committed

Bump version to 1.1.0, update README for 3-way ensemble, fix license format

- Update README: 3-way default, anti-hallucination flags, distil-large-v3, A/B/C adjudication, Distil-Whisper acknowledgment
- Bump version to 1.1.0 in pyproject.toml and __init__.py (were inconsistent)
- Fix license field to SPDX string format (was deprecated table format)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

1 parent 4862640 commit 8e82653

File tree

3 files changed: +20 −16 lines


README.md

Lines changed: 16 additions & 12 deletions
@@ -12,8 +12,8 @@ The approach applies principles from [textual criticism](https://en.wikipedia.or
 
 - **Critical text merging**: Combines 2–3+ transcript sources into the most accurate version using blind, anonymous presentation to an LLM — no source receives preferential treatment
 - **wdiff-based alignment**: Uses longest common subsequence alignment (via `wdiff`) to keep chunks properly aligned across sources of different lengths, replacing naive proportional slicing
-- **Multi-model Whisper ensembling**: Runs multiple Whisper models (e.g., small + medium) and resolves disagreements via LLM
-- **Hallucination detection**: Automatically detects and collapses Whisper repetition loops (e.g., a phrase repeated 60+ times) in both raw outputs and merged transcripts
+- **Multi-model Whisper ensembling**: Runs multiple Whisper models (default: small + medium + distil-large-v3) and resolves disagreements via LLM with anonymous A/B/C labels
+- **Anti-hallucination**: Whisper runs use `condition_on_previous_text=False` and other flags to prevent cascading hallucination; residual repetition loops are automatically detected and collapsed
 - **External transcript support**: Merges in human-edited transcripts (e.g., from publisher websites) as an additional source
 - **Structured transcript preservation**: When external transcripts have speaker labels and timestamps, the merged output preserves that structure
 - **Slide extraction and analysis**: Automatic scene detection for presentation slides, with optional vision API descriptions
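The repetition-loop collapsing mentioned in the new "Anti-hallucination" bullet could be sketched roughly as below. This is hypothetical illustration code, not taken from the repository; the function name and thresholds are assumptions:

```python
def collapse_repetition_loops(text: str, max_phrase_words: int = 5,
                              min_repeats: int = 4) -> str:
    """Collapse a phrase repeated many times in a row down to one copy.

    Whisper hallucination loops often look like 'thank you. thank you. ...'
    repeated dozens of times; keeping a single copy preserves the real content.
    """
    words = text.split()
    out = []
    i = 0
    while i < len(words):
        collapsed = False
        # Try phrase lengths from 1 to max_phrase_words at this position.
        for n in range(1, max_phrase_words + 1):
            phrase = words[i:i + n]
            if len(phrase) < n:
                continue
            repeats = 1
            j = i + n
            while words[j:j + n] == phrase:
                repeats += 1
                j += n
            if repeats >= min_repeats:
                out.extend(phrase)  # keep one copy of the looped phrase
                i = j
                collapsed = True
                break
        if not collapsed:
            out.append(words[i])
            i += 1
    return " ".join(out)
```

A real detector would likely also normalize case and punctuation before comparing, but the core idea is the same: find a short n-gram repeated back-to-back past a threshold and splice in a single copy.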
@@ -125,8 +125,8 @@ transcribe-critic "https://youtube.com/watch?v=..." --analyze-slides
 # Custom output directory
 transcribe-critic "https://youtube.com/watch?v=..." -o ./my_transcript
 
-# Use specific Whisper models
-transcribe-critic "https://youtube.com/watch?v=..." --whisper-models large
+# Use specific Whisper models (default: small,medium,distil-large-v3)
+transcribe-critic "https://youtube.com/watch?v=..." --whisper-models medium,distil-large-v3
 
 # Use a different local model
 transcribe-critic "https://youtube.com/watch?v=..." --local-model llama3.3
@@ -163,6 +163,8 @@ output_dir/
 ├── whisper_small.json # Whisper small with timestamps
 ├── whisper_medium.txt # Whisper medium transcript
 ├── whisper_medium.json # Whisper medium with timestamps
+├── whisper_distil-large-v3.txt # Whisper distil-large-v3 transcript
+├── whisper_distil-large-v3.json # Whisper distil-large-v3 with timestamps
 ├── whisper_merged.txt # Merged from multiple Whisper models via adjudication
 ├── diarization.json # Speaker segments (if --diarize)
 ├── diarization_segmentation.npy # Cached segmentation (if --diarize)
@@ -190,7 +192,7 @@ Optional stages are skipped based on flags. Stage numbers are fixed regardless o
 |-------|-----------|------|----------|
 | [1] Download media | `download` | yt-dlp | No |
 | [2] Transcribe audio | `transcribe` | mlx-whisper | No |
-| [2b] Whisper ensemble | `ensemble` | LLM + wdiff | Yes (on by default with 2+ models) |
+| [2b] Whisper ensemble | `ensemble` | LLM + wdiff | Yes (on by default with 2+ models; default: 3 models) |
 | [2c] Speaker diarization | `diarize` | pyannote.audio | Yes (`--diarize`) |
 | [3] Extract slides | `slides` | ffmpeg | Yes (skipped with `--no-slides` / `--podcast`) |
 | [4] Analyze slides with vision | `slides` | LLM + vision | Yes (`--analyze-slides`) |
@@ -247,13 +249,14 @@ Each source alone gets some things right and others wrong. Whisper hallucinates
 
 ### Multi-Model Whisper Merging
 
-When using multiple Whisper models (default: `small,medium`):
+When using multiple Whisper models (default: `small,medium,distil-large-v3`):
 
-1. Runs each model independently
-2. Uses `wdiff` to identify specific word-level differences (normalized: no caps, no punctuation)
-3. Clusters nearby differences and presents each cluster to an LLM with anonymous labels ("A" / "B") and surrounding context
-4. The LLM picks A or B for each disagreement — constrained to choose between actual transcriptions, preventing hallucinated text
-5. Chosen readings are surgically applied to the base transcript, leaving uncontested regions untouched
+1. Runs each model independently with anti-hallucination flags
+2. Uses `wdiff` to identify specific word-level differences between each non-base model and the base (largest model)
+3. For 3+ models, merges pairwise diffs at the same positions into unified diffs with per-model readings
+4. Clusters nearby differences and presents each cluster to an LLM with anonymous labels (A/B or A/B/C) and surrounding context — model names are never revealed
+5. The LLM picks a letter for each disagreement — constrained to choose between actual transcriptions, preventing hallucinated text
+6. Chosen readings are surgically applied to the base transcript, leaving uncontested regions untouched
 
 This targeted diff resolution avoids the problems of full-text rewriting (chunk-boundary duplication, errors in uncontested regions, wasted tokens). The implementation runs Whisper-vs-Whisper adjudication first to produce a single merged Whisper witness (`whisper_merged.txt`), which then enters the multi-source merge alongside captions and external transcripts.
 
@@ -285,7 +288,7 @@ Every stage checks `is_up_to_date(output, *inputs)` — if the output file is ne
 ESTIMATED API COSTS
 ==================================================
 Source merging: 3 sources × 59 chunks = $1.03
-Whisper ensemble: 2 models × 98 clusters = $0.72
+Whisper ensemble: 3 models × 98 clusters = $0.72
 
 TOTAL: $1.95 (estimate)
 ==================================================
@@ -366,6 +369,7 @@ MIT
 ## Acknowledgments
 
 - [OpenAI Whisper](https://github.com/openai/whisper) — Speech recognition
+- [Distil-Whisper](https://github.com/huggingface/distil-whisper) — Distilled large-v3 model (faster, fewer hallucinations)
 - [MLX Whisper](https://github.com/ml-explore/mlx-examples) — Apple Silicon optimization
 - [yt-dlp](https://github.com/yt-dlp/yt-dlp) — Media downloading
 - [Anthropic Claude](https://www.anthropic.com/) — LLM-based adjudication and vision analysis

pyproject.toml

Lines changed: 3 additions & 3 deletions
@@ -4,16 +4,16 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "transcribe-critic"
-version = "1.0.0"
+version = "1.1.0"
 description = "Multi-source speech transcription with LLM-based adjudication inspired by textual criticism"
 readme = "README.md"
-license = {file = "LICENSE"}
+license = "MIT"
+license-files = ["LICENSE"]
 requires-python = ">=3.10"
 keywords = ["transcription", "whisper", "speech", "llm", "diarization"]
 classifiers = [
     "Development Status :: 4 - Beta",
     "Intended Audience :: Science/Research",
-    "License :: OSI Approved :: MIT License",
     "Programming Language :: Python :: 3",
     "Programming Language :: Python :: 3.10",
     "Programming Language :: Python :: 3.11",

src/transcribe_critic/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.1.0"
+__version__ = "1.1.0"
