
Commit d4b23d0

ringger and claude committed
Extract merge logic into separate modules, add SpeechPP references
Split codebase into three modules to improve organization:
- shared.py: SpeechConfig, SpeechData, api_call_with_retry, is_up_to_date
- merge.py: all merge/alignment logic (~20 functions)
- transcriber.py: pipeline orchestration, download, transcribe, slides, CLI

Add SpeechPP (Ringger & Allen, 1996) references to README Background section, completing the lineage from speech post-processing through OCR multi-engine correction to LLM-based transcript merging.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent a42b160 commit d4b23d0

5 files changed: +1008 -956 lines changed

README.md

Lines changed: 3 additions & 2 deletions
@@ -204,12 +204,13 @@ ESTIMATED API COSTS

This tool applies the principles of [textual criticism](https://en.wikipedia.org/wiki/Textual_criticism) — the scholarly discipline of comparing multiple manuscript witnesses to reconstruct an authoritative text — to the problem of speech transcription.

-The approach has roots in earlier work on OCR error correction using multiple engine outputs:
+The approach has roots in earlier work applying noisy-channel models and multi-source correction to speech and OCR:

+- **Ringger & Allen (1996)**[Error Correction via a Post-Processor for Continuous Speech Recognition](https://www.researchgate.net/publication/2321329_Error_Correction_Via_A_Post-Processor_For_Continuous_Speech_Recognition) (ICASSP). Introduced SpeechPP, a noisy-channel post-processor that corrects ASR output using language and channel models with Viterbi beam search, developed as part of the [TRAINS/TRIPS](https://www.cs.rochester.edu/research/trains/) spoken dialogue systems at the University of Rochester. Extended with a fertility channel model in [Ringger & Allen, ICSLP 1996](https://scholarsarchive.byu.edu/facpub/1288/).
- **Ringger & Lund (2014)**[How Well Does Multiple OCR Error Correction Generalize?](https://scholarsarchive.byu.edu/facpub/1647/) Demonstrated that aligning and merging outputs from multiple OCR engines significantly reduces word error rates.
- **Lund et al. (2013)**[Error Correction with In-Domain Training Across Multiple OCR System Outputs](https://www.researchgate.net/publication/220861175_Error_Correction_with_In-Domain_Training_Across_Multiple_OCR_System_Outputs). Used A* alignment and trained classifiers (CRFs, MaxEnt) to choose the best reading from multiple OCR witnesses — a 52% relative decrease in word error rate.

-This tool replaces the trained classifiers with an LLM, which brings world knowledge and contextual reasoning without requiring task-specific training data. The blind/anonymous presentation of sources is borrowed from peer review and prevents the LLM from developing source-level biases.
+The OCR work used A* alignment because page layout provides natural line boundaries, making alignment a series of short, bounded search problems. Speech has no such boundaries — different ASR systems segment a continuous audio stream arbitrarily — so this tool uses `wdiff` (LCS-based global alignment) instead. It also replaces the trained classifiers with an LLM, which brings world knowledge and contextual reasoning without requiring task-specific training data. The blind/anonymous presentation of sources is borrowed from peer review and prevents the LLM from developing source-level biases.

Related work in speech:
- **ROVER** ([Fiscus, 1997](https://ieeexplore.ieee.org/document/659110/)) — Statistical voting across multiple ASR outputs via word transition networks
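
The README paragraph added in this hunk describes `wdiff` as an LCS-based global alignment over the whole transcript. As a rough illustration of that idea only, the sketch below aligns two word sequences along a longest common subsequence in pure Python. It is not the code in `merge.py` and it does not call `wdiff`; the function name `lcs_align` and the sample transcripts are hypothetical.

```python
from typing import List, Optional, Tuple

# One aligned position: (word from transcript A, word from transcript B).
# None marks a gap, i.e. a word present in only one transcript.
Pair = Tuple[Optional[str], Optional[str]]


def lcs_align(a: List[str], b: List[str]) -> List[Pair]:
    """Globally align two word lists along a longest common subsequence.

    Hypothetical helper for illustration; the real tool shells out to wdiff.
    """
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk forward through the table, emitting matches and gaps.
    pairs: List[Pair] = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((a[i], b[j]))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            pairs.append((a[i], None))  # word only in A
            i += 1
        else:
            pairs.append((None, b[j]))  # word only in B
            j += 1
    # Whatever remains on either side is unmatched tail.
    pairs.extend((w, None) for w in a[i:])
    pairs.extend((None, w) for w in b[j:])
    return pairs


if __name__ == "__main__":
    source_a = "the quick brown fox".split()
    source_b = "the quack brown fox jumps".split()
    for left, right in lcs_align(source_a, source_b):
        print(f"{left or '-':>8} | {right or '-'}")
```

Positions where both sides agree act as anchors, and the gaps between anchors are the disagreements that a downstream chooser (the LLM, in this tool) would have to resolve. The table is quadratic in transcript length, so this is a teaching sketch, not something tuned for hour-long recordings.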
