Transcript correction #3

Open
IvoTomasovich wants to merge 2 commits into tapilab:main from IvoTomasovich:transcript-correction

@IvoTomasovich

Adds a three-stage fuzzy correction step between transcription and summarization that fixes common Whisper misspellings of New Orleans names and streets.

New ETL step: correct_transcript.py

Reads raw .json transcripts from BOX_PATH

Corrects the text field in each segment using three stages: hardcoded fixes, fuzzy street matching, fuzzy name matching

Writes corrected files to BOX_PATH/corrected_transcripts/

Skips files already corrected; originals are never modified

No LLM calls — fully deterministic fuzzy string matching
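
The three stages listed above could be sketched roughly as follows. This is only an illustration, not the code in this PR: it assumes the standard-library `difflib` for the fuzzy matching (the PR may use a different matcher), and the dictionaries, word lists, function names, and the 0.8 cutoff are all hypothetical placeholders.

```python
import difflib
import re

# Hypothetical example data; the real lists used by the PR are not shown here.
HARDCODED_FIXES = {"Chapatoulas": "Tchoupitoulas"}
KNOWN_STREETS = ["Tchoupitoulas", "Carondelet", "Claiborne"]
KNOWN_NAMES = ["Moreno", "Morrell", "Giarrusso"]


def apply_hardcoded(text, fixes=HARDCODED_FIXES):
    """Stage 1: exact whole-word replacements for known misspellings."""
    for wrong, right in fixes.items():
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text)
    return text


def fuzzy_replace(text, vocabulary, cutoff=0.8):
    """Stages 2 and 3: snap capitalized words to the closest vocabulary entry.

    Only capitalized tokens of five or more letters are considered, so that
    ordinary lowercase words are never rewritten (an assumed heuristic).
    """
    def fix(match):
        word = match.group(0)
        if word in vocabulary:
            return word
        close = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        return close[0] if close else word

    return re.sub(r"\b[A-Z][a-z]{4,}\b", fix, text)


def correct_text(text):
    """Run the three stages in order: hardcoded, streets, names."""
    text = apply_hardcoded(text)
    text = fuzzy_replace(text, KNOWN_STREETS)
    text = fuzzy_replace(text, KNOWN_NAMES)
    return text
```

Because `difflib` scoring involves no randomness, the same input always produces the same output, which matches the deterministic, no-LLM design described above. The cutoff trades missed corrections against false replacements and would need tuning on real transcripts.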

Pipeline order:

transcribe_council.py → correct_transcript.py → summarize.py → vectorize.py
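
For reviewers, here is a minimal sketch of how the file-level behaviour described above (read from BOX_PATH, write to corrected_transcripts/, skip files already corrected, never touch originals) might look. It assumes each transcript is a JSON object with a top-level `segments` list whose items carry a `text` field; the function name `correct_all` and its parameters are hypothetical and not taken from the PR.

```python
import json
from pathlib import Path


def correct_all(box_path, correct_text):
    """Write corrected copies of every raw transcript under box_path.

    `correct_text` is any str -> str function, for example a fuzzy corrector.
    Returns the list of newly written output paths.
    """
    box = Path(box_path)
    out_dir = box / "corrected_transcripts"
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    # Non-recursive glob, so files inside corrected_transcripts/ are ignored.
    for src in sorted(box.glob("*.json")):
        dst = out_dir / src.name
        if dst.exists():
            continue  # already corrected on a previous run
        data = json.loads(src.read_text(encoding="utf-8"))
        for segment in data.get("segments", []):  # assumed JSON layout
            segment["text"] = correct_text(segment["text"])
        # Only the copy is written; the original file is left as-is.
        dst.write_text(json.dumps(data, ensure_ascii=False, indent=2),
                       encoding="utf-8")
        written.append(dst)
    return written
```

With this shape, rerunning the step is cheap and safe: a second call finds every output already present and writes nothing, so summarize.py can read from corrected_transcripts/ without caring how many times the correction step ran.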

@aronwc
Member

aronwc commented Mar 11, 2026

@IvoTomasovich I have merged Emma's PR. Please pull the latest code, merge with yours, then resubmit the PR.
