Indexing and Formatting Special Phrases (e.g. Bible Chapter & Verse, Journal References) #422

turnkit · 2022-10-26T17:59:37Z

I can see Whisper being in great demand for weekly Sunday church services.

I haven't checked yet, but how does it handle Christian Bible book and verse references? Do they show up as "John three sixteen" or "John 3:16"? The latter format is the standard way to write such a reference.

This is a specific use case, but one that would likely be widely used. Could we add an option that looks for a known book name followed by two numbers and prefers the standard numerical representation with a colon between them?
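As a rough illustration of the kind of rule meant here, the following is a minimal post-processing sketch, not an existing Whisper option. The names `BOOKS`, `REF`, and `normalize_refs` are hypothetical, the book list is a small illustrative subset, and only spoken numbers from 1 to 99 are handled.

```python
import re

# Illustrative subset only; a real list would hold every book name in use.
BOOKS = ["Genesis", "Psalms", "Matthew", "Mark", "Luke", "John", "Romans"]
CANON = {b.lower(): b for b in BOOKS}

ONES = "one two three four five six seven eight nine".split()
TEENS = ("ten eleven twelve thirteen fourteen fifteen "
         "sixteen seventeen eighteen nineteen").split()
TENS = "twenty thirty forty fifty sixty seventy eighty ninety".split()

# Map each number word to its value (1-9, 10-19, 20-90).
VALUE = {w: i + 1 for i, w in enumerate(ONES)}
VALUE.update({w: i + 10 for i, w in enumerate(TEENS)})
VALUE.update({w: (i + 2) * 10 for i, w in enumerate(TENS)})

# One spoken number from 1 to 99, e.g. "three", "sixteen", "twenty-one".
NUM = r"(?:(?:{t})(?:[- ](?:{o}))?|{n}|{o})".format(
    t="|".join(TENS), n="|".join(TEENS), o="|".join(ONES))

# A book name followed by exactly two spoken numbers.
REF = re.compile(
    r"\b({b}) ({n}) ({n})\b".format(
        b="|".join(re.escape(b) for b in BOOKS), n=NUM),
    re.IGNORECASE)


def to_int(phrase):
    """Sum the values of the number words in a phrase like 'twenty-one'."""
    return sum(VALUE[w] for w in re.split(r"[- ]", phrase.lower()))


def normalize_refs(text):
    """Rewrite 'John three sixteen' as 'John 3:16'."""
    def repl(m):
        book = CANON[m.group(1).lower()]
        return f"{book} {to_int(m.group(2))}:{to_int(m.group(3))}"
    return REF.sub(repl, text)


print(normalize_refs("Turn to John three sixteen today."))
```

A rule this simple will misfire on ordinary speech (for example "I gave John three five dollar bills"), and "twenty one" is ambiguous between 21 and 20:1, so it would need to be opt-in and probably stricter in practice.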

In addition to normalizing the formatting to "Book_Name Chapter:Verse", could we generate a time-coded "index of verses referenced" list?

This would cover the 39 (Protestant) or 46 (Catholic) Old Testament book names and the 27 New Testament book names, or whatever other list of references should be formatted in a specific manner. Other groups (e.g. Islam, Mormons) have their own short lists of special titles that use a specialized chapter-and-verse reference format as well.

I would think some science journals also might have similar formatting standards.

Specifying an output format preference for special terms, and optionally generating an index list of those terms, would be a very useful feature for the transcription output. While the latter could likely be done in post-processing, for "accurate" transcription it seems like getting the normalized formatting right from Whisper itself would be ideal.

Is it within the scope of Whisper to add non-AI post-formatting corrections? Getting a clean transcription in one stop seems ideal.

jddcef · 2025-06-26T06:31:52Z

Interesting. Do you have any update on what you have found out or done since?

I asked @copilot https://github.com/copilot/share/c831531a-4064-80e0-8103-9a00444a605f to answer your question.
"It explained that while Whisper focuses on accurate transcription and does not natively support this kind of domain-specific formatting (and says it probably shouldn't have domain-specific things in it), its output can be post-processed with scripts to achieve the requested normalization and indexing. It also clarified that Whisper can output word-level timestamps if the appropriate setting is enabled, making such indexing feasible."
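To make the indexing idea concrete, here is a minimal sketch that builds a time-coded index from segment-level timestamps. It assumes segments shaped like the entries in the `segments` list of Whisper's transcription result (dicts with `start`, `end`, and `text`), and it assumes references have already been normalized to "Book Chapter:Verse". The sample data, the book subset, and the names `REF`, `fmt`, and `build_index` are made up for illustration.

```python
import re

# Illustrative subset only; extend with the full list of book names.
BOOKS = ["Genesis", "Psalms", "John", "Romans"]

# Optional leading book number, book name, chapter:verse, optional verse span.
REF = re.compile(
    r"\b(?:[1-3] )?(?:%s) \d+:\d+(?:-\d+)?" % "|".join(BOOKS))


def fmt(seconds):
    """Format a time in seconds as HH:MM:SS (fractions are truncated)."""
    s = int(seconds)
    return f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}"


def build_index(segments):
    """Return (timestamp, reference) pairs in order of appearance."""
    index = []
    for seg in segments:
        for m in REF.finditer(seg["text"]):
            index.append((fmt(seg["start"]), m.group(0)))
    return index


# Made-up sample segments, not real transcription output.
segments = [
    {"start": 0.0, "end": 4.2, "text": " Good morning."},
    {"start": 75.5, "end": 80.0,
     "text": " Please turn to John 3:16 and Romans 8:28-30."},
    {"start": 3725.0, "end": 3730.0, "text": " As 1 John 4:8 says."},
]

for stamp, ref in build_index(segments):
    print(stamp, ref)
```

Each reference is stamped with the start of its segment. Word-level timestamps would allow a tighter time code per reference, at the cost of matching the pattern across individual words.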

whicks1 · 2025-06-27T05:42:18Z

My group has transcribed hundreds of hours of university lectures, seminars, and related media from a Christian university, among many other general AV collections. Whisper produces passable, but by no means consistent, renderings of numbers across a variety of categories (especially at scale), and biblical references are no different. It’s been a while, but as best I recall, numbered books typically appeared as “First/Second” instead of 1/2, and never as Roman numerals. I never saw “v.” or “vv.” (implied or spoken); it was usually “verse/verses”, possibly but probably rarely misspelled as “versus”, with all manner of number-word combinations for chapter and verse numbers, but I don’t think ever natively as colon-separated integers. And I don’t remember any clean spans or abbreviations like “2:3-10”.

I have a larger post-processing script that attempts to normalize these where I can, but I found it needs to be done judiciously, since in more general collections you wouldn’t want to capitalize Numbers, Kings, etc.
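One way to be judicious, sketched below, is to capitalize an ambiguous word only when a chapter:verse pattern follows it directly. This is not the script mentioned above, just a minimal illustration; the word list and the names `AMBIGUOUS`, `PATTERN`, and `capitalize_book_refs` are hypothetical, and it assumes the numbers have already been normalized to digits with a colon.

```python
import re

# Words that are both common English words and book names (illustrative).
AMBIGUOUS = ["numbers", "kings", "judges", "acts", "job", "mark"]

# Match an ambiguous word only when " <chapter>:<verse>" follows it.
PATTERN = re.compile(
    r"\b(%s)(?= \d+:\d+)" % "|".join(AMBIGUOUS), re.IGNORECASE)


def capitalize_book_refs(text):
    """Capitalize ambiguous book names only inside chapter:verse references."""
    return PATTERN.sub(lambda m: m.group(1).capitalize(), text)


print(capitalize_book_refs("the numbers in numbers 6:24 add up"))
print(capitalize_book_refs("two kings ruled; see 2 kings 5:1"))
```

The lookahead leaves ordinary uses such as "two kings ruled" untouched, which matters for general collections where these words are rarely book titles.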

On a similar note, Whisper has long had known problems with hallucinations during silence and instrumental music, so various sorts of church services are going to suffer during those parts of a recording. And though it’s usually quite good at casing proper nouns, there are many instances in our datasets where common theological styles aren’t accounted for or consistent (The Word, King of Kings, Yahweh vs. YHWH, etc.). Possibly more concerning, we sometimes see dropped casing, with odd cases like “moses”, “god”, “christ”, etc.

My suggestion would be some robust regex searches across the output files in a text editor like Sublime Text, or feeding the outputs into your favorite LLM.
