Replies: 2 comments
-
Interesting. Could you update us on what you have found out or done since? I asked @copilot (https://github.com/copilot/share/c831531a-4064-80e0-8103-9a00444a605f) to answer your question.
-
My group has transcribed hundreds of hours of university lectures, seminars, and related media from a Christian university, among many other general AV collections. Whisper produces passable, but by no means consistent, renderings of numbers across a variety of categories (especially at scale), and biblical references are no different.
It's been a while, but as best I recall, numbered books typically appeared as "First/Second" instead of 1/2, and never as Roman numerals. I never saw "v." or "vv." (implied or spoken); it was usually "verse/verses", possibly but probably rarely misspelled as "versus". Chapter and verse numbers came out in all manner of number-word combinations, but I don't think ever natively as colon-separated integers. And I don't remember any clean spans or abbreviations like "2:3-10".
I have a larger post-processing script that attempts to normalize these where I can, but I found it needs to be done judiciously, since in more general collections you wouldn't want to capitalize Numbers, Kings, etc.
On a similar note, Whisper has long had known problems with hallucinations during silence and instrumental music, so various sorts of church services are going to suffer during those parts of a recording. Though it's usually quite good at casing proper nouns, there are many instances in our datasets where common theological styles aren't accounted for or consistent (The Word, King of Kings, Yahweh vs. YHWH, etc.). Possibly more concerning, we sometimes see dropped casing, so odd cases of "moses", "god", "christ", etc.
My suggestion would be some robust regex searches across the output files in a text editor like Sublime, or feeding the outputs into your favorite LLM.
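For anyone wanting a starting point, here is a minimal sketch of the kind of regex normalization described above. It is not the script mentioned in this reply; the book list, the `normalize` function, and the supported number range (1 to 99) are all assumptions made for illustration. To stay "judicious", it only rewrites a book name when it is directly followed by a chapter and a verse number, so a bare "Numbers" or "Kings" in general text is left alone.

```python
import re

# Illustrative subset only (an assumption); a real list would cover the canon in use.
BOOKS = ["Genesis", "Numbers", "Kings", "Psalms", "John", "Romans", "Corinthians"]
CANON = {b.lower(): b for b in BOOKS}
ORDINALS = {"first": "1", "second": "2", "third": "3",
            "1st": "1", "2nd": "2", "3rd": "3"}

ONES = "one two three four five six seven eight nine".split()
TEENS = ("ten eleven twelve thirteen fourteen fifteen "
         "sixteen seventeen eighteen nineteen").split()
TENS = "twenty thirty forty fifty sixty seventy eighty ninety".split()

# Word -> value lookup for spoken numbers 1..99.
VALUE = {w: i for i, w in enumerate(ONES, 1)}
VALUE.update({w: i for i, w in enumerate(TEENS, 10)})
VALUE.update({w: 10 * i for i, w in enumerate(TENS, 2)})

# One spoken or written number: "twenty one", "twenty-one", "sixteen", "three", "16".
NUM = r"(?:(?:%s)(?:[ -](?:%s))?|%s|%s|\d+)\b" % (
    "|".join(TENS), "|".join(ONES), "|".join(TEENS), "|".join(ONES))

REF = re.compile(
    r"\b(?:(?P<ord>first|second|third|1st|2nd|3rd)\s+)?"
    r"(?P<book>%s)\s+"
    r"(?:chapter\s+)?(?P<ch>%s)"
    r"(?:\s*,?\s*(?:verses?|versus)\s+|\s+)"  # "verse", "verses", or misspelled "versus"
    r"(?P<vs>%s)" % ("|".join(BOOKS), NUM, NUM),
    re.IGNORECASE)


def to_int(phrase):
    """Convert a phrase matched by NUM ('3', 'sixteen', 'twenty-one') to an int."""
    if phrase.isdigit():
        return int(phrase)
    return sum(VALUE[w] for w in re.split(r"[ -]", phrase.lower()))


def _fix(match):
    prefix = ORDINALS.get((match.group("ord") or "").lower(), "")
    book = CANON[match.group("book").lower()]  # also restores dropped casing
    ref = f"{book} {to_int(match.group('ch'))}:{to_int(match.group('vs'))}"
    return f"{prefix} {ref}" if prefix else ref


def normalize(text):
    """Rewrite spoken references such as 'John three sixteen' as 'John 3:16'."""
    return REF.sub(_fix, text)


print(normalize("Turn to first Corinthians chapter thirteen verse four."))
print(normalize("The book of Numbers is long."))
```

This does not handle verse spans, chapters above 99, or ambiguous runs of numbers, and it will still produce false positives on some ordinary sentences, so reviewing the matches before applying them in bulk seems wise.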
-
I can see Whisper being in great demand for weekly Sunday church services.
I haven't looked closely yet, but how does it handle Christian Bible book and verse references? Do they show up as "John three sixteen" or "John 3:16"? The latter format is considered normal and correct for such a reference.
This is a specific use case, but one that would likely be widely used: could we add an option to look for a set list of book names followed by two numbers and prefer the standard numerical representation with a colon between them?
As well as normalizing the formatting to "Book_Name Chapter:Verse", could we also generate a time-coded "index of verses referenced" list?
The list would be the 39 (Protestant) or 46 (Catholic) Old Testament book names and the 27 New Testament book names, or whatever other reference list should be formatted in a specific manner. Other groups (Islam, Mormons, etc.) have their own short lists of special titles that are also referenced in a specialized chapter-and-verse format.
I would think some science journals also might have similar formatting standards.
Specifying an output format preference for special terms, and optionally generating an index list of those terms, would be a very useful feature for the transcription output. While the latter could likely be done in post-processing, for "accurate" transcription it seems like getting the normalized formatting right from Whisper itself would be ideal.
Is it within the scope of Whisper to add non-AI post-formatting corrections? Getting a clean transcription in one stop seems ideal.
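To make the "index of verses referenced" idea concrete, here is a rough post-processing sketch. It assumes the transcript is available as a list of segment dictionaries with `start` (seconds) and `text` keys, which is the shape of the `segments` entry in the result of Whisper's `transcribe()`, and it assumes references have already been normalized to "Book Chapter:Verse". The book list and the `verse_index` and `timestamp` helpers are hypothetical names used only for illustration.

```python
import re

# Illustrative subset only (an assumption); extend to whichever list applies.
BOOKS = ["Genesis", "Psalms", "John", "Romans", "Corinthians"]

# Matches already-normalized references such as "John 3:16" or "1 Corinthians 13:4-7".
NORMALIZED_REF = re.compile(
    r"\b(?:[1-3] )?(?:%s) \d+:\d+(?:-\d+)?" % "|".join(BOOKS))


def timestamp(seconds):
    """Format a number of seconds as HH:MM:SS (fractions are truncated)."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def verse_index(segments):
    """Return (timestamp, reference) pairs in the order they occur."""
    index = []
    for seg in segments:
        for match in NORMALIZED_REF.finditer(seg["text"]):
            index.append((timestamp(seg["start"]), match.group(0)))
    return index


# Hand-written sample segments, not real Whisper output.
segments = [
    {"start": 75.2, "end": 80.0, "text": " Turn to John 3:16 with me."},
    {"start": 3725.0, "end": 3730.0, "text": " as 1 Corinthians 13:4-7 says"},
]
for stamp, ref in verse_index(segments):
    print(stamp, ref)
```

The timestamp is the start of the segment containing the reference, not the exact moment it is spoken, and a reference split across two segments would be missed.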