Suppress automatic punctuation but keep correct formatting of dates? #1597
Unanswered
martinmueller4voice
asked this question in
Q&A
Replies: 1 comment
-
Beyond dates, other items such as currency amounts, numbers, and honorifics may also be formatted incorrectly when tokens are suppressed (depending on the language). One way to fix this is to keep these tokens in Whisper and post-process the transcript instead. I wouldn't attempt to use regular expressions, but an NLP library such as spaCy should be able to separate instances of punctuation from the other classes where you want to retain periods, commas, etc. See for example |
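To give a rough idea of what "keep the tokens, then post-process" could look like, here is a minimal sketch. It is not spaCy and not a recommended final solution: it uses a deliberately naive character rule in plain Python (drop a punctuation mark unless it sits directly between two digits). The function name `strip_auto_punctuation` and the `PUNCT` set are made up for this illustration.

```python
# Naive stand-in for an NLP-based post-processor (assumption: punctuation
# between two digits belongs to a date, time, or number and should stay).
PUNCT = set(".,;:!?")


def strip_auto_punctuation(text: str) -> str:
    """Drop punctuation characters unless they sit between two digits."""
    kept = []
    for i, ch in enumerate(text):
        if ch in PUNCT:
            prev_is_digit = i > 0 and text[i - 1].isdigit()
            next_is_digit = i + 1 < len(text) and text[i + 1].isdigit()
            if not (prev_is_digit and next_is_digit):
                continue  # treated as sentence punctuation: drop it
        kept.append(ch)
    return "".join(kept)


cleaned = strip_auto_punctuation("Termin vom 23.12.2014, bitte bestätigen.")
print(cleaned)
```

This keeps `23.12.2014`, `14:30`, or `1.250,50` intact while removing sentence-level commas and periods. It would still strip the period from honorifics such as `Dr.` or from a German ordinal like `23. Dezember`, which is exactly the kind of case where a tokenizer or tagger from an NLP library is likely to do better.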
-
For a while I was content with preventing automatic punctuation by suppressing the corresponding tokens, as pointed out in issue #589.
By adding token 50364 to the list of tokens to suppress, I was even able to get rid of the token "(Sprecherwechsel)" that would sometimes pop up at the beginning of my dictations.
But now I noticed that when I suppress those punctuation tokens, dates in my (German) dictations don't get formatted correctly anymore.
Without suppressing tokens, saying
vom dreiundzwanzigsten zwölften zweitausendvierzehn
yields
vom 23.12.2014
(as expected). When I suppress the token for '.', then Whisper gives me
vom 2312 2014
Any idea how I can suppress only the non-spoken, automatically added punctuation but keep everything else intact?
TIA
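For readers unfamiliar with the suppression approach described above, the following sketch shows how a list of punctuation-only token ids might be collected. The vocabulary here is a toy dictionary with made-up ids, and `punctuation_only_ids` is a hypothetical helper; in practice the id-to-text mapping would come from Whisper's tokenizer, and the resulting list would be passed through the `suppress_tokens` decoding option.

```python
import string


def punctuation_only_ids(vocab: dict[int, str], keep: str = "") -> list[int]:
    """Return ids of tokens whose text consists solely of punctuation.

    Characters listed in `keep` are not treated as punctuation.
    """
    punct = set(string.punctuation) - set(keep)
    ids = []
    for token_id, text in vocab.items():
        stripped = text.strip()
        if stripped and all(ch in punct for ch in stripped):
            ids.append(token_id)
    return sorted(ids)


# Toy vocabulary with made-up ids (assumption: not Whisper's real ids).
toy_vocab = {1: ".", 2: ",", 3: " vom", 4: "23", 5: "12", 6: " 2014", 7: "?"}

suppress_all = punctuation_only_ids(toy_vocab)           # includes the "." token
keep_period = punctuation_only_ids(toy_vocab, keep=".")  # leaves "." available
print(suppress_all, keep_period)
```

The sketch also hints at why the date breaks: a token such as "." has one id regardless of whether it ends a sentence or separates day, month, and year, so suppressing that id presumably removes both uses at once. Keeping "." out of the suppression list and cleaning the transcript afterwards avoids that trade-off.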