Possibility of forced alignment support? #52
conncomg123 started this conversation in Ideas
Replies: 2 comments
-
Discussion #3 has some implementations for getting token/word-level timestamps. Getting timestamps for each phoneme would be difficult from Whisper models alone, because the model is trained end-to-end to predict BPE tokens directly, and those tokens are often a full word or a subword consisting of a few graphemes. An alternative could be to run an external forced alignment tool on the outputs of a Whisper model.
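One way to wire Whisper up to an external aligner is to write its transcript into the corpus layout that a tool like the Montreal Forced Aligner expects: a `.lab` text file with the same stem as each audio file. This is only a sketch; the `whisper_segments` structure is a hypothetical stand-in for real Whisper output, and the paths are illustrative.

```python
from pathlib import Path

# Hypothetical Whisper output: a list of segments, each with a "text"
# field. Real output would come from a Whisper transcription call.
whisper_segments = [
    {"text": " Hello world."},
    {"text": " This is a test."},
]

def write_mfa_lab(wav_path: str, segments, corpus_dir: str) -> Path:
    """Write a .lab transcript named after the audio file, following the
    Montreal Forced Aligner's corpus convention (audio + same-stem .lab)."""
    transcript = " ".join(seg["text"].strip() for seg in segments)
    out = Path(corpus_dir) / (Path(wav_path).stem + ".lab")
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(transcript + "\n", encoding="utf-8")
    return out

lab = write_mfa_lab("audio/sample.wav", whisper_segments, "corpus")
# The aligner would then be invoked on the corpus directory, e.g.:
#   mfa align corpus DICTIONARY ACOUSTIC_MODEL aligned_out
# producing per-word and per-phone timestamps (TextGrid files).
```

The aligner does the phoneme-level work here; Whisper only supplies the transcript, so alignment quality depends on how accurate that transcript is.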
-
Update: I've made a demo of obtaining word-level timestamps using the cross-attention patterns in the multilingual ASR notebook.
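The cross-attention idea can be sketched as a dynamic time warping problem: treat the decoder's cross-attention weights as a token-by-frame similarity matrix, find the minimum-cost monotonic path through its negation, and read off the first frame assigned to each token. The matrix below is synthetic, and the 20 ms frame spacing is an assumption about the encoder's output rate; real weights would come from the model's attention heads.

```python
import numpy as np

def dtw_path(cost):
    """Minimum-cost monotonic path through a (tokens x frames) cost matrix."""
    n, m = cost.shape
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost[i - 1, j - 1] + min(
                D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    # Backtrack from the bottom-right corner.
    i, j, path = n, m, [(n - 1, m - 1)]
    while (i, j) != (1, 1):
        i, j = min([(i - 1, j - 1), (i - 1, j), (i, j - 1)],
                   key=lambda s: D[s])
        path.append((i - 1, j - 1))
    return path[::-1]

# Synthetic cross-attention: 3 tokens x 6 audio frames, peaked along
# the intended alignment. Real values would be attention weights.
attention = np.array([
    [0.9, 0.8, 0.1, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.9, 0.8, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.1, 0.9, 0.9],
])
path = dtw_path(-attention)  # negate: high attention = low cost

FRAME_SEC = 0.02  # assumed spacing of encoder output frames
starts = {}
for tok, frame in path:
    starts.setdefault(tok, frame * FRAME_SEC)  # first frame per token
```

Each token's start time falls out of the path directly; word-level times are then just the start of each word's first token. This still gives word (not phoneme) granularity, since the path is over BPE tokens.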
-
This already supports timestamps for phrases; would it be possible to add forced alignment support? That is, given an audio file and, optionally, its transcript, could it produce a file containing the timestamps of each phoneme?