Replies: 2 comments 2 replies
-
any update on this?
-
Hi guys, I've encountered some issues with obtaining timestamps. I'm trying to get sentence-level timestamps from a fine-tuned Whisper large-v2 model through transformers.pipeline, but the results were unsatisfactory despite setting "return_timestamps" to True or to "word". I suspect this is because the model, having been fine-tuned on about 30,000 hours of audio data, has lost its ability to predict the special timestamp tokens needed for sentence- or word-level timestamps. Checking the output tokens reinforced this suspicion.

Consequently, I've come up with an alternative approach for obtaining token-level timestamps. Since the encoder's input has a fixed, ordered shape (30 s chunks, with a 25 ms time window and a 10 ms stride per frame), and given how transformer networks work, I believe the decoder's output (before post-processing) should also be fixed and ordered. This would imply that each output token can be mapped back to a position in the input, and from that mapping we could derive a timestamp for each output token.

However, I'm not very familiar with the transformer architecture or the Whisper pipeline. Could you help me assess whether this idea is feasible? Thanks!
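For anyone comparing notes, here is a minimal sketch of the two steps described above: requesting segment- and word-level timestamps through transformers.pipeline, and inspecting the raw generated IDs to see whether the fine-tuned model still produces the special timestamp tokens. The model path and audio inputs are placeholders; as far as I know, return_timestamps="word" is computed from cross-attention alignment (not from the timestamp tokens) and needs alignment_heads to be set in the model's generation config, which a fine-tuned checkpoint may or may not carry over.

```python
import numpy as np
from transformers import pipeline, WhisperProcessor, WhisperForConditionalGeneration

model_path = "path/to/finetuned-whisper-large-v2"  # placeholder for the fine-tuned checkpoint

# 1) Ask the ASR pipeline for timestamps.
pipe = pipeline(
    "automatic-speech-recognition",
    model=model_path,
    chunk_length_s=30,
)
segment_out = pipe("audio.wav", return_timestamps=True)   # segment-level, relies on the <|x.xx|> tokens
word_out = pipe("audio.wav", return_timestamps="word")    # word-level, uses cross-attention alignment
print(segment_out["chunks"])
print(word_out["chunks"])

# 2) Check whether the fine-tuned model still emits timestamp tokens at all.
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path)

# Stand-in for a real 16 kHz waveform (e.g. loaded with librosa or soundfile).
audio = np.zeros(16000 * 10, dtype=np.float32)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
generated_ids = model.generate(inputs.input_features, return_timestamps=True)

# Timestamp tokens (<|0.00|> ... <|30.00|>) occupy the IDs right after <|notimestamps|>.
timestamp_begin = processor.tokenizer.convert_tokens_to_ids("<|notimestamps|>") + 1
print("emits timestamp tokens:", bool((generated_ids >= timestamp_begin).any()))
```

If that is accurate, the word-level path is conceptually close to the idea above: the decoder tokens are aligned back to the encoder frames (roughly 20 ms each after the convolutional downsampling) through their cross-attention weights, rather than by output position.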