Hey guys, awesome work! Thank you for that.
I'm new to ML, and I just want to understand how inference actually happens.
Correct me if I'm wrong, but after training, the model has built a mapping between the wave features (frequency, amplitude, etc.) and the words. So when I say "High", the model extracts features from the audio after all the encoding and preprocessing, then finds the word that best matches those features.
We also added another "weight" based on the predictions dictionary to improve this: if I say "the mountain is high", the preceding "is" makes the model suggest "high" instead of "Hi."
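Just to make that conditioning idea concrete, here is a toy sketch (not Whisper's actual decoder, and the probabilities are made up): the previous token shifts the scores, so "is" favors "high" while a sentence start favors "Hi".

```python
# Toy conditional word model: score each candidate word given the
# previous token. All probabilities below are invented for illustration.
probs = {
    ("is", "high"): 0.8,
    ("is", "Hi"): 0.05,
    ("<start>", "Hi"): 0.6,
    ("<start>", "high"): 0.1,
}

def best_word(prev, candidates):
    # Pick the candidate with the highest probability given the previous token.
    return max(candidates, key=lambda w: probs.get((prev, w), 0.0))

print(best_word("is", ["high", "Hi"]))       # context "is" -> "high"
print(best_word("<start>", ["high", "Hi"]))  # sentence start -> "Hi"
```

Whisper's decoder does something far richer (a Transformer attending over all previous tokens and the audio), but the principle is the same: the probability of the next word depends on what came before.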
If the above is true, does that mean the more training you do, the bigger your model gets? Can a small model still be built after training on a vast dataset?
My second question is about the large model. It performs better on non-English languages than tiny, small, or medium. Can I extract a model for one specific language from the large model, so that it's not as big as large but performs better than medium?
Third question: does the model size impact the response time?
Thank you again; looking forward to hearing from you.
You're correct that the model learns the statistical correlations between the wave features and the words, but training does not add new weights to the model; rather, a fixed number of existing weights are adjusted to better represent the relationship between the audio and the transcript. For Whisper (except the new large-v2), we started with 5 different model sizes (i.e. the number of weights in each model) and trained each on the same amount of data. Typically, larger models are more flexible and end up performing better than the smaller ones.
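A minimal sketch of that point, using a two-parameter linear model instead of a Transformer: gradient descent adjusts the values of the weights, but the number of weights is fixed before training and never grows, no matter how many steps you run.

```python
# Fit y = w*x + b to toy data generated from y = 2x + 1.
# The model has exactly 2 parameters before, during, and after training;
# only their values change.
data = [(x, 2.0 * x + 1.0) for x in range(10)]

w, b = 0.0, 0.0   # 2 parameters, chosen before training starts
lr = 0.01
for _ in range(1000):        # more training steps != more parameters
    for x, y in data:
        err = (w * x + b) - y
        w -= lr * err * x    # adjust existing weight
        b -= lr * err        # adjust existing bias

print(round(w, 2), round(b, 2))  # converges near 2.0 and 1.0
n_params = 2                     # unchanged by training
```

The same logic applies to Whisper: "tiny" through "large" differ in how many weights they start with, and training a tiny model on a vast dataset still yields a tiny model.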