Speculative Decoding: 2x faster inference for Whisper large-v2 #1914
sanchit-gandhi started this conversation in Show and tell
Speculative decoding gives 2x faster Whisper inference while ensuring exactly the same outputs, making it the perfect drop-in replacement for existing Whisper pipelines ⚡️
Check out the blog post and accompanying Google Colab, or continue reading for details 👇
How does it work? 🧐
Speculative decoding uses a smaller, faster model to assist the generation of a slower, larger one 🤝 By auto-regressively generating with the smaller model and only running validation forward passes with the larger one, inference time can be reduced by a factor of 2 or more.
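To make the draft-and-verify loop concrete, here is a toy sketch of greedy speculative decoding. The two "model" functions are hypothetical stand-ins over integer tokens, not the actual Transformers implementation:

```python
# Toy sketch of greedy speculative decoding. `small_next_token` and
# `large_next_tokens` are hypothetical stand-in "models", not real Whisper.

def small_next_token(seq):
    # Fast draft model: greedy next-token prediction.
    return (seq[-1] + 1) % 10

def large_next_tokens(seq):
    # Slow main model: one forward pass returns its greedy prediction
    # for the token following every prefix of `seq`.
    return [(t + 1) % 10 if t != 5 else 0 for t in seq]

def speculative_decode(prompt, num_draft=4, max_len=16):
    seq = list(prompt)
    while len(seq) < max_len:
        # 1. Draft: the small model auto-regressively proposes candidates.
        draft = []
        for _ in range(num_draft):
            draft.append(small_next_token(seq + draft))
        # 2. Verify: a single forward pass of the large model scores
        #    all candidate positions at once.
        verified = large_next_tokens(seq + draft)
        # 3. Accept the longest draft prefix the large model agrees with.
        i = 0
        while i < len(draft) and draft[i] == verified[len(seq) + i - 1]:
            i += 1
        seq.extend(draft[:i])
        # 4. Append the large model's own token at the first mismatch (or a
        #    "bonus" token if every draft was accepted); this is why the
        #    output is identical to decoding with the large model alone.
        seq.append(verified[len(seq) - 1])
    return seq[:max_len]

print(speculative_decode([1]))  # matches greedy decoding with the large model
```

Every accepted draft token saves one forward pass of the large model, and the worst case degrades gracefully to ordinary decoding plus the (cheap) draft passes.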
What about the accuracy? 🎯
Since the larger model validates every candidate token from the smaller one, discarding any token where the two disagree and substituting its own prediction instead, speculative decoding mathematically guarantees the same outputs as running the main model alone. This means you can run the large-v2 model 2x faster with no degradation in word error rate.
Which models can I use? 🏎️
Speculative decoding applies to all languages covered by Whisper 🌎 For English speech recognition, you can use Distil-Whisper as the assistant to Whisper. For other languages, you can use Whisper tiny as the assistant to Whisper large-v2 and achieve comparable speed-ups. The only constraint is that the assistant model must use the same tokenizer as the larger one.
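For example, switching the assistant to Whisper tiny for multilingual transcription is a one-line change (a sketch using the standard Transformers API and Hub checkpoint names):

```python
from transformers import AutoModelForSpeechSeq2Seq

# For non-English speech, use Whisper tiny as the assistant
# (Distil-Whisper is English-only); both share the Whisper tokenizer.
assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny")
```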
How do I get started? 👨‍💻
Speculative decoding is fully supported in the 🤗 Transformers library. You simply need to pass the assistant model to the `generate` method at inference time, and the algorithm will be applied using the inputs you provide. Here's a minimal working example for getting started.
First, install the Transformers and Accelerate libraries. We'll also install Datasets to load and pre-process a single audio example from the LibriSpeech dataset.
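For example, in a pip-based environment (the `[audio]` extra pulls in the audio decoding dependencies for Datasets):

```bash
pip install --upgrade transformers accelerate datasets[audio]
```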
We can then run inference with Whisper large-v2, using Distil-Whisper as the assistant:
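Here is a minimal sketch of the full pipeline, reconstructed from the Transformers API (see the linked blog post for the exact code):

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Main model: Whisper large-v2
model_id = "openai/whisper-large-v2"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Assistant model: Distil-Whisper (shares the Whisper tokenizer)
assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "distil-whisper/distil-large-v2",
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
).to(device)

# Load a single audio example from LibriSpeech
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features.to(device, dtype=torch_dtype)

# Passing `assistant_model` to `generate` enables speculative decoding
predicted_ids = model.generate(input_features, assistant_model=assistant_model)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```

The transcription is identical to what `model.generate(input_features)` would produce on its own, just faster.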
For more details, refer to the blog post and accompanying Google Colab: https://huggingface.co/blog/whisper-speculative-decoding
Replies: 1 comment, 4 replies

Not sure... I tried it and I'm not seeing a speed improvement. I was using it with openai/whisper-large-v3.