Speculative Decoding: 2x faster inference for Whisper large-v2 #1914
sanchit-gandhi started this conversation in Show and tell
Speculative decoding gives 2x faster Whisper inference while ensuring exactly the same outputs, making it the perfect drop-in replacement for existing Whisper pipelines ⚡️
Check out the blog post and accompanying Google Colab, or continue reading for details 👇
How does it work? 🧐
Speculative decoding uses a smaller, faster model to assist the generation of a slower, larger one 🤝 By auto-regressively generating with the smaller model and only running validation forward passes with the larger one, inference time can be reduced by a factor of 2 or more.
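To make the draft-and-verify loop concrete, here is a toy sketch of greedy speculative decoding. The two "model" functions are hypothetical stand-ins over integer tokens, not the actual Transformers implementation:

```python
# Toy sketch of greedy speculative decoding. `small_next_token` and
# `large_next_tokens` are hypothetical stand-in "models", not real Whisper.

def small_next_token(seq):
    # Fast draft model: greedy next-token prediction.
    return (seq[-1] + 1) % 10

def large_next_tokens(seq):
    # Slow main model: one forward pass returns its greedy prediction
    # for the token following every prefix of `seq`.
    return [(t + 1) % 10 if t != 5 else 0 for t in seq]

def speculative_decode(prompt, num_draft=4, max_len=16):
    seq = list(prompt)
    while len(seq) < max_len:
        # 1. Draft: the small model auto-regressively proposes candidates.
        draft = []
        for _ in range(num_draft):
            draft.append(small_next_token(seq + draft))
        # 2. Verify: a single forward pass of the large model scores
        #    all candidate positions at once.
        verified = large_next_tokens(seq + draft)
        # 3. Accept the longest draft prefix the large model agrees with.
        i = 0
        while i < len(draft) and draft[i] == verified[len(seq) + i - 1]:
            i += 1
        seq.extend(draft[:i])
        # 4. Append the large model's own token at the first mismatch (or a
        #    "bonus" token if every draft was accepted); this is why the
        #    output is identical to decoding with the large model alone.
        seq.append(verified[len(seq) - 1])
    return seq[:max_len]

print(speculative_decode([1]))  # matches greedy decoding with the large model
```

Every accepted draft token saves one forward pass of the large model, and the worst case degrades gracefully to ordinary decoding plus the (cheap) draft passes.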
What about the accuracy? 🎯
Since the larger model validates every candidate token from the smaller one, discarding any token where the two disagree and substituting its own prediction instead, speculative decoding mathematically guarantees the same outputs as running the main model alone. This means you can run the large-v2 model 2x faster with no degradation in word error rate.
Which models can I use? 🏎️
Speculative decoding applies to all languages covered by Whisper 🌎 For English speech recognition, you can use Distil-Whisper as the assistant to Whisper. For other languages, you can use Whisper tiny as the assistant to Whisper large-v2 and achieve comparable speed-ups. The only constraint is that the assistant model must use the same tokenizer as the larger one.
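For example, switching the assistant to Whisper tiny for multilingual transcription is a one-line change (a sketch using the standard Transformers API and Hub checkpoint names):

```python
from transformers import AutoModelForSpeechSeq2Seq

# For non-English speech, use Whisper tiny as the assistant
# (Distil-Whisper is English-only); both share the Whisper tokenizer.
assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny")
```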
How do I get started? 👨‍💻
Speculative decoding is fully supported in the 🤗 Transformers library. You simply need to pass the assistant model to the `generate` method at inference time, and the algorithm will be applied using the inputs you provide. Here's a minimal working example for getting started.
First, install the Transformers and Accelerate libraries. We'll also install Datasets to load and pre-process a single audio example from the LibriSpeech dataset.
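For example, in a pip-based environment (the `[audio]` extra pulls in the audio decoding dependencies for Datasets):

```bash
pip install --upgrade transformers accelerate datasets[audio]
```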
We can then run inference with Whisper large-v2, using Distil-Whisper as the assistant:
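Here is a minimal sketch of the full pipeline, reconstructed from the Transformers API (see the linked blog post for the exact code):

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Main model: Whisper large-v2
model_id = "openai/whisper-large-v2"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Assistant model: Distil-Whisper (shares the Whisper tokenizer)
assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "distil-whisper/distil-large-v2",
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
).to(device)

# Load a single audio example from LibriSpeech
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features.to(device, dtype=torch_dtype)

# Passing `assistant_model` to `generate` enables speculative decoding
predicted_ids = model.generate(input_features, assistant_model=assistant_model)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```

The transcription is identical to what `model.generate(input_features)` would produce on its own, just faster.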
For more details, refer to the blog post and accompanying Google Colab: https://huggingface.co/blog/whisper-speculative-decoding
Replies: 1 comment, 4 replies

Not sure... I tried it and I'm not seeing a speed improvement. I was using it with openai/whisper-large-v3.