Replies: 1 comment
-
This largely depends on your hardware and the quantization that you use.
The ideal option is to perform sampling on the GPU - some discussion here: #5214
-
I've been looking into why the lookahead decoding implementation added in #4207 doesn't seem to be quite as fast as the paper advertises.
I've discovered that the lookahead decoding implementation requires ~N times the number of sampling evaluations and is therefore penalized much more by slow sampling. In particular, the crux of the improvement seems to be exploiting very cheap sampling relative to the slower sequential evaluation. That is (if I understand the paper correctly), sampling an additional O(W) tokens, at the cost of O(W) extra sampling calls plus one O(W)-sized batch decode, is cheaper than O(W) sequential batch decodes.
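To make that trade-off concrete, here is a minimal cost-model sketch. The window size and the timings (`t_sample`, `t_batch`, `t_single`) are made-up illustrative values, not measurements from llama.cpp:

```cpp
#include <cstdio>

// Rough cost model of the trade-off described above: one lookahead step
// (a batched decode covering ~W positions plus ~W extra sampling calls)
// versus W sequential single-token decodes. All timings are hypothetical
// placeholders in milliseconds, not measurements.
int main() {
    const int    W        = 4;     // lookahead window size
    const double t_sample = 0.5;   // one sampling call, O(n_vocab)
    const double t_batch  = 12.0;  // one batched decode of ~W positions
    const double t_single = 10.0;  // one single-token decode

    const double lookahead  = t_batch + W * t_sample;  // 12 + 4*0.5 = 14
    const double sequential = W * t_single;            //      4*10  = 40

    printf("lookahead  ~ %.1f ms\n", lookahead);
    printf("sequential ~ %.1f ms\n", sequential);

    // The W*t_sample term is pure overhead on the lookahead side, so the
    // advantage shrinks as sampling gets slower (e.g. with a larger vocabulary).
    return 0;
}
```

Under this framing, every millisecond added to a sampling call is multiplied by W on the lookahead side, which is why slow sampling eats into the speedup so quickly.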
The n-gram management cost, on the other hand, seems largely negligible from what I can tell.
This led me to two optimizations (submitted as separate PRs) that luckily also improve performance outside of lookahead decoding.
After these two PRs, I am able to see lookahead decoding perform about 5% faster with W=4, N=3, G=4, temp=0.0, which is still quite a ways from the ~50% advertised in the paper. Additionally, my results are likely also worse than the paper's because sampling is O(n_vocab) and I am testing Llama 3, whose vocabulary (~128k tokens) is roughly 4x larger than that of the Llama 2 models used in the paper (32k tokens), so my sampling cost goes up accordingly while the batch evaluation is not any more expensive. This is a caveat that isn't really mentioned in the paper.
Has anyone else looked into this? I'd like to open up this discussion to talk about possible performance improvements to the sampling code, in particular a few sampling-specific ideas I am thinking about right now:
- `temp=0.0` currently performs two passes over the logits: one to fetch the logits and one to calculate the max. We could instead perform a single pass to roughly double the performance (see the sketch below).
- Can `llama_sample_token_with_rng` be streamlined? We seem to take a vector of logits, wrap them in structs, apply temperature, and then unwrap them again to pass to `std::discrete_distribution`. This seems a bit roundabout.

Additionally, I'd like to start looking into improving batch decode performance too. I am a little surprised at the performance penalty there (I'm seeing ~10 ms per additional batch), but I haven't dug too deep yet and would love to hear other perspectives on potential improvements.
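To make those two ideas concrete, here is a minimal sketch of what they could look like. It operates directly on a raw `float * logits` array; `greedy_argmax` and `sample_with_temp` are hypothetical helpers written for this post, not existing llama.cpp functions:

```cpp
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

// Illustrative helpers operating directly on the raw logits array.
// These sketch the two ideas above; they are not part of the llama.cpp API.

// temp == 0.0: find the argmax in a single pass over the logits instead of
// first copying them into candidate structs and then scanning for the max.
static int greedy_argmax(const float * logits, int n_vocab) {
    int best_id = 0;
    for (int i = 1; i < n_vocab; ++i) {
        if (logits[i] > logits[best_id]) {
            best_id = i;
        }
    }
    return best_id;
}

// temp > 0.0: apply temperature while exponentiating, then hand the weights
// straight to std::discrete_distribution (which normalizes them internally),
// skipping the intermediate wrap/unwrap of per-token structs.
static int sample_with_temp(const float * logits, int n_vocab, float temp, std::mt19937 & rng) {
    const float max_logit = *std::max_element(logits, logits + n_vocab);

    std::vector<double> weights(n_vocab);
    for (int i = 0; i < n_vocab; ++i) {
        weights[i] = std::exp((logits[i] - max_logit) / temp);
    }

    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    return dist(rng);
}
```

Whether this wins in practice would need benchmarking, but it avoids the second pass in the greedy case and the temporary candidate vector in the stochastic case.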