# Advanced Efficient Techniques

In addition to standard model compression methods, advanced approaches
are being developed to accelerate the decoding process of large models.
These methods include generating candidate tokens with a smaller model
and verifying multiple tokens in a single step, thereby accelerating
decoding. Furthermore, some techniques exploit the memory hierarchy for
high-throughput computation, aiming to reduce memory I/O and thus
improve efficiency.

## Speculative Decoding

Speculative decoding is a strategy for speeding up the decoding
process, based on two insights from Leviathan et al.
[@leviathan2023fast]:

1.  Complex modeling tasks frequently contain simpler subtasks that can
    be approximated effectively by more efficient models.

2.  By combining speculative execution with a novel sampling scheme,
    exact decoding from the larger model can be accelerated: the
    outputs of the approximation model are verified by the larger model
    in parallel.

Figure :numref:`ch-deploy/sd` gives a brief overview of speculative
decoding. A draft model, which is smaller and less complex, first
generates a sequence of tokens. These draft tokens are then verified in
parallel by the larger target model. The tokens that appear in the
final output are the draft tokens accepted by the target model. If a
rejection occurs, one additional token is resampled from an adjusted
distribution; if no rejection occurs, the target model generates one
extra token using the draft tokens as context.

<figure id="fig:ch-deploy/sd">
<div class="center">
<img src="../img/ch08/sd.png" style="width:95.0%" />
</div>
<figcaption>Speculative Decoding Overview</figcaption>
</figure>

To elaborate, the draft model first generates a sequence of $\gamma$
tokens, denoted $x_1, x_2, \ldots, x_{\gamma}$, and records their
distributions $q_{1}(x), q_{2}(x), \ldots, q_{\gamma}(x)$ for later
verification by the target model. The $\gamma$ tokens are then fed into
the target model in parallel to compute the distributions
$p_{1}(x), p_{2}(x), \ldots, p_{\gamma+1}(x)$, derived from
$M_{\text{target}}(\text{prefix} + [x_1, \ldots, x_{\gamma}])$. For
each draft token, if $q(x) \leq p(x)$, the token is accepted.
Otherwise, the token is rejected with probability
$1 - \frac{p(x)}{q(x)}$, in which case a replacement token is sampled
from the adjusted distribution:

$$p'(x) = \mathrm{norm}(\max(0, p(x) - q(x)))$$
:eqlabel:`equ:sd_adjusted`

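To make the adjustment concrete, the following sketch computes the
adjusted distribution for a toy three-token vocabulary. The probability
vectors `p` and `q` are made-up illustrative values, not numbers from
the paper:

```python
# Toy illustration of p'(x) = norm(max(0, p(x) - q(x))).
# The probability vectors below are illustrative values only.
p = [0.5, 0.3, 0.2]   # target model distribution over a 3-token vocabulary
q = [0.2, 0.6, 0.2]   # draft model distribution over the same vocabulary

residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]  # max(0, p - q)
total = sum(residual)
p_adjusted = [r / total for r in residual]              # renormalize to sum to 1

print(p_adjusted)  # only tokens where p(x) > q(x) keep nonzero mass
```

Note that the residual mass is positive whenever a rejection is
possible: if $q(x) > p(x)$ for the rejected token, some other token
must satisfy $p(x) > q(x)$, since both distributions sum to one.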
Leviathan et al. [@leviathan2023fast] prove that resampling from this
adjusted distribution yields exactly the same output distribution as
sampling from the target model alone.
