# Advanced Efficient Techniques

In addition to standard model compression methods, advanced approaches
are being developed to accelerate the decoding process of large models.
These methods include generating draft tokens with smaller models and
verifying multiple tokens in a single step, thereby accelerating
decoding. Other techniques exploit the memory hierarchy for
high-throughput computation, aiming to reduce memory I/O and, as a
result, improve efficiency.

## Speculative Decoding

Speculative decoding is a strategy for speeding up the decoding
process, based on two insights from Leviathan et al.
[@leviathan2023fast]:

1. Complex modeling tasks frequently encompass simpler subtasks that
   can be effectively approximated by more efficient models.

2. Combining speculative execution with a novel sampling scheme makes
   it possible to accelerate exact decoding from the larger model, by
   having it process the approximation model's outputs in parallel.

Figure :numref:`ch-deploy/sd` gives a brief overview of speculative
decoding. A draft model, which is smaller and less complex, first
generates a series of tokens. These draft tokens are then verified in
parallel by the target model, the larger model. The tokens that finally
appear in the output are those draft tokens accepted by the target
model. If a rejection occurs, one additional token is resampled from an
adjusted distribution; if there is no rejection, an extra token is
generated by the target model using the draft tokens as context.

<figure id="fig:ch-deploy/sd">
<div class="center">
<img src="../img/ch08/sd.png" style="width:95.0%" />
</div>
<figcaption>Speculative Decoding Overview</figcaption>
</figure>

To elaborate, the draft model first generates a series of $\gamma$
tokens, denoted $x_1, x_2, \ldots, x_{\gamma}$, and preserves their
distributions $q_{1}(x), q_{2}(x), \ldots, q_{\gamma}(x)$ for later
verification by the target model. These $\gamma$ tokens are then fed
into the target model in parallel to compute the distributions
$p_{1}(x), p_{2}(x), \ldots, p_{\gamma+1}(x)$, derived from
$M_{\text{target}}(\text{prefix} + [x_1, \ldots, x_{\gamma}])$. For
each draft token, if $q(x) \leq p(x)$ the token is retained; otherwise,
it is rejected with probability $1 - \frac{p(x)}{q(x)}$ and resampled
from an adjusted distribution:

$$p'(x) = \operatorname{norm}(\max(0, p(x) - q(x)))$$
:eqlabel:`equ:sd_adjusted`
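The draft-then-verify step above can be sketched in a few lines. This
is a minimal NumPy illustration, not the paper's implementation: the
`draft_model` and `target_model` callables are hypothetical stand-ins
that map a token list to a probability distribution over the
vocabulary.

```python
import numpy as np

def speculative_decode_step(prefix, draft_model, target_model, gamma, rng):
    """One speculative decoding step: draft gamma tokens, then verify them.

    draft_model / target_model are placeholders for real LMs; each maps
    a list of token ids to a probability distribution over the vocabulary.
    """
    # 1. The draft model proposes gamma tokens autoregressively,
    #    recording each distribution q_i(x) for later verification.
    drafts, q_dists, ctx = [], [], list(prefix)
    for _ in range(gamma):
        q = draft_model(ctx)
        x = int(rng.choice(len(q), p=q))
        drafts.append(x)
        q_dists.append(q)
        ctx.append(x)
    # 2. The target model scores all gamma+1 positions; a real
    #    implementation batches these into a single parallel forward pass.
    p_dists = [target_model(list(prefix) + drafts[:i]) for i in range(gamma + 1)]
    # 3. Verify drafts left to right, accepting with probability
    #    min(1, p(x)/q(x)); this matches "retain if q(x) <= p(x),
    #    else reject with probability 1 - p(x)/q(x)".
    accepted = []
    for i, x in enumerate(drafts):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[x] / q[x]):
            accepted.append(x)
        else:
            # Rejection: resample from p'(x) = norm(max(0, p(x) - q(x))).
            r = np.maximum(0.0, p - q)
            accepted.append(int(rng.choice(len(r), p=r / r.sum())))
            return accepted
    # 4. All drafts accepted: the target model emits one extra token.
    bonus = p_dists[gamma]
    accepted.append(int(rng.choice(len(bonus), p=bonus)))
    return accepted
```

When the draft distribution matches the target distribution, every
draft token is accepted and each step yields $\gamma + 1$ tokens; in
practice the speedup depends on how often the draft agrees with the
target.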

Leviathan et al. [@leviathan2023fast] proved the correctness of
resampling from this adjusted distribution: the combined
accept/resample scheme produces samples from exactly the target
model's distribution.