# Advanced Efficient Techniques

In addition to standard model compression methods, advanced approaches
are being developed to accelerate the decoding process of large models.
These methods include generating draft tokens with smaller models and
verifying multiple tokens in a single step, thereby accelerating
decoding. Other techniques exploit the memory hierarchy for
high-throughput computation, aiming to reduce memory I/O and, as a
result, improve efficiency.

## Speculative Decoding

Speculative decoding is a strategy for speeding up the decoding
process, based on two insights from Leviathan et al.
[@leviathan2023fast]:

1. Complex modeling tasks frequently encompass simpler subtasks that
   can be effectively approximated by more efficient models.

2. Combining speculative execution with a novel sampling scheme makes
   it possible to accelerate exact decoding from the larger model, by
   having it process the approximation model's outputs in parallel.

Figure :numref:`ch-deploy/sd` gives a brief overview of speculative
decoding. A draft model, which is smaller and less complex, first
generates a series of tokens. These draft tokens are then verified in
parallel by the target model, the larger model. The tokens that finally
appear in the output are those draft tokens accepted by the target
model. If a rejection occurs, one additional token is resampled from an
adjusted distribution; if there is no rejection, an extra token is
generated by the target model using the draft tokens as context.

<figure id="fig:ch-deploy/sd">
<div class="center">
<img src="../img/ch08/sd.png" style="width:95.0%" />
</div>
<figcaption>Speculative Decoding Overview</figcaption>
</figure>

To elaborate, the draft model first generates a series of $\gamma$
tokens, denoted $x_1, x_2, \ldots, x_{\gamma}$, and preserves their
distributions $q_{1}(x), q_{2}(x), \ldots, q_{\gamma}(x)$ for later
verification by the target model. These $\gamma$ tokens are then fed
into the target model in parallel to compute the distributions
$p_{1}(x), p_{2}(x), \ldots, p_{\gamma+1}(x)$, derived from
$M_{\text{target}}(\text{prefix} + [x_1, \ldots, x_{\gamma}])$. For
each draft token, if $q(x) \leq p(x)$ the token is retained; otherwise,
it is rejected with probability $1 - \frac{p(x)}{q(x)}$ and resampled
from an adjusted distribution:

$$p'(x) = \operatorname{norm}(\max(0, p(x) - q(x)))$$
:eqlabel:`equ:sd_adjusted`
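The draft-then-verify step above can be sketched in a few lines. This
is a minimal NumPy illustration, not the paper's implementation: the
`draft_model` and `target_model` callables are hypothetical stand-ins
that map a token list to a probability distribution over the
vocabulary.

```python
import numpy as np

def speculative_decode_step(prefix, draft_model, target_model, gamma, rng):
    """One speculative decoding step: draft gamma tokens, then verify them.

    draft_model / target_model are placeholders for real LMs; each maps
    a list of token ids to a probability distribution over the vocabulary.
    """
    # 1. The draft model proposes gamma tokens autoregressively,
    #    recording each distribution q_i(x) for later verification.
    drafts, q_dists, ctx = [], [], list(prefix)
    for _ in range(gamma):
        q = draft_model(ctx)
        x = int(rng.choice(len(q), p=q))
        drafts.append(x)
        q_dists.append(q)
        ctx.append(x)
    # 2. The target model scores all gamma+1 positions; a real
    #    implementation batches these into a single parallel forward pass.
    p_dists = [target_model(list(prefix) + drafts[:i]) for i in range(gamma + 1)]
    # 3. Verify drafts left to right, accepting with probability
    #    min(1, p(x)/q(x)); this matches "retain if q(x) <= p(x),
    #    else reject with probability 1 - p(x)/q(x)".
    accepted = []
    for i, x in enumerate(drafts):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[x] / q[x]):
            accepted.append(x)
        else:
            # Rejection: resample from p'(x) = norm(max(0, p(x) - q(x))).
            r = np.maximum(0.0, p - q)
            accepted.append(int(rng.choice(len(r), p=r / r.sum())))
            return accepted
    # 4. All drafts accepted: the target model emits one extra token.
    bonus = p_dists[gamma]
    accepted.append(int(rng.choice(len(bonus), p=bonus)))
    return accepted
```

When the draft distribution matches the target distribution, every
draft token is accepted and each step yields $\gamma + 1$ tokens; in
practice the speedup depends on how often the draft agrees with the
target.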

Leviathan et al. [@leviathan2023fast] proved the correctness of
resampling from this adjusted distribution: the combined
accept/resample scheme produces samples from exactly the target
model's distribution.