Commit 4f49c83

debug

1 parent acaa8ee commit 4f49c83

File tree

1 file changed: +102 -1 lines changed


chapter_model_deployment/Advanced_Efficient_Techniques.md

Lines changed: 102 additions & 1 deletion
@@ -58,4 +58,105 @@ $$p'(x) = norm(max(0, p(x) - q(x)))$$

In the paper [@leviathan2023fast], Leviathan et al. have proved the
correctness of this adjusted distribution for resampling.

Assume that the execution time of a single step of the target model is
$T$ and that of the draft model is $cT$, where $0 < c \leq 1$. The
standard procedure of generating $\gamma + 1$ tokens with the target
model alone requires a total time of $\gamma T + T$. In contrast, with
speculative decoding, where $\gamma + 1$ tokens are produced ($\gamma$
by the draft model and one additional token by the target model during
the parallel verification), the time required is $\gamma cT + T$. If
all $\gamma$ draft tokens are accepted by the target model and $c$ is
small enough that $cT \ll T$, speculative decoding can significantly
reduce decoding latency.
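
As an illustration with hypothetical values $\gamma = 4$ and $c = 0.1$:
in the best case, where every draft token is accepted, one superstep
produces 5 tokens in $4 \times 0.1T + T = 1.4T$, compared with $5T$
when the target model generates them one by one -- roughly a
$3.6\times$ reduction.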

To further explain, denote $\alpha = E(\beta)$, where $\beta$ is the
acceptance rate given a prefix; $E(\beta)$ is a natural measure of how
well the draft model approximates the target model. Assuming the
$\beta$s are i.i.d., the expected number of tokens generated by one
speculative step is $\frac{1-\alpha^{\gamma+1}}{1-\alpha}$
[@leviathan2023fast]. Since one speculative superstep takes
$\gamma cT + T$, the expected time to generate one token with
speculative decoding is
$\frac{(c\gamma+1)(1-\alpha)}{1-\alpha^{\gamma+1}}T$. Choosing a good
$\gamma$ and a well-aligned, efficient draft model -- that is, a large
$\alpha$ and a small $c$ -- yields the desired speedup.
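
The short Python sketch below (not code from the paper) simply plugs
hypothetical values of $\alpha$, $c$, and $\gamma$ into the formulas
above to compare the expected per-token cost against plain target-model
decoding.

```python
# Illustrative sketch: expected per-token cost of speculative decoding,
# computed from the formulas above with T normalized to 1. The alpha,
# c, and gamma values below are made-up examples, not measurements.

def expected_time_per_token(alpha: float, c: float, gamma: int) -> float:
    """Expected time (in units of T) to generate one token."""
    expected_tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)  # tokens per superstep
    superstep_time = c * gamma + 1                              # (gamma*cT + T) / T
    return superstep_time / expected_tokens

if __name__ == "__main__":
    for gamma in (2, 4, 8):
        t = expected_time_per_token(alpha=0.8, c=0.1, gamma=gamma)
        print(f"gamma={gamma}: {t:.3f} T per token (vs. 1.0 T without speculation)")
```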

Nevertheless, as the value of $\gamma$ rises, it becomes progressively
harder for a draft model to generate draft tokens with a high
acceptance rate, and the likelihood of acceptance typically diminishes
once $\gamma$ exceeds a certain value. In the worst case, if every
draft token is rejected by the target model, only the one token
resampled from the adjusted distribution is decoded in that speculative
step. The $\gamma cT$ spent generating the $\gamma$ draft tokens is
then entirely wasted compared with generating a single token directly
with the target model; in addition, the draft model still occupies GPU
memory.

Therefore, it is important to find the best $\gamma$ or to design a
draft model whose tokens are reliably accepted by the target model.
Several strategies can be employed to address this issue. For example:

**Self-Derived Drafts from Target Models**

Is it possible to use the target model itself as the draft model,
rather than employing a separate smaller model that increases GPU
memory usage? The answer is yes. The procedure is similar to the
original approach, except that the target model also produces the draft
tokens and then self-verifies them. The advantages of this method are:

1. Since the draft model is essentially the same as the target model,
   it is robust enough to maintain a stable acceptance rate.

2. Only one model needs to be kept in GPU memory.

The challenge then lies in generating multiple future tokens in a
single decoding step. To achieve this, additional layers, evaluated in
parallel, are appended to the model's existing output layer. Stern et
al. first proposed this method in [@stern2018blockwise].

The training of these extra layers can either start from scratch with
the target model or involve fine-tuning a pre-trained model. This
approach forms the basis of Medusa [@medusa]. Medusa's architecture
attaches extra "Medusa heads" after the last hidden layer, enabling the
model to generate a range of token candidates in a single decoding
step. These candidates then undergo a self-verification process, and
only the accepted tokens are kept.
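
As a rough sketch of the idea (not Medusa's actual implementation), the
PyTorch snippet below attaches a few extra prediction heads to the
final hidden state of a backbone model; the class name, head structure,
and default sizes are placeholders chosen for illustration.

```python
# Minimal PyTorch sketch of Medusa-style extra decoding heads.
# The head structure and names are simplified placeholders.
import torch
import torch.nn as nn

class SpeculativeHeads(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 4):
        super().__init__()
        # One extra head per future position: head k predicts token t+k+1
        # from the last hidden state at position t.
        self.heads = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, hidden_size),
                nn.SiLU(),
                nn.Linear(hidden_size, vocab_size),
            )
            for _ in range(num_heads)
        )

    def forward(self, last_hidden: torch.Tensor) -> list[torch.Tensor]:
        # last_hidden: (batch, hidden_size) from the target model's final layer.
        # Returns one logits tensor per future position; candidates sampled
        # from these are then self-verified by the target model in one pass.
        return [head(last_hidden) for head in self.heads]
```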

Other methodologies -- such as applying knowledge distillation between
the draft and target models, employing multiple draft models instead of
just one, or replacing the draft model with a retrieval dataset -- are
still being investigated to determine their effectiveness and
reliability.

Speculative decoding is an effective technique that uses smaller models
to reduce the overhead caused by larger models. With a well-trained and
well-aligned draft model, the efficiency of the decoding process can be
significantly improved.

## FlashAttention

FlashAttention is an optimization technique that exploits the GPU
memory hierarchy to improve the efficiency of attention computation in
transformer models, in terms of both memory usage and speed.

Dao et al. first proposed this approach in [@dao2022flashattention].
They noted the absence of *IO-awareness* -- the consideration of I/O
interactions across GPU memory layers -- in the classic Scaled
Dot-Product Attention algorithm. To address this, they introduced
FlashAttention, an enhanced attention algorithm designed to minimize
accesses to the GPU's high-bandwidth memory (HBM). This innovation led
to significant gains in both computational speed and throughput.

Figure :numref:`ch-deploy/memory` shows the memory hierarchy with the
corresponding bandwidths. The main goal of FlashAttention is to avoid
reading and writing the large attention matrix to and from HBM and to
perform as much of the computation as possible in SRAM.
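
A minimal NumPy sketch of this tiling idea follows: keys and values are
processed in blocks while a running (online) softmax is maintained, so
the full attention matrix is never materialized. The block size and
array shapes are arbitrary illustrations, and this is not the actual
fused CUDA kernel.

```python
# Minimal NumPy sketch of the tiling behind FlashAttention: process K/V
# in blocks and keep a running softmax so the full (N x N) attention
# matrix is never formed. Illustrative only, not the real kernel.
import numpy as np

def tiled_attention(Q, K, V, block=128):
    N, d = Q.shape
    out = np.zeros((N, V.shape[1]))
    row_max = np.full(N, -np.inf)   # running max of scores per query row
    row_sum = np.zeros(N)           # running softmax denominator
    scale = 1.0 / np.sqrt(d)
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) * scale                       # scores for this block only
        new_max = np.maximum(row_max, S.max(axis=1))
        rescale = np.exp(row_max - new_max)          # correct earlier partial sums
        P = np.exp(S - new_max[:, None])
        row_sum = row_sum * rescale + P.sum(axis=1)
        out = out * rescale[:, None] + P @ Vb
        row_max = new_max
    return out / row_sum[:, None]

# Example usage with random data; the result matches standard softmax attention.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((512, 64)) for _ in range(3))
    print(tiled_attention(Q, K, V).shape)  # (512, 64)
```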
