
Commit a2f82d1

committed
final draft for deepseek r1 page
1 parent 62272f5 commit a2f82d1

File tree

1 file changed (+21, -18 lines)


docs/natural_language_processing/deepseek.md

Lines changed: 21 additions & 18 deletions
@@ -31,44 +31,42 @@ DeepSeek-R1 inherits DeepSeek-V3's MoE architecture featuring:
- Shared experts: 8 always-active generalist modules handling common patterns
- Routed experts: 128 specialized modules activated based on input content

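To make the shared-vs-routed split concrete, here is a minimal, illustrative sketch of such a layer in PyTorch. This is not DeepSeek's actual implementation: the experts are plain linear layers, and `d_model` and `top_k` are toy assumptions chosen for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    """Toy MoE layer: always-active shared experts plus top-k routed experts."""

    def __init__(self, d_model=64, n_shared=8, n_routed=128, top_k=2):
        super().__init__()
        self.shared = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_shared)])
        self.routed = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_routed)])
        self.router = nn.Linear(d_model, n_routed)  # one routing score per routed expert
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        out = sum(expert(x) for expert in self.shared)    # shared experts see every token
        scores = F.softmax(self.router(x), dim=-1)        # routing probabilities per token
        top_w, top_idx = scores.topk(self.top_k, dim=-1)  # each token picks its top-k routed experts
        for slot in range(self.top_k):
            idx, w = top_idx[:, slot], top_w[:, slot:slot + 1]
            for expert_id, expert in enumerate(self.routed):
                mask = idx == expert_id                   # tokens routed to this expert in this slot
                if mask.any():
                    out[mask] = out[mask] + w[mask] * expert(x[mask])
        return out

tokens = torch.randn(10, 64)            # 10 token embeddings
print(SimpleMoELayer()(tokens).shape)   # torch.Size([10, 64])
```

Only the selected routed experts run for a given token, which is what keeps the active parameter count (and hence compute) much smaller than the total parameter count.
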
!!! Hint
    If you want to learn more about the MoE framework and models, you can refer to this [article](https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-mixture-of-experts).

<!-- ### Key Architectural Innovations
1. **Multi-Head Latent Attention (MLA)**
    - Compresses key-value cache into latent vectors (16x smaller than standard attention)
    - Achieves 48% faster inference compared to traditional attention mechanisms
    - Maintains 98.7% of original attention performance while reducing memory usage

2. **FP8 Mixed Precision Training**
    - Utilizes 8-bit floating point for matrix multiplications
    - Implements fine-grained quantization with:
        - Dynamic scaling factors for numerical stability
        - 16-bit accumulation for precision-critical operations
    - Reduces memory consumption by 37% during training

3. **Auxiliary-Loss-Free Load Balancing**
    - Novel bias adjustment strategy prevents expert overload
    - Achieves 93.4% expert utilization rate vs 78% in conventional MoE
    - Eliminates performance degradation from balancing constraints

4. **Expert Choice Routing**
    - Implements two-level selection process:
        - Each expert selects top-k tokens (k=2)
        - Each token selects top-2 experts from those that chose it
    - Achieves 4.2x better load balance than traditional routing

### Enhanced Training Infrastructure
- **Multi-Token Prediction Objective**
    - Predicts 6 future tokens simultaneously
    - Reduces training steps required by 27%
    - Improves code generation accuracy by 15%

- **Sparse MoE Layers**
    - Replaces 80% of dense FFN layers with MoE blocks
    - Achieves 4.9x higher throughput than dense architectures -->

### Performance Optimization
| Metric | DeepSeek-V3 | Conventional MoE |
@@ -91,7 +89,7 @@ mixing.

<figure markdown>
    ![](../imgs/nlp_deepseek_r1zero.png)
    <figcaption>AIME accuracy of DeepSeek-R1-Zero during training. For each question, [the authors] sample 16 responses and calculate the overall average accuracy to ensure a stable evaluation. [1]</figcaption>
</figure>

### 2. Cold Start for DeepSeek-R1
@@ -109,7 +107,7 @@ After the cold start, DeepSeek-R1 underwent large-scale RL training focused on e

### 4. Rejection Sampling and Supervised Fine-Tuning

Upon convergence of the reasoning-oriented RL, the researchers collected new Supervised Fine-Tuning (SFT) data through [rejection sampling](../machine_learning/interview_questions.md#what-is-rejection-sampling-in-machine-learning). This data included both reasoning and non-reasoning tasks, enhancing the model's general capabilities.

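The idea behind rejection sampling here is simple: sample many candidate responses from the converged RL checkpoint and keep only the ones that pass a quality check. Below is a minimal sketch of that loop; `generate` and `is_correct` are hypothetical placeholders standing in for the actual model call and verifier.

```python
import random

def generate(prompt: str) -> str:
    # placeholder for sampling one response from the converged RL checkpoint
    return random.choice(["<think>6 x 7 = 42</think> 42", "<think>6 x 7 = 41</think> 41"])

def is_correct(prompt: str, response: str) -> bool:
    # placeholder verifier, e.g. exact-match answer checking or a reward model
    return response.strip().endswith("42")

def collect_sft_data(prompts, n_samples=16):
    """Keep only candidates that pass the check (reject the rest) and use them as SFT pairs."""
    sft_pairs = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_samples)]
        kept = [c for c in candidates if is_correct(prompt, c)]
        if kept:
            sft_pairs.append({"prompt": prompt, "response": kept[0]})
    return sft_pairs

print(collect_sft_data(["What is 6 x 7?"]))
```
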
### 5. Reinforcement Learning for All Scenarios

@@ -119,6 +117,11 @@ The final stage involved another round of RL, this time aimed at improving the m

To make the advanced reasoning capabilities more accessible, the researchers distilled DeepSeek-R1's knowledge into smaller dense models based on Qwen and Llama architectures. For distilled models, the authors apply only SFT and do not include an RL stage, even though incorporating RL could substantially boost model performance.

!!! Note
    There is a major takeaway from this analysis regarding the efficiency of distillation (SFT) versus direct RL training (GRPO): transferring knowledge from advanced AI models to smaller versions ("distillation") often works better than training compact models (< 3B parameters) with resource-heavy reinforcement learning (RL), which demands massive computing power and still underperforms.

    In short, if your model is < 3B parameters and you have sufficient data, consider supervised finetuning over RL-based training.

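In practice, that distillation step is plain SFT on responses generated by the larger model. A hedged sketch with `trl`'s `SFTTrainer` is shown below; the trace file name is a placeholder, and the student model is just an example of a small dense model.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# hypothetical JSONL file of reasoning traces generated by the large teacher model
dataset = load_dataset("json", data_files="r1_reasoning_traces.jsonl", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # small dense student model (example choice)
    train_dataset=dataset,               # rows with a "text" (or "messages") field
    args=SFTConfig(output_dir="qwen-r1-distill"),
)
trainer.train()
```
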
## Experiments

The researchers conducted extensive evaluations of DeepSeek-R1 across a wide range of benchmarks, including:
@@ -161,15 +164,15 @@ DeepSeek-R1 demonstrated impressive performance across various benchmarks:

## Code

DeepSeek-R1-Zero exhibits an “aha moment” during training. This happened during the RL training phase, wherein the model allocates more thinking time to a problem by reevaluating its initial approach. This behavior showcases the model’s growing reasoning abilities and the unexpected sophistication of reinforcement learning outcomes. The algorithm credited for this moment is Group Relative Policy Optimization (GRPO). Based on this, there have been several attempts to replicate a similar moment using much smaller models.

In Mini-R1 [3], the author ([Philipp Schmid](https://www.philschmid.de/)) wanted to recreate the small "aha moment" of DeepSeek-R1 using Group Relative Policy Optimization (GRPO) and the Countdown Game. The aim was to train an open model (`Qwen-2.5-3B`) using reinforcement learning, trying to teach it self-verification and search abilities all on its own to solve the Countdown Game. For context, the Countdown game is a numbers puzzle where players use a set of randomly drawn numbers and basic arithmetic operations (+, -, ×, ÷) to reach or get as close as possible to a target number. In the end, the author was able to achieve 50% accuracy by the 450th step of training. One interesting point to note is that in this experiment, GRPO with two rule-based rewards demanded a lot of compute: 4 H100 GPUs for 6 hours over 450 training steps on a 3-billion-parameter model. This illustrates the hefty compute required for scaling reinforcement learning. Remember, DeepSeek's 671-billion-parameter model gained its performance after training for over 8,000 steps!

In [another attempt](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb), the author ([Will Brown](https://gist.github.com/willccbb)) tried to fine-tune the `Qwen2.5-1.5B-Instruct` model on a school math word problem dataset; this was optimised for Google Colab by [Anton](https://x.com/abacaj) for the `Qwen-2.5-0.5B` base model [here](https://colab.research.google.com/drive/1bfhs1FMLW3FGa8ydvkOZyBNxLYOu0Hev?usp=sharing#scrollTo=PYykgnUJ0BdB). This increased the base model's performance by ~10%, from 41.6% to ~51%, via GRPO.

GRPO is available in the `trl` Python package, and the process to finetune your model is as simple as shown below. [3]

```python linenums="1"
# install the packages
!pip install trl

@@ -218,10 +221,10 @@ trainer = GRPOTrainer(
trainer.train()
```

And that's it! It is recommended to refer to [3] for more specific details about training, and remember to continue training *(by increasing the steps)* until the model converges. Do save intermediate checkpoints while training.

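For completeness, the two rule-based rewards mentioned in the Mini-R1 experiment above could look roughly like the sketch below, written against `trl`'s reward-function convention for GRPO (each function receives the sampled `completions` plus dataset columns as keyword arguments and returns one score per completion, assuming a plain-text, non-conversational dataset). The `<think>`/`<answer>` format and the `target` column are assumptions modelled on the Countdown setup, not the exact Mini-R1 code.

```python
import re

def format_reward(completions, **kwargs):
    # 1.0 if the completion follows the expected <think>...</think><answer>...</answer> layout
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return [1.0 if re.match(pattern, c.strip(), re.DOTALL) else 0.0 for c in completions]

def correctness_reward(completions, target, **kwargs):
    # 1.0 if the equation inside <answer> evaluates to the target number (eval used for illustration only)
    rewards = []
    for completion, t in zip(completions, target):
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        try:
            rewards.append(1.0 if match and abs(eval(match.group(1)) - float(t)) < 1e-6 else 0.0)
        except Exception:
            rewards.append(0.0)
    return rewards

# these would be passed to GRPOTrainer as `reward_funcs=[format_reward, correctness_reward]`
```
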
!!! Hint
    All of the DeepSeek models are open-sourced and can be downloaded from the [DeepSeek page](https://huggingface.co/deepseek-ai) on HuggingFace.

## Limitations
