- Shared experts: 8 always-active generalist modules handling common patterns
- Routed experts: 128 specialized modules activated based on input content

This architecture enables 3.8x higher training efficiency compared to dense models while maintaining 97% quality retention. A toy sketch of this shared-plus-routed layout follows the hint below.

!!! Hint
    If you want to learn more about the MoE framework and models, you can refer to this [article](https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-mixture-of-experts).
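Below is a minimal PyTorch sketch of such a layer, with a couple of always-active shared experts and top-k routed experts. The class name, expert counts, and dimensions are toy values for illustration only, not DeepSeek-V3's actual configuration:

```python
import torch
import torch.nn as nn

class ToyDeepSeekMoE(nn.Module):
    """Toy MoE block: always-active shared experts plus top-k routed experts."""

    def __init__(self, d_model=512, n_shared=2, n_routed=16, top_k=2):
        super().__init__()
        def ffn():
            return nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
        self.shared_experts = nn.ModuleList([ffn() for _ in range(n_shared)])
        self.routed_experts = nn.ModuleList([ffn() for _ in range(n_routed)])
        self.router = nn.Linear(d_model, n_routed)  # routing scores per token
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        # shared experts see every token (the "always-active generalists")
        out = sum(expert(x) for expert in self.shared_experts)
        # each token additionally goes to its top-k routed specialists
        scores = self.router(x).softmax(dim=-1)                  # (num_tokens, n_routed)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        for slot in range(self.top_k):
            for expert_id in topk_idx[:, slot].unique():
                mask = topk_idx[:, slot] == expert_id
                weight = topk_scores[mask, slot].unsqueeze(-1)
                out[mask] = out[mask] + weight * self.routed_experts[int(expert_id)](x[mask])
        return out

layer = ToyDeepSeekMoE()
print(layer(torch.randn(8, 512)).shape)  # torch.Size([8, 512])
```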
<!--### Key Architectural Innovations

1. **Multi-Head Latent Attention (MLA)**
    - Compresses key-value cache into latent vectors (16x smaller than standard attention)
    - Achieves 48% faster inference compared to traditional attention mechanisms
    - Maintains 98.7% of original attention performance while reducing memory usage

2. **FP8 Mixed Precision Training**
    - Utilizes 8-bit floating point for matrix multiplications
    - Implements fine-grained quantization with:
        - Dynamic scaling factors for numerical stability
        - 16-bit accumulation for precision-critical operations
    - Reduces memory consumption by 37% during training

3. **Auxiliary-Loss-Free Load Balancing**
    - Achieves 93.4% expert utilization rate vs 78% in conventional MoE
    - Eliminates performance degradation from balancing constraints

4. **Expert Choice Routing**
    - Implements two-level selection process:
        - Each expert selects top-k tokens (k=2)
        - Each token selects top-2 experts from those that chose it
    - Achieves 4.2x better load balance than traditional routing

### Enhanced Training Infrastructure

- **Multi-Token Prediction Objective**
    - Predicts 6 future tokens simultaneously
    - Reduces training steps required by 27%
    - Improves code generation accuracy by 15%

- **Sparse MoE Layers**
    - Replaces 80% of dense FFN layers with MoE blocks
    - Achieves 4.9x higher throughput than dense architectures-->

### Performance Optimization

| Metric | DeepSeek-V3 | Conventional MoE |
<figure markdown>
![deepseek_r1_zero_accuracy_aime](../imgs/deepseek_r1_zero_accuracy_aime.png)
<figcaption>AIME accuracy of DeepSeek-R1-Zero during training. For each question, [the authors] sample 16 responses and calculate the overall average accuracy to ensure a stable evaluation. [1]</figcaption>
</figure>
### 2. Cold Start for DeepSeek-R1
### 4. Rejection Sampling and Supervised Fine-Tuning
Upon convergence of the reasoning-oriented RL, the researchers collected new Supervised Fine-Tuning (SFT) data through [rejection sampling](../machine_learning/interview_questions.md#what-is-rejection-sampling-in-machine-learning). This data included both reasoning and non-reasoning tasks, enhancing the model's general capabilities.
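As a rough illustration of what rejection sampling looks like in this context, the sketch below samples several candidate responses per prompt and keeps only those that pass a quality check; the survivors become new SFT examples. The `generate` and `is_acceptable` callables are hypothetical placeholders, not DeepSeek's actual pipeline:

```python
from typing import Callable, List

def rejection_sample_sft(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],  # hypothetical: sample n responses from the RL checkpoint
    is_acceptable: Callable[[str, str], bool],  # hypothetical: rule-based or model-based quality filter
    n_samples: int = 16,
) -> List[dict]:
    """Keep only generations that pass the filter and turn them into SFT pairs."""
    sft_examples = []
    for prompt in prompts:
        candidates = generate(prompt, n_samples)
        accepted = [c for c in candidates if is_acceptable(prompt, c)]
        if accepted:  # keep one accepted response per prompt (e.g. the first)
            sft_examples.append({"prompt": prompt, "response": accepted[0]})
    return sft_examples
```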
### 5. Reinforcement Learning for All Scenarios
To make the advanced reasoning capabilities more accessible, the researchers distilled DeepSeek-R1's knowledge into smaller dense models based on Qwen and Llama architectures. For the distilled models, the authors apply only SFT and do not include an RL stage, even though incorporating RL could substantially boost model performance.

!!! Note
    There is a major takeaway from this analysis regarding the efficiency of distillation (SFT) versus RL-based training (GRPO): transferring knowledge from advanced AI models to smaller versions ("distillation") often works better than training compact models (< 3B parameters) with resource-heavy reinforcement learning (RL), which demands massive computing power and still underperforms.

    In short, if your model is <3B parameters and you have sufficient data, consider supervised finetuning over RL-based training.
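For reference, such an SFT-based distillation run boils down to standard supervised finetuning on teacher-generated reasoning traces, e.g. with `trl`'s `SFTTrainer`. This is only a minimal sketch; the dataset name below is a hypothetical placeholder, not the recipe used for the official DeepSeek-R1 distilled models:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# hypothetical dataset of prompts + teacher (DeepSeek-R1) reasoning traces
dataset = load_dataset("your-org/r1-reasoning-traces", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",        # a small (<3B) student model
    train_dataset=dataset,                      # expects a "text" or "messages" column
    args=SFTConfig(output_dir="qwen-distill-sft"),
)
trainer.train()
```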
## Experiments
The researchers conducted extensive evaluations of DeepSeek-R1 across a wide range of benchmarks, including:
## Code
DeepSeek-R1-Zero exhibits an “aha moment” during training. This happened during the RL training phase, where the model learned to allocate more thinking time to a problem by reevaluating its initial approach. This behavior showcases the model’s growing reasoning abilities and the unexpected sophistication of reinforcement learning outcomes. The algorithm credited for this is Group Relative Policy Optimization (GRPO). Based on this, there have been several attempts to replicate a similar moment using much smaller models.
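At a high level, GRPO drops PPO's separate value model: for each prompt it samples a group of $G$ responses, scores them with the reward function, and normalizes each reward against the group statistics to obtain the advantage

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}
$$

which is then used in a PPO-style clipped objective with a KL penalty towards a reference policy [1].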
In Mini-R1 [3], the author ([Philipp Schmid](https://www.philschmid.de/)) wanted to recreate the small "aha moment" of DeepSeek-R1 using Group Relative Policy Optimization (GRPO) and the Countdown Game. The aim was to train an open model (`Qwen-2.5-3B`) with reinforcement learning, trying to teach it self-verification and search abilities all on its own, to solve the Countdown Game. For context, the Countdown game is a numbers puzzle where players use a set of randomly drawn numbers and basic arithmetic operations (+, -, ×, ÷) to reach, or get as close as possible to, a target number. In the end, the author was able to reach 50% accuracy by the 450th step of training. One interesting point to note is that in this experiment, GRPO with two rule-based rewards demanded a lot of compute: 4 H100 GPUs for 6 hours over 450 training steps on a 3-billion-parameter model. This illustrates the hefty compute required for scaling reinforcement learning; remember, DeepSeek's 671-billion-parameter model gained its performance only after training for over 8,000 steps!
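To give a flavour of what "two rule-based rewards" means here, the sketch below checks (a) that a completion follows a `<think>…</think><answer>…</answer>` format and (b) that the proposed equation uses the given numbers and evaluates to the target. It is a simplified illustration in the spirit of [3], not the exact reward code used there:

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion contains the expected <think>/<answer> structure, else 0.0."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.search(pattern, completion, re.DOTALL) else 0.0

def equation_reward(completion: str, numbers: list, target: int) -> float:
    """1.0 if the <answer> equation uses exactly the given numbers and hits the target."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if not match:
        return 0.0
    equation = match.group(1).strip()
    if not re.fullmatch(r"[\d+\-*/() .]+", equation):   # only digits and arithmetic symbols
        return 0.0
    used = [int(n) for n in re.findall(r"\d+", equation)]
    if sorted(used) != sorted(numbers):                 # each provided number used exactly once
        return 0.0
    try:
        return 1.0 if abs(eval(equation) - target) < 1e-6 else 0.0
    except (SyntaxError, ZeroDivisionError):
        return 0.0

print(equation_reward("<think>...</think><answer>(6*8)+2</answer>", [6, 8, 2], 50))  # 1.0
```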
In [another attempt](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb), the author ([Will Brown](https://gist.github.com/willccbb)) fine-tuned the `Qwen2.5-1.5B-Instruct` model on a school math word problem dataset; the script was later optimised for Google Colab by [Anton](https://x.com/abacaj) for the `Qwen-2.5-0.5B` base model [here](https://colab.research.google.com/drive/1bfhs1FMLW3FGa8ydvkOZyBNxLYOu0Hev?usp=sharing#scrollTo=PYykgnUJ0BdB). This increased the base model's performance by ~10%, from 41.6% to ~51%, via GRPO.
GRPO is implemented in the `trl` Python package, and the process to finetune your model is as simple as shown below [3]:
```python linenums="1"
# install the packages
!pip install trl
# NOTE: the dataset, model id, reward, and hyperparameters below are an
# illustrative minimal setup; refer to [3] for the exact configuration used in Mini-R1
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# GRPOTrainer expects a "prompt" column in the training dataset
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.rename_column("question", "prompt")

def reward_func(completions, **kwargs):
    # placeholder rule-based reward; swap in the format/correctness rewards from [3]
    return [1.0 if "####" in completion else 0.0 for completion in completions]

training_args = GRPOConfig(output_dir="qwen-grpo-checkpoints")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",
    reward_funcs=reward_func,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```
And that's it! It is recommended to refer to [3] for more specific details about training and remember to continue training *(by increasing the steps)* until the model converges. Do save intermediate checkpoints while training.

!!! Hint
    All of the DeepSeek models are open-sourced and can be downloaded from the [DeepSeek page](https://huggingface.co/deepseek-ai) on HuggingFace.