
Commit 0031d55

Authored by Ali Alshaarawy (ali-alshaar7)
Add Deepseek r1 distill llama models (#1922)
Co-authored-by: Ali Alshaarawy <[email protected]>
1 parent e7338f6 commit 0031d55

File tree

4 files changed: +53 −0 lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -142,6 +142,7 @@ Every model is written from scratch to maximize performance and remove layers of
 | Qwen2.5 Coder | 0.5B, 1.5B, 3B, 7B, 14B, 32B | Alibaba Group | [Hui, Binyuan et al. 2024](https://arxiv.org/abs/2409.12186) |
 | Qwen2.5 Math | 1.5B, 7B, 72B | Alibaba Group | [An, Yang et al. 2024](https://arxiv.org/abs/2409.12122) |
 | QwQ | 32B | Alibaba Group | [Qwen Team 2024](https://qwenlm.github.io/blog/qwq-32b-preview/) |
+| R1 Distill Llama | 8B, 70B | DeepSeek AI | [DeepSeek AI 2025](https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf) |
 | SmolLM2 | 135M, 360M, 1.7B | Hugging Face | [Hugging Face 2024](https://github.com/huggingface/smollm) |
 | Salamandra | 2B, 7B | Barcelona Supercomputing Centre | [BSC-LTC 2024](https://github.com/BSC-LTC/salamandra) |
 | StableCode | 3B | Stability AI | [Stability AI 2023](https://stability.ai/blog/stablecode-llm-generative-ai-coding) |

litgpt/config.py

Lines changed: 47 additions & 0 deletions
@@ -2400,5 +2400,52 @@ def norm_class(self) -> Type:
         copy["hf_config"]["name"] = c["hf_config"]["name"].format(kind)
         configs.append(copy)
 
+###############
+# DeepSeek R1 Distill
+###############
+
+r1_distill_llama = [
+    # https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/blob/main/config.json
+    dict(
+        name="R1-Distill-Llama-8B",
+        hf_config=dict(org="deepseek-ai", name="DeepSeek-R1-Distill-Llama-8B"),
+        block_size=131072,
+        vocab_size=128000,
+        padded_vocab_size=128256,
+        n_layer=32,
+        n_head=32,
+        n_query_groups=8,
+        rotary_percentage=1.0,
+        parallel_residual=False,
+        bias=False,
+        norm_class_name="RMSNorm",
+        mlp_class_name="LLaMAMLP",
+        intermediate_size=14336,
+        rope_base=500000,
+        rope_adjustments=dict(factor=8.0, low_freq_factor=1.0, high_freq_factor=4.0, original_max_seq_len=8192)
+    ),
+    # https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B/blob/main/config.json
+    dict(
+        name="R1-Distill-Llama-70B",
+        hf_config=dict(org="deepseek-ai", name="DeepSeek-R1-Distill-Llama-70B"),
+        block_size=131072,
+        vocab_size=128000,
+        padded_vocab_size=128256,
+        n_layer=80,
+        n_head=64,
+        n_embd=8192,
+        n_query_groups=8,
+        rotary_percentage=1.0,
+        parallel_residual=False,
+        bias=False,
+        norm_class_name="RMSNorm",
+        mlp_class_name="LLaMAMLP",
+        intermediate_size=28672,
+        rope_base=500000,
+        rope_adjustments=dict(factor=8.0, low_freq_factor=1.0, high_freq_factor=4.0, original_max_seq_len=8192)
+    ),
+]
+
+configs.extend(r1_distill_llama)
 
 name_to_config = {config["name"]: config for config in configs}
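The `rope_adjustments` entries in the new configs carry the Llama-3.1-style long-context RoPE rescaling parameters. As a minimal sketch of what those four values do (a hypothetical re-implementation for illustration, not litgpt's actual code): frequencies whose wavelength exceeds `original_max_seq_len / low_freq_factor` are divided by `factor`, those below `original_max_seq_len / high_freq_factor` are kept as-is, and the band in between is linearly interpolated.

```python
import math


def adjust_rope_inv_freq(inv_freq, factor=8.0, low_freq_factor=1.0,
                         high_freq_factor=4.0, original_max_seq_len=8192):
    """Illustrative Llama-3.1-style RoPE frequency adjustment (not litgpt code)."""
    low_freq_wavelen = original_max_seq_len / low_freq_factor    # 8192
    high_freq_wavelen = original_max_seq_len / high_freq_factor  # 2048
    adjusted = []
    for freq in inv_freq:
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:
            # high-frequency band: short wavelengths are left untouched
            adjusted.append(freq)
        elif wavelen > low_freq_wavelen:
            # low-frequency band: long wavelengths are scaled down by `factor`
            adjusted.append(freq / factor)
        else:
            # transition band: interpolate smoothly between the two regimes
            smooth = (original_max_seq_len / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor
            )
            adjusted.append((1 - smooth) * freq / factor + smooth * freq)
    return adjusted
```

With the defaults above, a frequency with wavelength 16384 (longer than 8192) ends up divided by 8, which is what stretches the effective context from 8192 toward the 131072 `block_size`.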

tests/test_model.py

Lines changed: 2 additions & 0 deletions
@@ -225,6 +225,8 @@ def test_against_original_open_llama_3b(device, dtype):
         {"name": "Llama-3.2-1B"},
         {"name": "Llama-3.2-3B"},
         {"name": "Llama-3.3-70B-Instruct"},
+        {"name": "R1-Distill-Llama-8B"},
+        {"name": "R1-Distill-Llama-70B"},
     ],
 )
 @pytest.mark.parametrize(
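The names added to this parametrized test resolve through the `name_to_config` dict built at the bottom of `litgpt/config.py`. A self-contained miniature of that registration pattern (with abbreviated dicts, standing in for the full ~16-field configs):

```python
# Miniature of the registration pattern in litgpt/config.py (illustrative only):
# each model family appends its config dicts to `configs`, and callers
# resolve a config by model name through `name_to_config`.
configs = []

r1_distill_llama = [
    # abbreviated stand-ins for the real entries added in this commit
    dict(name="R1-Distill-Llama-8B", n_layer=32, n_head=32, n_query_groups=8),
    dict(name="R1-Distill-Llama-70B", n_layer=80, n_head=64, n_query_groups=8),
]
configs.extend(r1_distill_llama)

name_to_config = {config["name"]: config for config in configs}

print(name_to_config["R1-Distill-Llama-8B"]["n_layer"])  # prints 32
```

Because the test is parametrized purely by name, adding the two dicts to `configs` is all that is needed for the new models to be picked up.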

tutorials/download_model_weights.md

Lines changed: 3 additions & 0 deletions
@@ -41,6 +41,7 @@ LitGPT supports a variety of LLM architectures with publicly available weights.
 | Qwen2.5 Coder | 0.5B, 1.5B, 3B, 7B, 14B, 32B | Alibaba Group | [Hui, Binyuan et al. 2024](https://arxiv.org/abs/2409.12186) |
 | Qwen2.5 Math | 1.5B, 7B, 72B | Alibaba Group | [An, Yang et al. 2024](https://arxiv.org/abs/2409.12122) |
 | QwQ | 32B | Alibaba Group | [Qwen Team 2024](https://qwenlm.github.io/blog/qwq-32b-preview/) |
+| R1 Distill Llama | 8B, 70B | DeepSeek AI | [DeepSeek AI 2025](https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf) |
 | RedPajama-INCITE | 3B, 7B | Together | [Together 2023](https://together.ai/blog/redpajama-models-v1) |
 | SmolLM2 | 135M, 360M, 1.7B | Hugging Face | [Hugging Face 2024](https://github.com/huggingface/smollm) |
 | StableCode | 3B | Stability AI | [Stability AI 2023](https://stability.ai/blog/stablecode-llm-generative-ai-coding) |
@@ -87,6 +88,8 @@ codellama/CodeLlama-7b-Python-hf
 databricks/dolly-v2-12b
 databricks/dolly-v2-3b
 databricks/dolly-v2-7b
+deepseek-ai/DeepSeek-R1-Distill-Llama-8B
+deepseek-ai/DeepSeek-R1-Distill-Llama-70B
 EleutherAI/pythia-1.4b
 EleutherAI/pythia-1.4b-deduped
 EleutherAI/pythia-12b
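The repo IDs added to this tutorial list are what users pass to the downloader, and each splits into the `org`/`name` pair that `config.py` stores in `hf_config`. A small illustrative sketch of that correspondence (hypothetical helper, not litgpt code):

```python
# The two Hugging Face repo IDs added to the tutorial in this commit:
repo_ids = [
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
]

# Each ID decomposes into the org/name pair kept in the config's hf_config.
hf_configs = [dict(zip(("org", "name"), rid.split("/", 1))) for rid in repo_ids]

print(hf_configs[0])  # {'org': 'deepseek-ai', 'name': 'DeepSeek-R1-Distill-Llama-8B'}
```

Note that the downloader path and the `hf_config` entries in the config diff agree on both halves, which is what keeps the tutorial list and `litgpt/config.py` in sync.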
