Commit 2348be2

Add Deploy DeepSeek R1 on Arm CPU learning path
Signed-off-by: Tianyu Li <[email protected]>
1 parent 706d1ac commit 2348be2

File tree

4 files changed

+519
-0
lines changed

Lines changed: 57 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,57 @@
1+
---
2+
title: Deploy DeepSeek R1 on Arm servers
3+
4+
minutes_to_complete: 30
5+
6+
who_is_this_for: This is an introductory topic for developers interested in running DeepSeek-R1 on Arm-based servers.
7+
8+
learning_objectives:
9+
- Download and build llama.cpp on your Arm server.
10+
- Download a pre-quantized DeepSeek-R1 model from Hugging Face.
11+
- Run the pre-quantized model on your Arm CPU and measure the performance.
12+
13+
prerequisites:
14+
    - An AWS Graviton4 r8g.16xlarge instance to test Arm performance optimizations, or any [Arm-based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider or an on-premises Arm server.
15+
16+
author:
17+
- Tianyu Li
18+
19+
### Tags
20+
skilllevels: Introductory
21+
subjects: ML
22+
armips:
23+
- Neoverse
24+
operatingsystems:
25+
- Linux
26+
tools_software_languages:
27+
- LLM
28+
- GenAI
29+
- Python
30+
31+
32+
further_reading:
33+
- resource:
34+
title: Getting started with DeepSeek-R1
35+
link: https://huggingface.co/deepseek-ai/DeepSeek-R1
36+
type: documentation
37+
- resource:
38+
title: Hugging Face Documentation
39+
link: https://huggingface.co/docs
40+
type: documentation
41+
- resource:
42+
title: Democratizing Generative AI with CPU-based inference
43+
link: https://blogs.oracle.com/ai-and-datascience/post/democratizing-generative-ai-with-cpu-based-inference
44+
type: blog
45+
- resource:
46+
title: DeepSeek-R1-GGUF
47+
link: https://huggingface.co/bartowski/DeepSeek-R1-GGUF
48+
type: website
49+
50+
51+
52+
### FIXED, DO NOT MODIFY
53+
# ================================================================================
54+
weight: 1 # _index.md always has weight of 1 to order correctly
55+
layout: "learningpathall" # All files under learning paths have this same wrapper
56+
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
57+
---
Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,8 @@
1+
---
2+
# ================================================================================
3+
# FIXED, DO NOT MODIFY THIS FILE
4+
# ================================================================================
5+
weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
6+
title: "Next Steps" # Always the same, html page title.
7+
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
8+
---
Lines changed: 258 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,258 @@
1+
---
2+
title: Run a DeepSeek R1 chatbot on Arm servers
3+
weight: 3
4+
5+
### FIXED, DO NOT MODIFY
6+
layout: learningpathall
7+
---
8+
9+
## Before you begin
10+
The instructions in this Learning Path are for any Arm server running Ubuntu 24.04 LTS. You need an Arm server instance with at least 64 cores and 512 GB of RAM to run this example. Configure at least 400 GB of disk storage. The instructions have been tested on an AWS Graviton4 r8g.16xlarge instance.
11+
12+
## Overview
13+
14+
Arm CPUs are widely used in traditional ML and AI use cases. In this Learning Path, you learn how to run generative AI inference use cases, such as an LLM chatbot, on Arm-based CPUs. You do this by deploying the [DeepSeek-R1 GGUF models](https://huggingface.co/bartowski/DeepSeek-R1-GGUF) on your Arm-based CPU using `llama.cpp`.
15+
16+
[llama.cpp](https://github.com/ggerganov/llama.cpp) is an open source C/C++ project developed by Georgi Gerganov that enables efficient LLM inference on a variety of hardware, both locally and in the cloud.
17+
18+
## About the DeepSeek-R1 model and GGUF model format
19+
20+
The [DeepSeek-R1 model](https://huggingface.co/deepseek-ai/DeepSeek-R1) from DeepSeek-AI is free to use for research and commercial purposes.
21+
22+
The DeepSeek-R1 model has 671 billion parameters and is based on a Mixture of Experts (MoE) architecture. Because only a subset of the experts is active for each token, MoE improves inference speed while maintaining output quality. This example uses the full 671 billion (671B) parameter model, which retains high-quality chatbot capability while still running efficiently on your Arm-based CPU.
23+
24+
Traditionally, the training and inference of LLMs has been done on GPUs using full-precision 32-bit (FP32) or half-precision 16-bit (FP16) data type formats for the model parameters and weights. Recently, a new binary model format called GGUF was introduced by the `llama.cpp` team. The GGUF model format uses compression and quantization techniques that remove the dependency on FP32 and FP16 data type formats. For example, GGUF supports quantization where model weights that are generally stored as FP16 data types are scaled down to 4-bit integers. This significantly reduces the computational resources and the amount of RAM required. These advancements in the model format and the data types used make Arm CPUs a great fit for running LLM inference.
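
To make the idea of 4-bit quantization concrete, here is a minimal Python sketch. It assumes a simplified symmetric scheme with one scale per block of 32 weights. It is illustrative only and is not the exact `Q4_0` encoding that `llama.cpp` uses; the function names are hypothetical.

```python
# Simplified block-wise 4-bit quantization (illustrative, not llama.cpp's Q4_0 code).
BLOCK_SIZE = 32  # assumption: weights are grouped into blocks of 32


def quantize_block(weights):
    """Map a block of floats to integers in [-8, 7] plus one float scale."""
    max_abs = max(abs(w) for w in weights)
    # Use a dummy scale for an all-zero block to avoid dividing by zero.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quants = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate float weights from the scale and the 4-bit integers."""
    return [scale * q for q in quants]


if __name__ == "__main__":
    block = [0.01 * (i - 16) for i in range(BLOCK_SIZE)]
    scale, quants = quantize_block(block)
    restored = dequantize_block(scale, quants)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"scale={scale:.5f}, worst-case error={worst:.5f}")
```

Each weight is stored as a small integer and the block shares a single scale, so the rounding error per weight is bounded by half the scale. That bounded error is the quality cost you accept in exchange for the smaller memory footprint.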
25+
26+
## Install dependencies
27+
28+
Install the following packages on your Arm-based server instance:
29+
30+
```bash
31+
sudo apt update
32+
sudo apt install make cmake -y
33+
```
34+
35+
You also need to install `gcc` on your machine:
36+
37+
```bash
38+
sudo apt install gcc g++ -y
39+
sudo apt install build-essential -y
40+
```
41+
42+
## Download and build llama.cpp
43+
44+
You are now ready to start building `llama.cpp`.
45+
46+
Clone the source repository for llama.cpp:
47+
48+
```bash
49+
git clone https://github.com/ggerganov/llama.cpp
50+
```
51+
52+
By default, `llama.cpp` builds for CPU only on Linux and Windows. You don't need to provide any extra switches to build it for the Arm CPU that you run it on.
53+
54+
Run `cmake` to build it:
55+
56+
```bash
57+
cd llama.cpp
58+
mkdir build
59+
cd build
60+
cmake .. -DCMAKE_CXX_FLAGS="-mcpu=native" -DCMAKE_C_FLAGS="-mcpu=native"
61+
cmake --build . -v --config Release -j `nproc`
62+
```
63+
64+
`llama.cpp` is now built in the `bin` directory.
65+
Check that `llama.cpp` has built correctly by running the help command:
66+
67+
```bash
68+
cd bin
69+
./llama-cli -h
70+
```
71+
72+
If `llama.cpp` has built correctly on your machine, you will see the help options being displayed. A snippet of the output is shown below:
73+
74+
```output
75+
usage: ./llama-cli [options]
76+
77+
general:
78+
79+
-h, --help, --usage print usage and exit
80+
--version show version and build info
81+
-v, --verbose print verbose information
82+
--verbosity N set specific verbosity level (default: 0)
83+
--verbose-prompt print a verbose prompt before generation (default: false)
84+
--no-display-prompt don't print prompt at generation (default: false)
85+
-co, --color colorise output to distinguish prompt and user input from generations (default: false)
86+
-s, --seed SEED RNG seed (default: -1, use random seed for < 0)
87+
-t, --threads N number of threads to use during generation (default: 4)
88+
-tb, --threads-batch N number of threads to use during batch and prompt processing (default: same as --threads)
89+
-td, --threads-draft N number of threads to use during generation (default: same as --threads)
90+
-tbd, --threads-batch-draft N number of threads to use during batch and prompt processing (default: same as --threads-draft)
91+
--draft N number of tokens to draft for speculative decoding (default: 5)
92+
-ps, --p-split N speculative decoding split probability (default: 0.1)
93+
-lcs, --lookup-cache-static FNAME
94+
path to static lookup cache to use for lookup decoding (not updated by generation)
95+
-lcd, --lookup-cache-dynamic FNAME
96+
path to dynamic lookup cache to use for lookup decoding (updated by generation)
97+
-c, --ctx-size N size of the prompt context (default: 0, 0 = loaded from model)
98+
-n, --predict N number of tokens to predict (default: -1, -1 = infinity, -2 = until context filled)
99+
-b, --batch-size N logical maximum batch size (default: 2048)
100+
```
101+
102+
103+
## Install Hugging Face Hub
104+
105+
There are a few different ways you can download the DeepSeek-R1 model. In this Learning Path, you download the model from Hugging Face.
106+
107+
[Hugging Face](https://huggingface.co/) is an open source AI community where you can host your own AI models, train them and collaborate with others in the community. You can browse through the thousands of models that are available for a variety of use cases like NLP, audio, and computer vision.
108+
109+
The `huggingface_hub` library provides APIs and tools that let you easily download and fine-tune pre-trained models. You will use `huggingface-cli` to download the [DeepSeek-R1 model](https://huggingface.co/bartowski/DeepSeek-R1-GGUF).
110+
111+
Install the required Python packages:
112+
113+
```bash
114+
sudo apt install python-is-python3 python3-pip python3-venv -y
115+
```
116+
117+
Create and activate a Python virtual environment:
118+
119+
```bash
120+
python -m venv venv
121+
source venv/bin/activate
122+
```
123+
124+
Your terminal prompt now has the `(venv)` prefix indicating the virtual environment is active. Use this virtual environment for the remaining commands.
125+
126+
Install the `huggingface_hub` Python library using `pip`:
127+
128+
```bash
129+
pip install huggingface_hub
130+
```
131+
132+
You can now download the model using `huggingface-cli`:
133+
134+
```bash
135+
huggingface-cli download bartowski/DeepSeek-R1-GGUF --include "*DeepSeek-R1-Q4_0*" --local-dir DeepSeek-R1-Q4_0
136+
```
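
If you prefer to script the download, the `huggingface_hub` library you just installed also exposes a Python API. The following is a minimal sketch. It assumes that `snapshot_download` with `allow_patterns` selects the same files as the `--include` option above; the helper names `is_selected` and `download_model` are illustrative.

```python
from fnmatch import fnmatch

REPO_ID = "bartowski/DeepSeek-R1-GGUF"
PATTERN = "*DeepSeek-R1-Q4_0*"
LOCAL_DIR = "DeepSeek-R1-Q4_0"


def is_selected(filename, pattern=PATTERN):
    """Return True if a repository file name matches the download pattern."""
    return fnmatch(filename, pattern)


def download_model():
    """Download only the Q4_0 files into LOCAL_DIR and return the local path."""
    # Imported here so the pattern helper works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=[PATTERN],
        local_dir=LOCAL_DIR,
    )
```

Call `download_model()` from inside the virtual environment to start the download. The files are large, so expect the download to take a while.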
137+
Before you proceed and run this model, take a quick look at what `Q4_0` in the model name denotes.
138+
139+
## Quantization format
140+
141+
`Q4_0` in the model name refers to the quantization method the model uses. The goal of quantization is to make the model smaller (reducing the memory space required) and faster (reducing the memory bandwidth bottleneck of transferring large amounts of data from memory to the processor). The primary trade-off to keep in mind when reducing a model's size is maintaining output quality. Ideally, a model is quantized to meet size and speed requirements without a noticeable loss in quality.
142+
143+
The first file of this model is `DeepSeek-R1-Q4_0-00001-of-00010.gguf`. So what does each component of the name mean in relation to the quantization level? The main thing to note is the number of bits per parameter, which is denoted by `Q4`, meaning 4-bit integers. By using only 4 bits per parameter for 671 billion parameters, the model shrinks to 354 GB.
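
As a rough sanity check on that size, you can estimate it from the parameter count. This sketch assumes the `Q4_0` block layout in `llama.cpp`, in which each block of 32 weights stores 32 4-bit values plus one 16-bit scale, giving about 4.5 bits per weight. Treat the result as an estimate only: it ignores file metadata and any tensors kept at a different precision.

```python
PARAMS = 671e9          # parameter count quoted in this Learning Path
BLOCK_WEIGHTS = 32      # assumption: weights per Q4_0 block
BITS_PER_QUANT = 4      # 4-bit integer per weight
SCALE_BITS = 16         # assumption: one FP16 scale per block


def bits_per_weight():
    """Average storage cost per weight, including the shared block scale."""
    return (BLOCK_WEIGHTS * BITS_PER_QUANT + SCALE_BITS) / BLOCK_WEIGHTS


def estimated_size_bytes(params=PARAMS):
    """Estimated size in bytes if every parameter were stored as Q4_0."""
    return params * bits_per_weight() / 8


if __name__ == "__main__":
    size = estimated_size_bytes()
    print(f"bits per weight: {bits_per_weight():.2f}")
    print(f"FP16 size:  {PARAMS * 2 / 1e9:.0f} GB")
    print(f"Q4_0 size: ~{size / 1e9:.0f} GB ({size / 2**30:.0f} GiB)")
```

The estimate lands in the same range as the size quoted above and is roughly a quarter of what the same parameters would need in FP16.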
144+
145+
## Run the pre-quantized DeepSeek-R1 LLM model weights on your Arm-based server
146+
147+
As of [llama.cpp commit 0f1a39f3](https://github.com/ggerganov/llama.cpp/commit/0f1a39f3), Arm has contributed code for performance optimization with three types of GEMV/GEMM kernels corresponding to three processor types:
148+
149+
* AWS Graviton2, where you only have NEON support (you will see less improvement for these GEMV/GEMM kernels),
150+
* AWS Graviton3, where the GEMV/GEMM kernels exploit both SVE 256 and MATMUL_INT8 support, and
151+
* AWS Graviton4, where the GEMV/GEMM kernels exploit NEON/SVE 128 and MATMUL_INT8 support.
152+
153+
With the latest commits in `llama.cpp` you will see improvements for these Arm optimized kernels directly on your Arm-based server. You can run the pre-quantized Q4_0 model as is and do not need to re-quantize the model.
154+
155+
Run the pre-quantized DeepSeek-R1 model exactly as the weights were downloaded from Hugging Face:
156+
157+
```bash
158+
./llama-cli -m DeepSeek-R1-Q4_0-00001-of-00010.gguf -no-cnv --temp 0.6 -t 64 --prompt "<|User|>Building a visually appealing website can be done in ten simple steps:<|Assistant|>" -n 512
159+
```
160+
161+
This command uses the downloaded model (`-m` flag), explicitly disables conversation mode (`-no-cnv` flag), sets the randomness of the generated text (`--temp` flag), passes the specified prompt (`--prompt` flag), targets a 512-token completion (`-n` flag), and uses 64 threads (`-t` flag).
162+
163+
You may notice that the download consists of several GGUF files. llama.cpp loads the whole series when you pass the first file with the `-m` flag.
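
Before starting a long model load, it can help to confirm that every file in the series is present. The sketch below assumes the files follow the `-NNNNN-of-NNNNN.gguf` naming shown above. The helper name `missing_shards` is hypothetical and is not part of llama.cpp.

```python
import re
from pathlib import Path

# Matches the "-00001-of-00010.gguf" suffix used by split GGUF files.
SHARD_RE = re.compile(r"-(\d{5})-of-(\d{5})\.gguf$")


def missing_shards(filenames):
    """Return the sorted shard indices absent from a list of GGUF file names."""
    found, total = set(), 0
    for name in filenames:
        match = SHARD_RE.search(name)
        if match:
            found.add(int(match.group(1)))
            total = max(total, int(match.group(2)))
    return sorted(set(range(1, total + 1)) - found)


if __name__ == "__main__":
    # Assumption: run from the directory that holds the downloaded files.
    names = [p.name for p in Path(".").glob("*.gguf")]
    if not names:
        print("no .gguf files found")
    else:
        gaps = missing_shards(names)
        print(f"missing shards: {gaps}" if gaps else "all shards present")
```

The check only compares file names, so it does not detect a partially downloaded file.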
164+
165+
You will see lots of interesting statistics being printed from llama.cpp about the model and the system, followed by the prompt and completion. The tail of the output from running this model on an AWS Graviton4 r8g.16xlarge instance is shown below:
166+
167+
```output
168+
build: 4879 (f08f4b31) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for aarch64-linux-gnu
169+
...
170+
load_tensors: CPU_Mapped model buffer size = 35048.27 MiB
171+
load_tensors: CPU_Mapped model buffer size = 22690.15 MiB
172+
....................................................................................................
173+
llama_init_from_model: n_seq_max = 1
174+
llama_init_from_model: n_ctx = 4096
175+
llama_init_from_model: n_ctx_per_seq = 4096
176+
llama_init_from_model: n_batch = 2048
177+
llama_init_from_model: n_ubatch = 512
178+
llama_init_from_model: flash_attn = 0
179+
llama_init_from_model: freq_base = 10000.0
180+
llama_init_from_model: freq_scale = 0.025
181+
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
182+
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 0
183+
llama_kv_cache_init: CPU KV buffer size = 19520.00 MiB
184+
llama_init_from_model: KV self size = 19520.00 MiB, K (f16): 11712.00 MiB, V (f16): 7808.00 MiB
185+
llama_init_from_model: CPU output buffer size = 0.49 MiB
186+
llama_init_from_model: CPU compute buffer size = 1186.01 MiB
187+
llama_init_from_model: graph nodes = 5025
188+
llama_init_from_model: graph splits = 1
189+
common_init_from_params: KV cache shifting is not supported for this model, disabling KV cache shifting
190+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
191+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
192+
main: llama threadpool init, n_threads = 64
193+
194+
system_info: n_threads = 64 (n_threads_batch = 64) / 64 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | SVE = 1 | DOTPROD = 1 | SVE_CNT = 16 | OPENMP = 1 | AARCH64_REPACK = 1 |
195+
196+
sampler seed: 3199001937
197+
sampler params:
198+
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
199+
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
200+
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.600
201+
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
202+
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
203+
generate: n_ctx = 4096, n_batch = 2048, n_predict = 512, n_keep = 1
204+
205+
<|User|>Building a visually appealing website can be done in ten simple steps:<think>
206+
Okay, the user wants to build a visually appealing website in ten simple steps. Let me think about how to break this down. First, I need to outline the key stages of web design without getting too technical. Starting with defining the purpose and audience makes sense because that sets the foundation. Then, choosing the right tools like website builders or coding from scratch. Maybe mention popular platforms like WordPress or Wix as examples.
207+
208+
Next, planning the site structure. Sitemaps and user flow are important here. I should emphasize the importance of clear navigation to avoid confusion. Then, selecting a color scheme and typography. Maybe explain how colors affect user emotions and the importance of readability with fonts. Don't forget accessibility here, like contrast ratios and font sizes.
209+
210+
Responsive design is a must these days, so I need to stress mobile optimization. Maybe mention frameworks like Bootstrap or media queries if coding. Visual hierarchy comes next – using size, spacing, and contrast to guide users. High-quality visuals are crucial, so suggest using stock photos or custom graphics, and optimizing for web to keep load times down.
211+
212+
Whitespace is often overlooked but important for a clean look. Then, call-to-action buttons – their placement and design. Testing across devices and browsers is a key step before launch. Finally, ongoing maintenance and SEO basics. Wait, the user mentioned ten steps. Let me count again.
213+
214+
1. Define purpose and audience
215+
2. Choose tools
216+
3. Plan structure
217+
4. Select colors and fonts
218+
5. Responsive design
219+
6. Visual hierarchy
220+
7. High-quality visuals
221+
8. Whitespace
222+
9. CTAs
223+
10. Test and optimize
224+
225+
That's ten. But maybe step 10 should include maintenance and SEO. Hmm. Maybe combine testing with ongoing SEO as part of the final step. Also, should I mention content creation? Maybe under high-quality visuals. Need to keep each step concise. Also, check if any steps can be merged or rephrased for clarity. Let me make sure each step is actionable and not too vague. Add examples where possible, like suggesting tools or resources. Accessibility should be mentioned in color contrast and typography. Alright, that should cover it without overwhelming the user.
226+
</think>
227+
228+
Here’s a streamlined 10-step guide to building a visually appealing and functional website:
229+
230+
---
231+
232+
### **1. Define Purpose & Audience**
233+
- **Goal:** Clarify the site’s objective (e.g., portfolio, e-commerce, blog).
234+
- **Audience
235+
236+
llama_perf_sampler_print: sampling time = 38.85 ms / 532 runs ( 0.07 ms per token, 13694.75 tokens per second)
237+
llama_perf_context_print: load time = 1061927.81 ms
238+
llama_perf_context_print: prompt eval time = 6585.21 ms / 20 tokens ( 329.26 ms per token, 3.04 tokens per second)
239+
llama_perf_context_print: eval time = 47463.45 ms / 511 runs ( 92.88 ms per token, 10.77 tokens per second)
240+
llama_perf_context_print: total time = 54172.15 ms / 531 tokens
241+
```
242+
243+
The `system_info` printed from llama.cpp highlights important architectural features present on your hardware that improve the performance of the model execution. In the output shown above from running on an AWS Graviton4 instance, you will see:
244+
245+
* `NEON = 1`: This flag indicates support for Arm's Neon technology, which is an implementation of the Advanced SIMD instructions.
246+
* `ARM_FMA = 1`: This flag indicates support for Arm Floating-point Multiply and Accumulate instructions.
247+
* `MATMUL_INT8 = 1`: This flag indicates support for Arm int8 matrix multiplication instructions.
248+
* `SVE = 1`: This flag indicates support for the Arm Scalable Vector Extension.
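
You can cross-check these flags against what the Linux kernel reports. The following sketch assumes the usual AArch64 feature names on the `Features` line of `/proc/cpuinfo` (`asimd` for Neon, `asimddp` for dot product, `i8mm` for int8 matrix multiplication, and `sve`). The mapping to the llama.cpp labels is an illustration, not something llama.cpp exposes.

```python
# Assumed mapping from llama.cpp system_info labels to Linux feature names.
FEATURE_NAMES = {
    "NEON": "asimd",
    "DOTPROD": "asimddp",
    "MATMUL_INT8": "i8mm",
    "SVE": "sve",
}


def parse_features(cpuinfo_text):
    """Collect the feature names listed on 'Features' lines of /proc/cpuinfo."""
    features = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "Features":
            features.update(value.split())
    return features


def supported(cpuinfo_text):
    """Return {label: bool} for each feature in FEATURE_NAMES."""
    features = parse_features(cpuinfo_text)
    return {label: name in features for label, name in FEATURE_NAMES.items()}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not on Linux, so every feature reports 0
    for label, present in supported(text).items():
        print(f"{label} = {int(present)}")
```

If a label prints `0` here but you expected hardware support, check that `llama.cpp` was built on the same machine with `-mcpu=native`, as in the build step earlier.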
249+
250+
251+
The end of the output shows several model timings:
252+
253+
* load time refers to the time taken to load the model.
254+
* prompt eval time refers to the time taken to process the prompt before generating the new text.
255+
* eval time refers to the time taken to generate the output. Generally anything above 10 tokens per second is faster than what humans can read.
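
The tokens-per-second figures follow directly from the reported times and token counts. As a quick check, this sketch recomputes the rates from the `llama_perf_context_print` lines shown above. The regular expression is written for this particular output and is an assumption; the log format may differ between llama.cpp builds.

```python
import re

# Matches, for example, "eval time = 47463.45 ms / 511 runs".
PERF_RE = re.compile(
    r"(prompt eval|eval) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) (?:tokens|runs)"
)


def tokens_per_second(log_text):
    """Return {stage: tokens per second} for the prompt eval and eval stages."""
    rates = {}
    for stage, ms, count in PERF_RE.findall(log_text):
        rates[stage] = int(count) / (float(ms) / 1000.0)
    return rates


if __name__ == "__main__":
    sample = (
        "llama_perf_context_print: prompt eval time = 6585.21 ms / 20 tokens\n"
        "llama_perf_context_print:        eval time = 47463.45 ms / 511 runs\n"
    )
    for stage, rate in tokens_per_second(sample).items():
        print(f"{stage}: {rate:.2f} tokens per second")
```

The eval rate is the one that determines how quickly text appears during generation, while the prompt eval rate matters most for long prompts.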
256+
257+
You have successfully run an LLM chatbot with Arm KleidiAI optimizations, all running on your Arm AArch64 CPU on your server. You can continue experimenting and trying out the model with different prompts.
258+
