content/learning-paths/servers-and-cloud-computing/deepseek-cpu/_index.md (4 additions, 4 deletions)
@@ -7,15 +7,15 @@ cascade:
minutes_to_complete: 30

- who_is_this_for: This is an introductory topic for developers interested in running DeepSeek-R1 on Arm-based servers.
+ who_is_this_for: This Learning Path is for developers who want to run DeepSeek-R1 on Arm-based servers.

learning_objectives:
- - Download and build llama.cpp on your Arm-based server.
+ - Clone and build llama.cpp on your Arm-based server.
  - Download a pre-quantized DeepSeek-R1 model from Hugging Face.
- - Run the pre-quantized model on your Arm CPU and measure the performance.
+ - Run the model on your Arm CPU and benchmark its performance.

prerequisites:
- - An [Arm-based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider or an on-premise Arm server. This Learning Path was tested on an AWS Graviton4 r8g.24xlarge instance.
+ - An [Arm-based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud provider or an on-premise Arm server. This Learning Path was tested on an AWS Graviton4 r8g.24xlarge instance.
content/learning-paths/servers-and-cloud-computing/deepseek-cpu/deepseek-chatbot.md (21 additions, 19 deletions)
@@ -6,12 +6,10 @@ weight: 3
layout: learningpathall
---

- ## Before you begin
- The instructions in this Learning Path are for any Arm server running Ubuntu 24.04 LTS. You need an Arm server instance with at least 64 cores and 512GB of RAM to run this example. Configure disk storage up to at least 400 GB. The instructions have been tested on an AWS Graviton4 r8g.24xlarge instance.
-
-

## Background and what you'll build

+ The instructions in this Learning Path are for any Arm server running Ubuntu 24.04 LTS. You need an Arm server instance with at least 64 cores and 512GB of RAM to run this example. Configure disk storage up to at least 400 GB. The instructions have been tested on an AWS Graviton4 r8g.24xlarge instance.
+

Arm CPUs are widely used in ML and AI use cases. In this Learning Path, you will learn how to run a generative AI inference-based use case of a LLM chatbot on Arm-based CPUs by deploying the [DeepSeek-R1 671B LLM](https://huggingface.co/bartowski/DeepSeek-R1-GGUF) on your Arm-based CPU using `llama.cpp`, optimized for Arm hardware. You'll:

- Build and run `llama.cpp` with Arm-specific performance improvements.
@@ -22,17 +20,17 @@ Arm CPUs are widely used in ML and AI use cases. In this Learning Path, you will
[llama.cpp](https://github.com/ggerganov/llama.cpp) is an open source C/C++ project developed by Georgi Gerganov that enables efficient LLM inference on a variety of hardware - both locally, and in the cloud.

- ## About the DeepSeek-R1 model and GGUF model format
+ ## Understanding the DeepSeek-R1 model and GGUF format

The [DeepSeek-R1 model](https://huggingface.co/deepseek-ai/DeepSeek-R1) from DeepSeek-AI available on Hugging Face, is released under the [MIT License](https://github.com/deepseek-ai/DeepSeek-R1/blob/main/LICENSE) and free to use for research and commercial purposes.

The DeepSeek-R1 model has 671 billion parameters, based on Mixture of Experts(MoE) architecture. This improves inference speed and maintains model quality. For this example, the full 671 billion (671B) model is used for retaining quality chatbot capability while also running efficiently on your Arm-based CPU.

Traditionally, the training and inference of LLMs has been done on GPUs using full-precision 32-bit (FP32) or half-precision 16-bit (FP16) data type formats for the model parameter and weights. Recently, a new binary model format called GGUF was introduced by the `llama.cpp` team. This new GGUF model format uses compression and quantization techniques that remove the dependency on using FP32 and FP16 data type formats. For example, GGUF supports quantization where model weights that are generally stored as FP16 data types are scaled down to 4-bit integers. This significantly reduces the need for computational resources and the amount of RAM required. These advancements made in the model format and the data types used make Arm CPUs a great fit for running LLM inferences.
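To make the memory savings concrete, here is a rough back-of-the-envelope comparison of my own (ignoring per-block scale factors, higher-precision tensors, and file metadata):

$$
671 \times 10^{9}\ \text{params} \times 2\ \text{bytes (FP16)} \approx 1.34\ \text{TB}
\qquad \text{vs.} \qquad
671 \times 10^{9}\ \text{params} \times 0.5\ \text{bytes (4-bit)} \approx 336\ \text{GB}
$$

That difference is what brings a 671B-parameter model from well beyond typical single-node memory down toward the 512GB of RAM this Learning Path calls for.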

- ## Install dependencies
+ ## Install build dependencies on your Arm-based server

- Install the following packages on your Arm based server instance:
+ Install the following packages:

```bash
sudo apt update
@@ -46,7 +44,7 @@ sudo apt install gcc g++ -y
sudo apt install build-essential -y
```

- ## Download and build llama.cpp
+ ## Clone and build llama.cpp

You are now ready to start building `llama.cpp`.
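The build commands themselves live in the unchanged part of the file and are not visible in this diff; as a rough sketch of what a typical clone-and-build of llama.cpp looks like (the Learning Path's own steps and CMake flags are authoritative):

```bash
# Sketch only: a standard CMake build of llama.cpp; the Learning Path may use additional flags.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j"$(nproc)"
```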
@@ -107,7 +105,7 @@ general:
```

- ## Install Hugging Face Hub
+ ## Set up Hugging Face and download the model

There are a few different ways you can download the DeepSeek-R1 model. In this Learning Path, you download the model from Hugging Face.
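The download commands themselves are outside this hunk; one common route is the `huggingface_hub` CLI, sketched below. The repository ID matches the model card linked earlier, but the include pattern and target directory are assumptions — follow the Learning Path's actual instructions:

```bash
# Sketch only: fetch the Q4_0 GGUF split files; the include pattern and local directory are placeholders.
pip install huggingface_hub
huggingface-cli download bartowski/DeepSeek-R1-GGUF \
  --include "DeepSeek-R1-Q4_0/*" \
  --local-dir ./DeepSeek-R1
```
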
Before you proceed and run this model, take a quick look at what `Q4_0` in the model name denotes.

- ## Quantization format
+ ## Understanding the Quantization format

`Q4_0` in the model name refers to the quantization method the model uses. The goal of quantization is to make the model smaller (reducing the memory space required) and faster (reducing the memory bandwidth bottlenecks caused by transferring large amounts of data from memory to a processor). The primary trade-off to keep in mind when reducing a model's size is maintaining quality of performance. Ideally, a model is quantized to meet size and speed requirements while not having a negative impact on performance.
This model is `DeepSeek-R1-Q4_0-00001-of-00010.gguf`, so what does each component mean in relation to the quantization level? The main thing to note is the number of bits per parameter, denoted by 'Q4' in this case, meaning each parameter is stored as a 4-bit integer. As a result, by using only 4 bits per parameter for 671 billion parameters, the model shrinks to roughly 354 GB in size.
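As a sanity check on that figure (my own estimate, assuming llama.cpp's Q4_0 layout of one FP16 scale per block of 32 4-bit weights, roughly 4.5 bits per weight):

$$
671 \times 10^{9}\ \text{params} \times \frac{4.5\ \text{bits}}{8\ \text{bits/byte}} \approx 377\ \text{GB} \approx 352\ \text{GiB}
$$

which lands in the same ballpark as the quoted 354 GB; the exact size also depends on which tensors are kept at higher precision and on file metadata.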

- ## Run the pre-quantized DeepSeek-R1 LLM model weights on your Arm-based server
+ ## Run the DeepSeek-R1 Chatbot on your Arm server

As of [llama.cpp commit 0f1a39f3](https://github.com/ggerganov/llama.cpp/commit/0f1a39f3), Arm has contributed code for performance optimization with three types of GEMV/GEMM kernels corresponding to three processor types:

- * AWS Graviton2, where you only have NEON support (you will see less improvement for these GEMV/GEMM kernels),
- * AWS Graviton3, where the GEMV/GEMM kernels exploit both SVE 256 and MATMUL INT8 support, and
- * AWS Graviton4, where the GEMV/GEMM kernels exploit NEON/SVE 128 and MATMUL_INT8 support
+ * AWS Graviton2, where you only have NEON support (you will see less improvement for these GEMV/GEMM kernels).
+ * AWS Graviton3, where the GEMV/GEMM kernels exploit both SVE 256 and MATMUL INT8 support.
+ * AWS Graviton4, where the GEMV/GEMM kernels exploit NEON/SVE 128 and MATMUL_INT8 support.

With the latest commits in `llama.cpp` you will see improvements for these Arm optimized kernels directly on your Arm-based server. You can run the pre-quantized Q4_0 model as is and do not need to re-quantize the model.
@@ -167,7 +165,9 @@ Run the pre-quantized DeepSeek-R1 model exactly as the weights were downloaded f
This command will use the downloaded model (`-m` flag), disable conversation mode explicitly (`-no-cnv` flag), adjust the randomness of the generated text (`--temp` flag), with the specified prompt (`-p` flag), and target a 512 token completion (`-n` flag), using 64 threads (`-t` flag).
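The command itself sits in the unchanged part of the file and is not shown in this hunk; an invocation along the following lines would exercise exactly those flags (the binary path, temperature value, and prompt are placeholders — use the Learning Path's actual command):

```bash
# Sketch only: placeholder paths and values; the flags mirror those described above.
./build/bin/llama-cli \
  -m DeepSeek-R1-Q4_0-00001-of-00010.gguf \
  -no-cnv \
  --temp 0.6 \
  -p "Your prompt here" \
  -n 512 \
  -t 64
```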

- You may notice there are many gguf files downloaded, llama.cpp can load all series of files by passing the first one with `-m` flag.
+ You might notice there are many gguf files downloaded. Llama.cpp can load all series of files by passing the first one with `-m` flag.
+
+ ## Analyze the output and performance statistics

You will see lots of interesting statistics being printed from llama.cpp about the model and the system, followed by the prompt and completion. The tail of the output from running this model on an AWS Graviton4 r8g.24xlarge instance is shown below:
@@ -380,10 +380,10 @@ llama_perf_context_print: total time = 42340.53 ms / 531 tokens
The `system_info` printed from llama.cpp highlights important architectural features present on your hardware that improve the performance of the model execution. In the output shown above from running on an AWS Graviton4 instance, you will see:

- * NEON = 1 This flag indicates support for Arm's Neon technology which is an implementation of the Advanced SIMD instructions
- * ARM_FMA = 1 This flag indicates support for Arm Floating-point Multiply and Accumulate instructions
- * MATMUL_INT8 = 1 This flag indicates support for Arm int8 matrix multiplication instructions
- * SVE = 1 This flag indicates support for the Arm Scalable Vector Extension
+ * NEON = 1 This flag indicates support for Arm's Neon technology which is an implementation of the Advanced SIMD instructions.
+ * ARM_FMA = 1 This flag indicates support for Arm Floating-point Multiply and Accumulate instructions.
+ * MATMUL_INT8 = 1 This flag indicates support for Arm int8 matrix multiplication instructions.
+ * SVE = 1 This flag indicates support for the Arm Scalable Vector Extension.

The end of the output shows several model timings:
@@ -392,5 +392,7 @@ The end of the output shows several model timings:
* prompt eval time refers to the time taken to process the prompt before generating the new text.
* eval time refers to the time taken to generate the output. Generally anything above 10 tokens per second is faster than what humans can read.
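As a quick illustration of turning these timings into a rate — using the total-time line visible in the hunk header above, which mixes prompt and generated tokens, so treat it only as a rough aggregate:

$$
\frac{531\ \text{tokens}}{42.34\ \text{s}} \approx 12.5\ \text{tokens/s}
$$

comfortably above the roughly 10 tokens per second reading-speed threshold mentioned in the eval time bullet.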

+ ## What's next?
+
You have successfully run a LLM chatbot with Arm KleidiAI optimizations, all running on your Arm AArch64 CPU on your server. You can continue experimenting and trying out the model with different prompts.
content/learning-paths/servers-and-cloud-computing/deepseek-cpu/deepseek-server.md (12 additions, 3 deletions)
@@ -6,6 +6,8 @@ weight: 4
layout: learningpathall
---

+ ## Start the LLM server with llama.cpp
+

You can use the `llama.cpp` server program and submit requests using an OpenAI-compatible API.
This enables applications to be created which access the LLM multiple times without starting and stopping it. You can also access the server over the network to another machine hosting the LLM.
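Only the opening of this page appears in the diff, so the server command and example requests are not visible here; as a sketch of the kind of OpenAI-compatible request being described (host, port, and payload are assumptions — llama.cpp's `llama-server` exposes a `/v1/chat/completions` endpoint):

```bash
# Sketch only: assumes llama-server is already running and listening on localhost:8080.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Give me three facts about Arm Neoverse CPUs."}
        ],
        "temperature": 0.6,
        "max_tokens": 256
      }'
```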