Commit a34a45f

Editorial review.
1 parent 4374978 commit a34a45f

3 files changed: +37 / -26 lines

content/learning-paths/servers-and-cloud-computing/deepseek-cpu/_index.md

Lines changed: 4 additions & 4 deletions
@@ -7,15 +7,15 @@ cascade:

 minutes_to_complete: 30

-who_is_this_for: This is an introductory topic for developers interested in running DeepSeek-R1 on Arm-based servers.
+who_is_this_for: This Learning Path is for developers who want to run DeepSeek-R1 on Arm-based servers.

 learning_objectives:
-- Download and build llama.cpp on your Arm-based server.
+- Clone and build llama.cpp on your Arm-based server.
 - Download a pre-quantized DeepSeek-R1 model from Hugging Face.
-- Run the pre-quantized model on your Arm CPU and measure the performance.
+- Run the model on your Arm CPU and benchmark its performance.

 prerequisites:
-- An [Arm-based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider or an on-premise Arm server. This Learning Path was tested on an AWS Graviton4 r8g.24xlarge instance.
+- An [Arm-based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud provider or an on-premise Arm server. This Learning Path was tested on an AWS Graviton4 r8g.24xlarge instance.

 author:
 - Tianyu Li

content/learning-paths/servers-and-cloud-computing/deepseek-cpu/deepseek-chatbot.md

Lines changed: 21 additions & 19 deletions
@@ -6,12 +6,10 @@ weight: 3
 layout: learningpathall
 ---

-## Before you begin
-The instructions in this Learning Path are for any Arm server running Ubuntu 24.04 LTS. You need an Arm server instance with at least 64 cores and 512GB of RAM to run this example. Configure disk storage up to at least 400 GB. The instructions have been tested on an AWS Graviton4 r8g.24xlarge instance.
-
-
 ## Background and what you'll build

+The instructions in this Learning Path are for any Arm server running Ubuntu 24.04 LTS. You need an Arm server instance with at least 64 cores and 512GB of RAM to run this example. Configure disk storage up to at least 400 GB. The instructions have been tested on an AWS Graviton4 r8g.24xlarge instance.
+
 Arm CPUs are widely used in ML and AI use cases. In this Learning Path, you will learn how to run a generative AI inference-based use case of a LLM chatbot on Arm-based CPUs by deploying the [DeepSeek-R1 671B LLM](https://huggingface.co/bartowski/DeepSeek-R1-GGUF) on your Arm-based CPU using `llama.cpp`, optimized for Arm hardware. You'll:

 - Build and run `llama.cpp` with Arm-specific performance improvements.
@@ -22,17 +20,17 @@ Arm CPUs are widely used in ML and AI use cases. In this Learning Path, you will

 [llama.cpp](https://github.com/ggerganov/llama.cpp) is an open source C/C++ project developed by Georgi Gerganov that enables efficient LLM inference on a variety of hardware - both locally, and in the cloud.

-## About the DeepSeek-R1 model and GGUF model format
+## Understanding the DeepSeek-R1 model and GGUF format

 The [DeepSeek-R1 model](https://huggingface.co/deepseek-ai/DeepSeek-R1) from DeepSeek-AI available on Hugging Face, is released under the [MIT License](https://github.com/deepseek-ai/DeepSeek-R1/blob/main/LICENSE) and free to use for research and commercial purposes.

 The DeepSeek-R1 model has 671 billion parameters, based on Mixture of Experts(MoE) architecture. This improves inference speed and maintains model quality. For this example, the full 671 billion (671B) model is used for retaining quality chatbot capability while also running efficiently on your Arm-based CPU.

 Traditionally, the training and inference of LLMs has been done on GPUs using full-precision 32-bit (FP32) or half-precision 16-bit (FP16) data type formats for the model parameter and weights. Recently, a new binary model format called GGUF was introduced by the `llama.cpp` team. This new GGUF model format uses compression and quantization techniques that remove the dependency on using FP32 and FP16 data type formats. For example, GGUF supports quantization where model weights that are generally stored as FP16 data types are scaled down to 4-bit integers. This significantly reduces the need for computational resources and the amount of RAM required. These advancements made in the model format and the data types used make Arm CPUs a great fit for running LLM inferences.

-## Install dependencies
+## Install build dependencies on your Arm-based server

-Install the following packages on your Arm based server instance:
+Install the following packages:

 ```bash
 sudo apt update
@@ -46,7 +44,7 @@ sudo apt install gcc g++ -y
 sudo apt install build-essential -y
 ```

-## Download and build llama.cpp
+## Clone and build llama.cpp

 You are now ready to start building `llama.cpp`.
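
The clone and build commands themselves fall outside the changed hunks. For orientation only, a minimal sketch of a typical llama.cpp clone-and-build on an Arm server is shown below; the Learning Path's exact steps may pin a specific commit or pass extra CMake flags.

```bash
# Rough sketch of a standard llama.cpp build; not necessarily the Learning Path's exact commands
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
```
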

@@ -107,7 +105,7 @@ general:
 ```


-## Install Hugging Face Hub
+## Set up Hugging Face and download the model

 There are a few different ways you can download the DeepSeek-R1 model. In this Learning Path, you download the model from Hugging Face.
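
The renamed section now covers both installing the Hugging Face tooling and fetching the model; the exact download invocation appears in the next hunk's context. A minimal setup sketch, assuming `pip` is available on the instance (an access token may not be required for this public model):

```bash
# Assumed setup steps; the Learning Path's exact commands may differ
pip install huggingface_hub
huggingface-cli login   # optional for public models; paste a Hugging Face access token if prompted
```
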

@@ -143,19 +141,19 @@ huggingface-cli download bartowski/DeepSeek-R1-GGUF --include "*DeepSeek-R1-Q4_0
 ```
 Before you proceed and run this model, take a quick look at what `Q4_0` in the model name denotes.

-## Quantization format
+## Understanding the Quantization format

 `Q4_0` in the model name refers to the quantization method the model uses. The goal of quantization is to reduce the size of the model (to reduce the memory space required) and faster (to reduce memory bandwidth bottlenecks transferring large amounts of data from memory to a processor). The primary trade-off to keep in mind when reducing a model's size is maintaining quality of performance. Ideally, a model is quantized to meet size and speed requirements while not having a negative impact on performance.

 This model is `DeepSeek-R1-Q4_0-00001-of-00010.gguf`, so what does each component mean in relation to the quantization level? The main thing to note is the number of bits per parameter, which is denoted by 'Q4' in this case or 4-bit integer. As a result, by only using 4 bits per parameter for 671 billion parameters, the model drops to be 354 GB in size.

-## Run the pre-quantized DeepSeek-R1 LLM model weights on your Arm-based server
+## Run the DeepSeek-R1 Chatbot on your Arm server

 As of [llama.cpp commit 0f1a39f3](https://github.com/ggerganov/llama.cpp/commit/0f1a39f3), Arm has contributed code for performance optimization with three types of GEMV/GEMM kernels corresponding to three processor types:

-* AWS Graviton2, where you only have NEON support (you will see less improvement for these GEMV/GEMM kernels),
-* AWS Graviton3, where the GEMV/GEMM kernels exploit both SVE 256 and MATMUL INT8 support, and
-* AWS Graviton4, where the GEMV/GEMM kernels exploit NEON/SVE 128 and MATMUL_INT8 support
+* AWS Graviton2, where you only have NEON support (you will see less improvement for these GEMV/GEMM kernels).
+* AWS Graviton3, where the GEMV/GEMM kernels exploit both SVE 256 and MATMUL INT8 support.
+* AWS Graviton4, where the GEMV/GEMM kernels exploit NEON/SVE 128 and MATMUL_INT8 support.

 With the latest commits in `llama.cpp` you will see improvements for these Arm optimized kernels directly on your Arm-based server. You can run the pre-quantized Q4_0 model as is and do not need to re-quantize the model.
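
The 354 GB figure quoted in this hunk can be sanity-checked with simple arithmetic. The sketch below is illustrative only; the on-disk size also includes per-block scales, any tensors kept at higher precision, and GGUF metadata.

```bash
# Illustrative arithmetic only: 671 billion parameters at 4 bits each, in decimal GB
python3 -c "print(671e9 * 4 / 8 / 1e9, 'GB of raw 4-bit weights')"   # ~335.5 GB; overhead brings the files to roughly 354 GB
```
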

@@ -167,7 +165,9 @@ Run the pre-quantized DeepSeek-R1 model exactly as the weights were downloaded f

 This command will use the downloaded model (`-m` flag), disable conversation mode explicitly (`-no-cnv` flag), adjust the randomness of the generated text (`--temp` flag), with the specified prompt (`-p` flag), and target a 512 token completion (`-n` flag), using 64 threads (`-t` flag).

-You may notice there are many gguf files downloaded, llama.cpp can load all series of files by passing the first one with `-m` flag.
+You might notice there are many gguf files downloaded. Llama.cpp can load all series of files by passing the first one with `-m` flag.
+
+## Analyze the output and performance statistics

 You will see lots of interesting statistics being printed from llama.cpp about the model and the system, followed by the prompt and completion. The tail of the output from running this model on an AWS Graviton4 r8g.24xlarge instance is shown below:
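
The command described by the flag walk-through in this hunk is not part of the changed lines. A hedged reconstruction is shown below, assuming the `llama-cli` binary from the earlier build and the first Q4_0 shard named in the quantization section; the prompt text and temperature value are placeholders rather than the Learning Path's exact choices.

```bash
# Placeholder prompt and temperature; the flags match the description in the hunk above
./build/bin/llama-cli -m DeepSeek-R1-Q4_0-00001-of-00010.gguf -no-cnv --temp 0.6 -p "<your prompt>" -n 512 -t 64
```
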

@@ -380,10 +380,10 @@ llama_perf_context_print: total time = 42340.53 ms / 531 tokens

 The `system_info` printed from llama.cpp highlights important architectural features present on your hardware that improve the performance of the model execution. In the output shown above from running on an AWS Graviton4 instance, you will see:

-* NEON = 1 This flag indicates support for Arm's Neon technology which is an implementation of the Advanced SIMD instructions
-* ARM_FMA = 1 This flag indicates support for Arm Floating-point Multiply and Accumulate instructions
-* MATMUL_INT8 = 1 This flag indicates support for Arm int8 matrix multiplication instructions
-* SVE = 1 This flag indicates support for the Arm Scalable Vector Extension
+* NEON = 1 This flag indicates support for Arm's Neon technology which is an implementation of the Advanced SIMD instructions.
+* ARM_FMA = 1 This flag indicates support for Arm Floating-point Multiply and Accumulate instructions.
+* MATMUL_INT8 = 1 This flag indicates support for Arm int8 matrix multiplication instructions.
+* SVE = 1 This flag indicates support for the Arm Scalable Vector Extension.


 The end of the output shows several model timings:
@@ -392,5 +392,7 @@ The end of the output shows several model timings:
 * prompt eval time refers to the time taken to process the prompt before generating the new text.
 * eval time refers to the time taken to generate the output. Generally anything above 10 tokens per second is faster than what humans can read.

+## What's next?
+
 You have successfully run a LLM chatbot with Arm KleidiAI optimizations, all running on your Arm AArch64 CPU on your server. You can continue experimenting and trying out the model with different prompts.
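
As a rough reading of the timing line quoted in this hunk's header (531 tokens over 42340.53 ms, prompt processing and generation combined), the overall rate works out to about 12.5 tokens per second, comfortably above the 10 tokens-per-second reading-speed rule of thumb.

```bash
# Illustrative arithmetic from the numbers in the hunk header above
python3 -c "print(round(531 / (42340.53 / 1000), 1), 'tokens/s overall')"
```
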

content/learning-paths/servers-and-cloud-computing/deepseek-cpu/deepseek-server.md

Lines changed: 12 additions & 3 deletions
@@ -6,6 +6,8 @@ weight: 4
 layout: learningpathall
 ---

+## Start the LLM server with llama.cpp
+
 You can use the `llama.cpp` server program and submit requests using an OpenAI-compatible API.
 This enables applications to be created which access the LLM multiple times without starting and stopping it. You can also access the server over the network to another machine hosting the LLM.
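
The server start command itself is outside the changed hunks. A minimal sketch of launching `llama-server` with the downloaded model is shown below, using the binary path from the earlier build sketch and port 8080, which the curl example later in this file targets; the shard name and thread count are assumptions and the Learning Path's exact flags may differ.

```bash
# Assumed invocation; adjust the binary path, shard name, port, and thread count for your instance
./build/bin/llama-server -m DeepSeek-R1-Q4_0-00001-of-00010.gguf --port 8080 -t 64
```
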

@@ -45,7 +47,7 @@ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/jso
 }' 2>/dev/null | jq -C
 ```

-The `model` value in the API is not used, you can enter any value. This is because there is only one model loaded in the server.
+The `model` value is ignored by the server, so you can use any placeholder string. This is because there is only one model loaded in the server.

 Run the script:
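
Only the tail of the curl request is visible in this hunk. A hedged reconstruction of a complete request in the same OpenAI chat-completions shape follows; the endpoint, header, and `| jq -C` pipe come from the hunk, while the request body shown here is an assumption rather than the Learning Path's exact script.

```bash
# Assumed request body; the "model" value is ignored by the server, as noted above
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "any-value",
  "messages": [
    {"role": "user", "content": "Write a hello world program in C++."}
  ]
}' 2>/dev/null | jq -C
```
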

@@ -90,9 +92,11 @@ The `curl` command accesses the LLM and you see the output:
 }
 ```

+## Inspect the JSON output
+
 In the returned JSON data you see the LLM output, including the content created from the prompt.

-## Use Python
+## Access the API using Python

 You can also use a Python program to access the OpenAI-compatible API.

@@ -121,7 +125,7 @@ client = OpenAI(
 completion = client.chat.completions.create(
 model="not-used",
 messages=[
-{"role": "system", "content": "You are a coding assistant, skilled in programming.."},
+{"role": "system", "content": "You are a coding assistant, skilled in programming..."},
 {"role": "user", "content": "Write a hello world program in C++."}
 ],
 stream=True,
@@ -137,6 +141,9 @@ Run the Python file (make sure the server is still running):
 python ./python-test.py
 ```

+## Example Output
+
+
 You see the output generated by the LLM:

 ```output
@@ -192,4 +199,6 @@ When compiled and run, this program will display:

 Hello World!
 ```
+## What's next?
+
 You can continue to experiment with different large language models and write scripts to try them.
