diff --git a/content/learning-paths/servers-and-cloud-computing/deepseek-cpu/_index.md b/content/learning-paths/servers-and-cloud-computing/deepseek-cpu/_index.md
index 89924f5aef..714e5c72e1 100644
--- a/content/learning-paths/servers-and-cloud-computing/deepseek-cpu/_index.md
+++ b/content/learning-paths/servers-and-cloud-computing/deepseek-cpu/_index.md
@@ -7,15 +7,15 @@ cascade:
 minutes_to_complete: 30
 
-who_is_this_for: This is an introductory topic for developers interested in running DeepSeek-R1 on Arm-based servers.
+who_is_this_for: This Learning Path is for developers who want to run DeepSeek-R1 on Arm-based servers.
 
 learning_objectives:
-    - Download and build llama.cpp on your Arm server.
+    - Clone and build llama.cpp on your Arm-based server.
     - Download a pre-quantized DeepSeek-R1 model from Hugging Face.
-    - Run the pre-quantized model on your Arm CPU and measure the performance.
+    - Run the model on your Arm CPU and benchmark its performance.
 
 prerequisites:
-    - An [Arm based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider or an on-premise Arm server. This Learning Path was tested on an AWS Graviton4 r8g.24xlarge instance.
+    - An [Arm-based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud provider or an on-premise Arm server. This Learning Path was tested on an AWS Graviton4 r8g.24xlarge instance.
 
 author:
   - Tianyu Li
diff --git a/content/learning-paths/servers-and-cloud-computing/deepseek-cpu/deepseek-chatbot.md b/content/learning-paths/servers-and-cloud-computing/deepseek-cpu/deepseek-chatbot.md
index 7ba5604cc8..f365cae002 100644
--- a/content/learning-paths/servers-and-cloud-computing/deepseek-cpu/deepseek-chatbot.md
+++ b/content/learning-paths/servers-and-cloud-computing/deepseek-cpu/deepseek-chatbot.md
@@ -6,17 +6,21 @@ weight: 3
 
 layout: learningpathall
 ---
 
-## Before you begin
+## Background and what you'll build
+
 The instructions in this Learning Path are for any Arm server running Ubuntu 24.04 LTS.
 
 You need an Arm server instance with at least 64 cores and 512GB of RAM to run this example. Configure disk storage up to at least 400 GB. The instructions have been tested on an AWS Graviton4 r8g.24xlarge instance.
 
+Arm CPUs are widely used in ML and AI use cases. In this Learning Path, you will learn how to run generative AI inference, in the form of an LLM chatbot, on Arm-based CPUs by deploying the [DeepSeek-R1 671B LLM](https://huggingface.co/bartowski/DeepSeek-R1-GGUF) with [llama.cpp](https://github.com/ggerganov/llama.cpp), an open source C/C++ project that enables efficient LLM inference on a variety of hardware and includes optimizations for Arm. You will:
+
+- Build and run `llama.cpp` with Arm-specific performance improvements.
+- Download a quantized GGUF model from Hugging Face.
+- Run and measure performance on a large Arm instance (for example, AWS Graviton4).
 
-## Overview
-Arm CPUs are widely used in traditional ML and AI use cases. In this Learning Path, you learn how to run generative AI inference-based use cases like a LLM chatbot on Arm-based CPUs. You do this by deploying the [DeepSeek-R1 GGUF models](https://huggingface.co/bartowski/DeepSeek-R1-GGUF) on your Arm-based CPU using `llama.cpp`. [llama.cpp](https://github.com/ggerganov/llama.cpp) is an open source C/C++ project developed by Georgi Gerganov that enables efficient LLM inference on a variety of hardware - both locally, and in the cloud.
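+Before you start, it's worth confirming that your instance meets the core, memory, and disk requirements listed above. The short Python sketch below is one way to do this: it assumes a Linux system such as Ubuntu 24.04, where `/proc/meminfo` and `/proc/cpuinfo` are available, and the thresholds simply mirror the figures quoted in this Learning Path.
+
+```python
+import os
+import shutil
+
+# Thresholds mirror the requirements stated above; adjust them if your setup differs.
+MIN_CORES, MIN_RAM_GB, MIN_DISK_GB = 64, 512, 400
+
+cores = os.cpu_count()
+
+with open("/proc/meminfo") as f:
+    mem_kb = int(next(line for line in f if line.startswith("MemTotal")).split()[1])
+ram_gb = mem_kb / (1024 * 1024)
+
+# Free space on the root filesystem; change the path if your storage is mounted elsewhere.
+disk_gb = shutil.disk_usage("/").free / (1024 ** 3)
+
+features = set()
+with open("/proc/cpuinfo") as f:
+    for line in f:
+        if line.startswith("Features"):
+            features.update(line.split(":", 1)[1].split())
+
+print(f"CPU cores : {cores} (need >= {MIN_CORES})")
+print(f"Total RAM : {ram_gb:.0f} GB (need >= {MIN_RAM_GB})")
+print(f"Free disk : {disk_gb:.0f} GB (need >= {MIN_DISK_GB})")
+print(f"CPU flags : asimd={'asimd' in features} sve={'sve' in features} i8mm={'i8mm' in features}")
+```
+
+If the `sve` and `i8mm` flags are reported, your CPU supports the Scalable Vector Extension and int8 matrix multiplication features that the optimized kernels described later in this Learning Path take advantage of.
+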
-## About the DeepSeek-R1 model and GGUF model format
+## Understanding the DeepSeek-R1 model and GGUF format
 
 The [DeepSeek-R1 model](https://huggingface.co/deepseek-ai/DeepSeek-R1) from DeepSeek-AI available on Hugging Face, is released under the [MIT License](https://github.com/deepseek-ai/DeepSeek-R1/blob/main/LICENSE) and free to use for research and commercial purposes.
 
@@ -24,9 +28,9 @@ The DeepSeek-R1 model has 671 billion parameters, based on Mixture of Experts(Mo
 Traditionally, the training and inference of LLMs has been done on GPUs using full-precision 32-bit (FP32) or half-precision 16-bit (FP16) data type formats for the model parameter and weights. Recently, a new binary model format called GGUF was introduced by the `llama.cpp` team. This new GGUF model format uses compression and quantization techniques that remove the dependency on using FP32 and FP16 data type formats. For example, GGUF supports quantization where model weights that are generally stored as FP16 data types are scaled down to 4-bit integers. This significantly reduces the need for computational resources and the amount of RAM required. These advancements made in the model format and the data types used make Arm CPUs a great fit for running LLM inferences.
 
-## Install dependencies
+## Install build dependencies on your Arm-based server
 
-Install the following packages on your Arm based server instance:
+Install the following packages:
 
 ```bash
 sudo apt update
@@ -40,7 +44,7 @@ sudo apt install gcc g++ -y
 sudo apt install build-essential -y
 ```
 
-## Download and build llama.cpp
+## Clone and build llama.cpp
 
 You are now ready to start building `llama.cpp`.
 
@@ -101,7 +105,7 @@ general:
 ```
 
-## Install Hugging Face Hub
+## Set up Hugging Face and download the model
 
 There are a few different ways you can download the DeepSeek-R1 model. In this Learning Path, you download the model from Hugging Face.
 
@@ -137,19 +141,19 @@ huggingface-cli download bartowski/DeepSeek-R1-GGUF --include "*DeepSeek-R1-Q4_0
 ```
 
 Before you proceed and run this model, take a quick look at what `Q4_0` in the model name denotes.
 
-## Quantization format
+## Understanding the quantization format
 
 `Q4_0` in the model name refers to the quantization method the model uses. The goal of quantization is to reduce the size of the model (to reduce the memory space required) and faster (to reduce memory bandwidth bottlenecks transferring large amounts of data from memory to a processor). The primary trade-off to keep in mind when reducing a model's size is maintaining quality of performance. Ideally, a model is quantized to meet size and speed requirements while not having a negative impact on performance.
 
 This model is `DeepSeek-R1-Q4_0-00001-of-00010.gguf`, so what does each component mean in relation to the quantization level? The main thing to note is the number of bits per parameter, which is denoted by 'Q4' in this case or 4-bit integer. As a result, by only using 4 bits per parameter for 671 billion parameters, the model drops to be 354 GB in size.
 
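+A rough estimate shows where a figure of this size comes from: 4-bit weights alone for 671 billion parameters come to about 335 GB, and Q4_0 also stores a small scale factor for each block of 32 weights (roughly 4.5 bits per weight in total). The Python sketch below is only an approximation and ignores GGUF metadata and any tensors kept at higher precision:
+
+```python
+# Back-of-the-envelope size estimate for DeepSeek-R1 at different precisions.
+# Approximation only: GGUF metadata and higher-precision tensors are ignored.
+params = 671e9  # 671 billion parameters
+
+fp16 = params * 16 / 8   # bytes at 16 bits per weight
+int4 = params * 4 / 8    # bytes at 4 bits per weight
+q4_0 = params * 4.5 / 8  # Q4_0: 4-bit weights plus one FP16 scale per 32-weight block
+
+for name, size in [("FP16", fp16), ("4-bit only", int4), ("Q4_0 (approx.)", q4_0)]:
+    print(f"{name:<15}{size / 1e9:6.0f} GB ({size / 2**30:5.0f} GiB)")
+```
+
+The result is in the same ballpark as the roughly 354 GB of GGUF files you downloaded, and a fraction of what the FP16 weights would require.
+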
-## Run the pre-quantized DeepSeek-R1 LLM model weights on your Arm-based server
+## Run the DeepSeek-R1 chatbot on your Arm server
 
 As of [llama.cpp commit 0f1a39f3](https://github.com/ggerganov/llama.cpp/commit/0f1a39f3), Arm has contributed code for performance optimization with three types of GEMV/GEMM kernels corresponding to three processor types:
 
-* AWS Graviton2, where you only have NEON support (you will see less improvement for these GEMV/GEMM kernels),
-* AWS Graviton3, where the GEMV/GEMM kernels exploit both SVE 256 and MATMUL INT8 support, and
-* AWS Graviton4, where the GEMV/GEMM kernels exploit NEON/SVE 128 and MATMUL_INT8 support
+* AWS Graviton2, where you only have NEON support (you will see less improvement for these GEMV/GEMM kernels).
+* AWS Graviton3, where the GEMV/GEMM kernels exploit both SVE 256 and MATMUL INT8 support.
+* AWS Graviton4, where the GEMV/GEMM kernels exploit NEON/SVE 128 and MATMUL_INT8 support.
 
 With the latest commits in `llama.cpp` you will see improvements for these Arm optimized kernels directly on your Arm-based server. You can run the pre-quantized Q4_0 model as is and do not need to re-quantize the model.
 
@@ -161,7 +165,9 @@ Run the pre-quantized DeepSeek-R1 model exactly as the weights were downloaded f
 ```
 
 This command will use the downloaded model (`-m` flag), disable conversation mode explicitly (`-no-cnv` flag), adjust the randomness of the generated text (`--temp` flag), with the specified prompt (`-p` flag), and target a 512 token completion (`-n` flag), using 64 threads (`-t` flag).
 
-You may notice there are many gguf files downloaded, llama.cpp can load all series of files by passing the first one with `-m` flag.
+You might notice that there are many GGUF files. `llama.cpp` loads the complete series when you pass the first file with the `-m` flag.
+
+## Analyze the output and performance statistics
 
 You will see lots of interesting statistics being printed from llama.cpp about the model and the system, followed by the prompt and completion. The tail of the output from running this model on an AWS Graviton4 r8g.24xlarge instance is shown below:
 
@@ -374,10 +380,10 @@ llama_perf_context_print: total time = 42340.53 ms / 531 tokens
 
 The `system_info` printed from llama.cpp highlights important architectural features present on your hardware that improve the performance of the model execution. In the output shown above from running on an AWS Graviton4 instance, you will see:
 
- * NEON = 1 This flag indicates support for Arm's Neon technology which is an implementation of the Advanced SIMD instructions
- * ARM_FMA = 1 This flag indicates support for Arm Floating-point Multiply and Accumulate instructions
- * MATMUL_INT8 = 1 This flag indicates support for Arm int8 matrix multiplication instructions
- * SVE = 1 This flag indicates support for the Arm Scalable Vector Extension
+ * NEON = 1 This flag indicates support for Arm's Neon technology which is an implementation of the Advanced SIMD instructions.
+ * ARM_FMA = 1 This flag indicates support for Arm Floating-point Multiply and Accumulate instructions.
+ * MATMUL_INT8 = 1 This flag indicates support for Arm int8 matrix multiplication instructions.
+ * SVE = 1 This flag indicates support for the Arm Scalable Vector Extension.
 
 The end of the output shows several model timings:
 
@@ -386,5 +392,7 @@ The end of the output shows several model timings:
 * prompt eval time refers to the time taken to process the prompt before generating the new text.
 * eval time refers to the time taken to generate the output.
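+
+The tokens-per-second figures reported by llama.cpp follow directly from these timings. As a quick sketch, you can reproduce the calculation with the numbers from your own run; the values below are placeholders:
+
+```python
+# Turn llama.cpp timing output into tokens per second.
+# Placeholders only: substitute the milliseconds and token counts from the
+# "prompt eval time" and "eval time" lines printed at the end of your run.
+prompt_eval_ms, prompt_tokens = 5000.0, 19
+eval_ms, generated_tokens = 37000.0, 512
+
+print(f"Prompt processing: {prompt_tokens / (prompt_eval_ms / 1000):.1f} tokens per second")
+print(f"Token generation : {generated_tokens / (eval_ms / 1000):.1f} tokens per second")
+```
+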
 Generally anything above 10 tokens per second is faster than what humans can read.
+## What's next?
+
 You have successfully run a LLM chatbot with Arm KleidiAI optimizations, all running on your Arm AArch64 CPU on your server. You can continue experimenting and trying out the model with different prompts.
diff --git a/content/learning-paths/servers-and-cloud-computing/deepseek-cpu/deepseek-server.md b/content/learning-paths/servers-and-cloud-computing/deepseek-cpu/deepseek-server.md
index dfc1ca43b5..c1302ace6e 100644
--- a/content/learning-paths/servers-and-cloud-computing/deepseek-cpu/deepseek-server.md
+++ b/content/learning-paths/servers-and-cloud-computing/deepseek-cpu/deepseek-server.md
@@ -6,6 +6,8 @@ weight: 4
 
 layout: learningpathall
 ---
 
+## Start the LLM server with llama.cpp
+
 You can use the `llama.cpp` server program and submit requests using an OpenAI-compatible API. This enables applications to be created which access the LLM multiple times without starting and stopping it. You can also access the server over the network to another machine hosting the LLM.
 
@@ -45,7 +47,7 @@ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/jso
 }' 2>/dev/null | jq -C
 ```
 
-The `model` value in the API is not used, you can enter any value. This is because there is only one model loaded in the server.
+The `model` value is ignored by the server because only one model is loaded, so you can use any placeholder string.
 
 Run the script:
 
@@ -90,9 +92,11 @@ The `curl` command accesses the LLM and you see the output:
 }
 ```
 
+## Inspect the JSON output
+
 In the returned JSON data you see the LLM output, including the content created from the prompt.
 
-## Use Python
+## Access the API using Python
 
 You can also use a Python program to access the OpenAI-compatible API.
 
@@ -121,7 +125,7 @@ client = OpenAI(
 completion = client.chat.completions.create(
     model="not-used",
     messages=[
-    {"role": "system", "content": "You are a coding assistant, skilled in programming.."},
+    {"role": "system", "content": "You are a coding assistant, skilled in programming."},
     {"role": "user", "content": "Write a hello world program in C++."}
     ],
     stream=True,
@@ -137,6 +141,9 @@ Run the Python file (make sure the server is still running):
 python ./python-test.py
 ```
 
+## Example output
+
+
 You see the output generated by the LLM:
 
 ```output
@@ -192,4 +199,6 @@ When compiled and run, this program will display:
 Hello World!
 ```
 
+## What's next?
+
 You can continue to experiment with different large language models and write scripts to try them.
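+
+For example, a small variation on the script above is to make a non-streaming request and print the token counts reported by the server. The sketch below assumes the same local endpoint used in the earlier `curl` example; the API key is a placeholder because the local server does not check it by default:
+
+```python
+from openai import OpenAI
+
+# Same OpenAI-compatible endpoint used earlier; the key is a placeholder
+# because the local llama.cpp server does not check it by default.
+client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-used")
+
+# stream=False returns the whole completion as a single response object.
+completion = client.chat.completions.create(
+    model="not-used",
+    messages=[
+        {"role": "system", "content": "You are a coding assistant, skilled in programming."},
+        {"role": "user", "content": "Explain in two sentences what a Makefile is."}
+    ],
+    stream=False,
+)
+
+print(completion.choices[0].message.content)
+if completion.usage is not None:
+    print(f"Prompt tokens: {completion.usage.prompt_tokens}, "
+          f"completion tokens: {completion.usage.completion_tokens}")
+```
+
+Comparing token counts across runs is a simple way to track throughput as you experiment with different prompts and models.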