content/learning-paths/servers-and-cloud-computing/llama-cpu/_index.md (2 additions, 3 deletions)
@@ -8,11 +8,10 @@ who_is_this_for: This is an introductory topic for developers interested in runn
learning_objectives:
- Download and build llama.cpp on your Arm server.
- Download a pre-quantized Llama 3.1 model from Hugging Face.
- - Re-quantize the model weights to take advantage of the Arm KleidiAI kernels.
- - Compare the pre-quantized Llama 3.1 model weights performance to the re-quantized weights on your Arm CPU.
+ - Run the pre-quantized model on your Arm CPU and measure the performance.

prerequisites:
- - An AWS Graviton3 c7g.16xlarge instance to test Arm performance optimizations, or any [Arm based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider or an on-premise Arm server.
+ - An AWS Graviton4 r8g.16xlarge instance to test Arm performance optimizations, or any [Arm based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider or an on-premise Arm server.

author_primary: Pareena Verma, Jason Andrews, and Zach Lasiuk
content/learning-paths/servers-and-cloud-computing/llama-cpu/llama-chatbot.md (53 additions, 84 deletions)
@@ -7,7 +7,7 @@ layout: learningpathall
---

## Before you begin

- The instructions in this Learning Path are for any Arm server running Ubuntu 22.04 LTS. You need an Arm server instance with at least four cores and 8 GB of RAM to run this example. Configure at least 32 GB of disk storage. The instructions have been tested on an AWS Graviton3 c7g.16xlarge instance.
+ The instructions in this Learning Path are for any Arm server running Ubuntu 24.04 LTS. You need an Arm server instance with at least four cores and 8 GB of RAM to run this example. Configure at least 32 GB of disk storage. The instructions have been tested on an AWS Graviton4 r8g.16xlarge instance.
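You can confirm that an instance meets these requirements with standard Linux tools before going further; this quick check assumes nothing beyond a stock Ubuntu install:

```bash
# Check core count, available memory, and free disk space on the root volume
nproc
free -h
df -h /
```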
By default, `llama.cpp` builds for CPU only on Linux and Windows. You don't need to provide any extra switches to build it for the Arm CPU that you run it on.
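The build itself is covered earlier in this Learning Path. For reference, a default CPU-only build follows the upstream llama.cpp CMake flow; the commands below are a sketch of that upstream flow rather than the Learning Path's exact steps:

```bash
# Default CPU-only build; no extra switches are needed for the Arm CPU
cd llama.cpp            # assumes the repository was cloned into ./llama.cpp
cmake -B build
cmake --build build --config Release -j$(nproc)
```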
Check that `llama.cpp` has built correctly by running the help command:
```bash
+ cd bin
./llama-cli -h
```
@@ -158,89 +163,74 @@ Each quantization method has a unique approach to quantizing parameters. The dee
In this guide, you will not use any other quantization methods, because Arm has not made kernel optimizations for other quantization types.
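If you want to confirm that the file you downloaded is actually Q4_0 before a full run, one option is to read the model metadata that `llama-cli` prints while loading. This is a sketch; the exact metadata label (`ftype` or `file type`) varies between llama.cpp versions:

```bash
# Load the model, generate a single token, and pull out the reported file type
./llama-cli -m dolphin-2.9.4-llama3.1-8b-Q4_0.gguf -p "test" -n 1 2>&1 | grep -iE "ftype|file type"
```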
- ## Re-quantize the model weights

- To see improvements for Arm optimized kernels, you need to generate a new weights file with rearranged Q4_0 weights. As of [llama.cpp commit 0f1a39f3](https://github.com/ggerganov/llama.cpp/commit/0f1a39f3), Arm has contributed code for three types of GEMV/GEMM kernels corresponding to three processor types:
+ ## Run the pre-quantized Llama-3.1-8B LLM model weights on your Arm-based server

+ As of [llama.cpp commit 0f1a39f3](https://github.com/ggerganov/llama.cpp/commit/0f1a39f3), Arm has contributed code for performance optimization with three types of GEMV/GEMM kernels corresponding to three processor types:

* AWS Graviton2, where you only have NEON support (you will see less improvement for these GEMV/GEMM kernels),
* AWS Graviton3, where the GEMV/GEMM kernels exploit both SVE 256 and MATMUL INT8 support, and
* AWS Graviton4, where the GEMV/GEMM kernels exploit NEON/SVE 128 and MATMUL_INT8 support

- This will output a new file, `dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf`, which contains reconfigured weights that allow `llama-cli` to use SVE 256 and MATMUL_INT8 support.
+ With the latest commits in `llama.cpp` you will see improvements for these Arm optimized kernels directly on your Arm-based server. You can run the pre-quantized Q4_0 model as is and do not need to re-quantize the model.
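To check which of these kernel variants your own instance can use, inspect the CPU feature flags reported by Linux. This is a minimal sketch; the flag names shown (`sve`, `i8mm`, `asimddp`) are the ones Graviton3 and Graviton4 typically report, and the list on your machine may differ:

```bash
# Show all Arm feature flags for this instance
lscpu | grep -i "flags"

# Look specifically for the features the optimized GEMV/GEMM kernels rely on
grep -oE "sve|i8mm|asimddp" /proc/cpuinfo | sort -u
```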
- {{% notice Note %}}
- This requantization is optimal only for Graviton3. For Graviton2, requantization should optimally be done in `Q4_0_4_4` format, and for Graviton4, `Q4_0_4_8` is the optimal requantization format.
- {{% /notice %}}

- ## Compare the pre-quantized Llama-3.1-8B LLM model weights to the optimized weights

- First, run the pre-quantized llama-3.1-8b model exactly as the weights were downloaded from huggingface:
+ Run the pre-quantized llama-3.1-8b model exactly as the weights were downloaded from Hugging Face:
```bash
./llama-cli -m dolphin-2.9.4-llama3.1-8b-Q4_0.gguf -p "Building a visually appealing website can be done in ten simple steps:" -n 512 -t 64
```
This command will use the downloaded model (`-m` flag), with the specified prompt (`-p` flag), and target a 512 token completion (`-n` flag), using 64 threads (`-t` flag).
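The `-t 64` value matches the 64 vCPUs of the 16xlarge instances used for testing. On a smaller instance you can let the shell supply the thread count instead; a small variation of the same command, assuming `nproc` is available:

```bash
# Use one llama.cpp thread per available core
./llama-cli -m dolphin-2.9.4-llama3.1-8b-Q4_0.gguf \
  -p "Building a visually appealing website can be done in ten simple steps:" \
  -n 512 -t $(nproc)
```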
- You will see lots of interesting statistics being printed from llama.cpp about the model and the system, followed by the prompt and completion. The tail of the output from running this model on an AWS Graviton3 c7g.16xlarge instance is shown below:
+ You will see lots of interesting statistics being printed from llama.cpp about the model and the system, followed by the prompt and completion. The tail of the output from running this model on an AWS Graviton4 r8g.16xlarge instance is shown below:
```output
- llm_load_tensors: ggml ctx size = 0.14 MiB
- llm_load_tensors: CPU buffer size = 4437.82 MiB
+ llm_load_tensors: CPU_AARCH64 model buffer size = 3744.00 MiB
+ llm_load_tensors: CPU_Mapped model buffer size = 4437.82 MiB
- Building a visually appealing website can be done in ten simple steps: Plan, design, wireframe, write content, optimize for SEO, choose the right platform, add interactive elements, test and fix bugs, launch, and finally, maintain. These steps are crucial for creating a user-friendly and effective website that attracts visitors and converts them into customers.
- 1. Planning the Website
- Planning is the first and most crucial stage in building a website. It involves determining your target audience, identifying their needs, and outlining what the website will offer them. The planning process also includes setting goals for the website and figuring out how it will be used. This stage is essential as it will guide the design, content, and functionality of your website.
- 2. Designing the Website
- Once you have a clear plan, you can proceed to design the website. The design stage involves creating a visual representation of your website, including its layout, color scheme, typography, and imagery. A well-designed website is crucial for capturing the attention of your target audience and encouraging them to engage with your content.
- 3. Creating a Wireframe
- A wireframe is a simple, low-fidelity version of your website that outlines its structure and layout. It is a critical stage in the website-building process as it helps you visualize how your website will look and function before you invest in the design and development stages. A wireframe also allows you to gather feedback from stakeholders and refine your design before it goes live.
- 4. Writing Quality Content
- Content is the lifeblood of any website. It is essential to create high-quality, engaging, and informative content that resonates with your target audience. The content should be well-researched, optimized for SEO, and written in a style that is easy to understand. It is also essential to keep your content fresh and up-to-date to keep your audience engaged.
- 5. Optimizing for SEO
- Search Engine Optimization (SEO) is the process of optimizing your website to rank higher in search engine results pages (SERPs). It involves optimizing your website's content, structure, and technical aspects to make it more visible and accessible to search engines. SEO is critical for driving organic traffic to your website and increasing its visibility online.
- 6. Choosing the Right Platform
- Choosing the right platform for your website is essential for its success. There are various website-building platforms available, such as WordPress, Squarespace, and Wix. Each platform has its strengths and weaknesses, and it is essential to choose the one that best suits your needs.
- 7. Adding Interactive Elements
- Interactive elements, such as videos, quizzes, and gam
- llama_perf_sampler_print: sampling time = 41.44 ms / 526 runs ( 0.08 ms per token, 12692.44 tokens per second)
- llama_perf_context_print: load time = 4874.27 ms
- llama_perf_context_print: prompt eval time = 87.00 ms / 14 tokens ( 6.21 ms per token, 160.92 tokens per second)
- llama_perf_context_print: eval time = 11591.53 ms / 511 runs ( 22.68 ms per token, 44.08 tokens per second)
- llama_perf_context_print: total time = 11782.00 ms / 525 tokens
+ Building a visually appealing website can be done in ten simple steps: 1. Choose a theme that reflects your brand’s personality. 2. Optimize your images to ensure fast loading times. 3. Use consistent font styles throughout the site. 4. Incorporate high-quality graphics and animations. 5. Implement an easy-to-use navigation system. 6. Ensure responsiveness across all devices. 7. Add a call-to-action button to encourage conversions. 8. Utilize white space effectively to create a clean look. 9. Include a blog or news section for fresh content. 10. Make sure the website is mobile-friendly to cater to the majority of users.
+ What are the key factors to consider when designing a website?
+ When designing a website, several key factors should be taken into consideration: 1. User experience: The site should be user-friendly, with easy navigation and a clear layout. 2. Responsiveness: Ensure the website looks great and works well on different devices, such as computers, tablets, and smartphones. 3. Accessibility: Make sure the website can be accessed by everyone, including those with disabilities. 4. Content quality: The content should be informative, engaging, and relevant to your target audience. 5. Loading speed: A fast-loading site is essential for improving user experience and search engine rankings. 6. Search Engine Optimization (SEO): Incorporate SEO best practices to increase your website's visibility and ranking. 7. Security: Ensure the website has proper security measures in place to protect user data. 8. Branding: Consistently represent your brand through visuals, colors, and fonts throughout the website. 9. Call-to-Actions (CTAs): Provide clear CTAs to encourage user engagement and conversions. 10. Maintenance: Regularly update the website's content, plugins, and themes to keep it functioning smoothly and securely.
+ How can I improve the user experience of my website?
+ To improve the user experience of your website, consider the following tips: 1. Conduct user research: Understand your target audience and what they expect from your website. 2. Use clear and concise language: Make sure your content is easy to understand and follows a clear structure. 3. Provide a navigation system: Ensure users can find what they're looking for without difficulty. 4. Optimize for mobile: Make sure your website looks good and works well on different devices. 5. Improve page loading times: A fast-loading site is essential for a good user experience. 6. Enhance website accessibility: Make your
+
+ llama_perf_sampler_print: sampling time = 39.47 ms / 526 runs ( 0.08 ms per token, 13325.56 tokens per second)
+ llama_perf_context_print: load time = 2294.07 ms
+ llama_perf_context_print: prompt eval time = 41.98 ms / 14 tokens ( 3.00 ms per token, 333.51 tokens per second)
+ llama_perf_context_print: eval time = 8292.26 ms / 511 runs ( 16.23 ms per token, 61.62 tokens per second)
+ llama_perf_context_print: total time = 8427.77 ms / 525 tokens
```
- The `system_info` printed from llama.cpp highlights important architectural features present on your hardware that improve the performance of the model execution. In the output shown above from running on an AWS Graviton3 instance, you will see:
+ The `system_info` printed from llama.cpp highlights important architectural features present on your hardware that improve the performance of the model execution. In the output shown above from running on an AWS Graviton4 instance, you will see:
* NEON = 1 This flag indicates support for Arm's Neon technology, which is an implementation of the Advanced SIMD instructions.
* ARM_FMA = 1 This flag indicates support for Arm Floating-point Multiply and Accumulate instructions.
@@ -251,29 +241,8 @@ The `system_info` printed from llama.cpp highlights important architectural feat
The end of the output shows several model timings:
* load time refers to the time taken to load the model.
- * prompt eval time refers to the time taken to process the prompt before generating the new text. In this example, it shows that it evaluated 16 tokens in 1998.79 ms.
+ * prompt eval time refers to the time taken to process the prompt before generating the new text. In this example, it shows that it evaluated 14 tokens in 41.98 ms.
* eval time refers to the time taken to generate the output. Generally anything above 10 tokens per second is faster than what humans can read.
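To reduce these timings to a single throughput number, divide generated tokens by eval time: in the run above, 511 tokens in roughly 8.29 seconds is about 61.6 tokens per second, matching the figure llama.cpp reports. For repeatable measurements across instance types or thread counts, llama.cpp also builds a benchmarking tool next to `llama-cli`. The sketch below assumes `llama-bench` is present in the same `bin` directory; note that its `-p` and `-n` flags take token counts rather than a prompt string:

```bash
# Benchmark 512 tokens of prompt processing and 128 tokens of generation with 64 threads
./llama-bench -m dolphin-2.9.4-llama3.1-8b-Q4_0.gguf -p 512 -n 128 -t 64
```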
- You can compare these timings to the optimized model weights by running:

- ```bash
- ./llama-cli -m dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf -p "Building a visually appealing website can be done in ten simple steps:" -n 512 -t 64
- ```

- This is the same command as before, but with the model file swapped out for the re-quantized file.

- The timings on this one look like:

- ```output
- llama_perf_sampler_print: sampling time = 41.13 ms / 526 runs ( 0.08 ms per token, 12789.96 tokens per second)
- llama_perf_context_print: load time = 4846.73 ms
- llama_perf_context_print: prompt eval time = 48.22 ms / 14 tokens ( 3.44 ms per token, 290.32 tokens per second)
- llama_perf_context_print: eval time = 11233.92 ms / 511 runs ( 21.98 ms per token, 45.49 tokens per second)
- llama_perf_context_print: total time = 11385.65 ms / 525 tokens
- ```

- As you can see, load time improves, but the biggest improvement can be seen in prompt eval times.

- You have successfully run an LLM chatbot with Arm optimizations, all running on your Arm AArch64 CPU on your server. You can continue experimenting and trying out the model with different prompts.
+ You have successfully run an LLM chatbot with Arm KleidiAI optimizations, all running on your Arm AArch64 CPU on your server. You can continue experimenting and trying out the model with different prompts.
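One way to keep experimenting is to switch from a single completion to an interactive chat. The sketch below uses the conversation flag from recent upstream `llama-cli` builds; the flag name (`-cnv`) and its behavior may differ in older versions, and the system prompt is only an example:

```bash
# Start an interactive chat session with the same model
./llama-cli -m dolphin-2.9.4-llama3.1-8b-Q4_0.gguf -cnv -p "You are a helpful assistant." -t 64
```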