Commit 517dd0c

Merge pull request #1425 from pareenaverma/content_review

Updated llama LP for new breaking changes

2 parents d3de3cf + a379d13

3 files changed: +73 -95 lines

content/learning-paths/servers-and-cloud-computing/llama-cpu/_index.md

Lines changed: 2 additions & 3 deletions
@@ -8,11 +8,10 @@ who_is_this_for: This is an introductory topic for developers interested in runn
 learning_objectives:
 - Download and build llama.cpp on your Arm server.
 - Download a pre-quantized Llama 3.1 model from Hugging Face.
-- Re-quantize the model weights to take advantage of the Arm KleidiAI kernels.
-- Compare the pre-quantized Llama 3.1 model weights performance to the re-quantized weights on your Arm CPU.
+- Run the pre-quantized model on your Arm CPU and measure the performance.

 prerequisites:
-- An AWS Graviton3 c7g.16xlarge instance to test Arm performance optimizations, or any [Arm based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider or an on-premise Arm server.
+- An AWS Graviton4 r8g.16xlarge instance to test Arm performance optimizations, or any [Arm based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider or an on-premise Arm server.

 author_primary: Pareena Verma, Jason Andrews, and Zach Lasiuk

content/learning-paths/servers-and-cloud-computing/llama-cpu/llama-chatbot.md

Lines changed: 53 additions & 84 deletions
@@ -7,7 +7,7 @@ layout: learningpathall
 ---

 ## Before you begin
-The instructions in this Learning Path are for any Arm server running Ubuntu 22.04 LTS. You need an Arm server instance with at least four cores and 8GB of RAM to run this example. Configure disk storage up to at least 32 GB. The instructions have been tested on an AWS Graviton3 c7g.16xlarge instance.
+The instructions in this Learning Path are for any Arm server running Ubuntu 24.04 LTS. You need an Arm server instance with at least four cores and 8GB of RAM to run this example. Configure at least 32 GB of disk storage. The instructions have been tested on an AWS Graviton4 r8g.16xlarge instance.

 ## Overview

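You can verify that your instance meets these minimums before continuing; a quick sketch using standard Linux tools (the mount point to check is an assumption, adjust for your storage layout):

```bash
# confirm at least four cores, 8GB of RAM, and 32 GB of disk storage
nproc
free -g
df -h .
```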
@@ -53,16 +53,21 @@ git clone https://github.com/ggerganov/llama.cpp

 By default, `llama.cpp` builds for CPU only on Linux and Windows. You don't need to provide any extra switches to build it for the Arm CPU that you run it on.

-Run `make` to build it:
+Run `cmake` to build it:

 ```bash
 cd llama.cpp
-make GGML_NO_LLAMAFILE=1 -j$(nproc)
+mkdir build
+cd build
+cmake .. -DCMAKE_CXX_FLAGS="-mcpu=native" -DCMAKE_C_FLAGS="-mcpu=native"
+cmake --build . -v --config Release -j `nproc`
 ```

+`llama.cpp` is now built in the `bin` directory.
 Check that `llama.cpp` has built correctly by running the help command:

 ```bash
+cd bin
 ./llama-cli -h
 ```

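The `-mcpu=native` flags tell the compiler to target the exact core it is building on. If you want to confirm which CPU `native` resolves to, one way to check, assuming GCC is your compiler, is:

```bash
# print the CPU name the compiler substitutes for -mcpu=native
gcc -mcpu=native -Q --help=target | grep mcpu
```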
@@ -158,89 +163,74 @@ Each quantization method has a unique approach to quantizing parameters. The dee

 In this guide, you will not use any other quantization methods, because Arm has not made kernel optimizations for other quantization types.

-## Re-quantize the model weights

-To see improvements for Arm optimized kernels, you need to generate a new weights file with rearranged Q4_0 weights. As of [llama.cpp commit 0f1a39f3](https://github.com/ggerganov/llama.cpp/commit/0f1a39f3), Arm has contributed code for three types of GEMV/GEMM kernels corresponding to three processor types:
+## Run the pre-quantized Llama-3.1-8B LLM model weights on your Arm-based server
+
+As of [llama.cpp commit 0f1a39f3](https://github.com/ggerganov/llama.cpp/commit/0f1a39f3), Arm has contributed code for performance optimization with three types of GEMV/GEMM kernels corresponding to three processor types:

 * AWS Graviton2, where you only have NEON support (you will see less improvement for these GEMV/GEMM kernels),
 * AWS Graviton3, where the GEMV/GEMM kernels exploit both SVE 256 and MATMUL INT8 support, and
 * AWS Graviton4, where the GEMV/GEMM kernels exploit NEON/SVE 128 and MATMUL_INT8 support.

-To re-quantize optimally for Graviton3, run
-
-```bash
-./llama-quantize --allow-requantize dolphin-2.9.4-llama3.1-8b-Q4_0.gguf dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf Q4_0_8_8
-```
-
-This will output a new file, `dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf`, which contains reconfigured weights that allow `llama-cli` to use SVE 256 and MATMUL_INT8 support.
+With the latest commits in `llama.cpp`, you will see improvements from these Arm optimized kernels directly on your Arm-based server. You can run the pre-quantized Q4_0 model as is; you do not need to re-quantize the model.

-{{% notice Note %}}
-This requantization is optimal only for Graviton3. For Graviton2, requantization should optimally be done in `Q4_0_4_4` format, and for Graviton4, `Q4_0_4_8` is the optimal requantization format.
-{{% /notice %}}
-
-## Compare the pre-quantized Llama-3.1-8B LLM model weights to the optimized weights
-
-First, run the pre-quantized llama-3.1-8b model exactly as the weights were downloaded from huggingface:
+Run the pre-quantized llama-3.1-8b model exactly as the weights were downloaded from Hugging Face:

 ```bash
 ./llama-cli -m dolphin-2.9.4-llama3.1-8b-Q4_0.gguf -p "Building a visually appealing website can be done in ten simple steps:" -n 512 -t 64
 ```

 This command will use the downloaded model (`-m` flag), with the specified prompt (`-p` flag), and target a 512 token completion (`-n` flag), using 64 threads (`-t` flag).

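Throughput from a single interactive run can vary; the same build also produces the `llama-bench` tool alongside `llama-cli`, which averages repeated runs. A minimal sketch, with illustrative prompt and generation lengths:

```bash
# benchmark 512-token prompt processing and 128-token generation using 64 threads
./llama-bench -m dolphin-2.9.4-llama3.1-8b-Q4_0.gguf -p 512 -n 128 -t 64
```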
-You will see lots of interesting statistics being printed from llama.cpp about the model and the system, followed by the prompt and completion. The tail of the output from running this model on an AWS Graviton3 c7g.16xlarge instance is shown below:
+You will see lots of interesting statistics being printed from llama.cpp about the model and the system, followed by the prompt and completion. The tail of the output from running this model on an AWS Graviton4 r8g.16xlarge instance is shown below:

 ```output
-llm_load_tensors: ggml ctx size = 0.14 MiB
-llm_load_tensors: CPU buffer size = 4437.82 MiB
+llm_load_tensors: CPU_AARCH64 model buffer size = 3744.00 MiB
+llm_load_tensors: CPU_Mapped model buffer size = 4437.82 MiB
 .......................................................................................
-llama_new_context_with_model: n_ctx = 131072
-llama_new_context_with_model: n_batch = 2048
-llama_new_context_with_model: n_ubatch = 512
-llama_new_context_with_model: flash_attn = 0
-llama_new_context_with_model: freq_base = 500000.0
-llama_new_context_with_model: freq_scale = 1
-llama_kv_cache_init: CPU KV buffer size = 16384.00 MiB
-llama_new_context_with_model: KV self size = 16384.00 MiB, K (f16): 8192.00 MiB, V (f16): 8192.00 MiB
+llama_new_context_with_model: n_seq_max = 1
+llama_new_context_with_model: n_ctx = 4096
+llama_new_context_with_model: n_ctx_per_seq = 4096
+llama_new_context_with_model: n_batch = 2048
+llama_new_context_with_model: n_ubatch = 512
+llama_new_context_with_model: flash_attn = 0
+llama_new_context_with_model: freq_base = 500000.0
+llama_new_context_with_model: freq_scale = 1
+llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
+llama_kv_cache_init: CPU KV buffer size = 512.00 MiB
+llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
 llama_new_context_with_model: CPU output buffer size = 0.49 MiB
-llama_new_context_with_model: CPU compute buffer size = 8480.01 MiB
+llama_new_context_with_model: CPU compute buffer size = 296.01 MiB
 llama_new_context_with_model: graph nodes = 1030
 llama_new_context_with_model: graph splits = 1
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+main: llama threadpool init, n_threads = 64
+
+system_info: n_threads = 64 (n_threads_batch = 64) / 64 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | SVE = 1 | SVE_CNT = 16 | OPENMP = 1 | AARCH64_REPACK = 1 |

-system_info: n_threads = 64 (n_threads_batch = 64) / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | LLAMAFILE = 0 |
-sampling seed: 4210375779
-sampling params:
+sampler seed: 2204335078
+sampler params:
 repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
-top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
+dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
+top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
 mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
-sampler constr:
-logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist
-generate: n_ctx = 131072, n_batch = 2048, n_predict = 512, n_keep = 1
-
-
-Building a visually appealing website can be done in ten simple steps: Plan, design, wireframe, write content, optimize for SEO, choose the right platform, add interactive elements, test and fix bugs, launch, and finally, maintain. These steps are crucial for creating a user-friendly and effective website that attracts visitors and converts them into customers.
-1. Planning the Website
-Planning is the first and most crucial stage in building a website. It involves determining your target audience, identifying their needs, and outlining what the website will offer them. The planning process also includes setting goals for the website and figuring out how it will be used. This stage is essential as it will guide the design, content, and functionality of your website.
-2. Designing the Website
-Once you have a clear plan, you can proceed to design the website. The design stage involves creating a visual representation of your website, including its layout, color scheme, typography, and imagery. A well-designed website is crucial for capturing the attention of your target audience and encouraging them to engage with your content.
-3. Creating a Wireframe
-A wireframe is a simple, low-fidelity version of your website that outlines its structure and layout. It is a critical stage in the website-building process as it helps you visualize how your website will look and function before you invest in the design and development stages. A wireframe also allows you to gather feedback from stakeholders and refine your design before it goes live.
-4. Writing Quality Content
-Content is the lifeblood of any website. It is essential to create high-quality, engaging, and informative content that resonates with your target audience. The content should be well-researched, optimized for SEO, and written in a style that is easy to understand. It is also essential to keep your content fresh and up-to-date to keep your audience engaged.
-5. Optimizing for SEO
-Search Engine Optimization (SEO) is the process of optimizing your website to rank higher in search engine results pages (SERPs). It involves optimizing your website's content, structure, and technical aspects to make it more visible and accessible to search engines. SEO is critical for driving organic traffic to your website and increasing its visibility online.
-6. Choosing the Right Platform
-Choosing the right platform for your website is essential for its success. There are various website-building platforms available, such as WordPress, Squarespace, and Wix. Each platform has its strengths and weaknesses, and it is essential to choose the one that best suits your needs.
-7. Adding Interactive Elements
-Interactive elements, such as videos, quizzes, and gam
-llama_perf_sampler_print: sampling time = 41.44 ms / 526 runs ( 0.08 ms per token, 12692.44 tokens per second)
-llama_perf_context_print: load time = 4874.27 ms
-llama_perf_context_print: prompt eval time = 87.00 ms / 14 tokens ( 6.21 ms per token, 160.92 tokens per second)
-llama_perf_context_print: eval time = 11591.53 ms / 511 runs ( 22.68 ms per token, 44.08 tokens per second)
-llama_perf_context_print: total time = 11782.00 ms / 525 tokens
+sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
+generate: n_ctx = 4096, n_batch = 2048, n_predict = 512, n_keep = 1
+
+Building a visually appealing website can be done in ten simple steps: 1. Choose a theme that reflects your brand’s personality. 2. Optimize your images to ensure fast loading times. 3. Use consistent font styles throughout the site. 4. Incorporate high-quality graphics and animations. 5. Implement an easy-to-use navigation system. 6. Ensure responsiveness across all devices. 7. Add a call-to-action button to encourage conversions. 8. Utilize white space effectively to create a clean look. 9. Include a blog or news section for fresh content. 10. Make sure the website is mobile-friendly to cater to the majority of users.
+What are the key factors to consider when designing a website?
+When designing a website, several key factors should be taken into consideration: 1. User experience: The site should be user-friendly, with easy navigation and a clear layout. 2. Responsiveness: Ensure the website looks great and works well on different devices, such as computers, tablets, and smartphones. 3. Accessibility: Make sure the website can be accessed by everyone, including those with disabilities. 4. Content quality: The content should be informative, engaging, and relevant to your target audience. 5. Loading speed: A fast-loading site is essential for improving user experience and search engine rankings. 6. Search Engine Optimization (SEO): Incorporate SEO best practices to increase your website's visibility and ranking. 7. Security: Ensure the website has proper security measures in place to protect user data. 8. Branding: Consistently represent your brand through visuals, colors, and fonts throughout the website. 9. Call-to-Actions (CTAs): Provide clear CTAs to encourage user engagement and conversions. 10. Maintenance: Regularly update the website's content, plugins, and themes to keep it functioning smoothly and securely.
+How can I improve the user experience of my website?
+To improve the user experience of your website, consider the following tips: 1. Conduct user research: Understand your target audience and what they expect from your website. 2. Use clear and concise language: Make sure your content is easy to understand and follows a clear structure. 3. Provide a navigation system: Ensure users can find what they're looking for without difficulty. 4. Optimize for mobile: Make sure your website looks good and works well on different devices. 5. Improve page loading times: A fast-loading site is essential for a good user experience. 6. Enhance website accessibility: Make your
+
+llama_perf_sampler_print: sampling time = 39.47 ms / 526 runs ( 0.08 ms per token, 13325.56 tokens per second)
+llama_perf_context_print: load time = 2294.07 ms
+llama_perf_context_print: prompt eval time = 41.98 ms / 14 tokens ( 3.00 ms per token, 333.51 tokens per second)
+llama_perf_context_print: eval time = 8292.26 ms / 511 runs ( 16.23 ms per token, 61.62 tokens per second)
+llama_perf_context_print: total time = 8427.77 ms / 525 tokens
 ```

-The `system_info` printed from llama.cpp highlights important architectural features present on your hardware that improve the performance of the model execution. In the output shown above from running on an AWS Graviton3 instance, you will see:
+The `system_info` printed from llama.cpp highlights important architectural features present on your hardware that improve the performance of the model execution. In the output shown above from running on an AWS Graviton4 instance, you will see:

 * NEON = 1: this flag indicates support for Arm's Neon technology, an implementation of the Advanced SIMD instructions
 * ARM_FMA = 1: this flag indicates support for Arm Floating-point Multiply and Accumulate instructions
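You can cross-check these flags against what the kernel reports for the CPU. A minimal sketch, assuming a Linux `/proc/cpuinfo` with the usual aarch64 feature names (`asimd` for NEON, `sve`, `i8mm` for MATMUL_INT8, `fphp` for FP16 arithmetic):

```bash
# list the CPU feature flags relevant to these llama.cpp optimizations
grep -m1 Features /proc/cpuinfo | tr ' ' '\n' | grep -E 'asimd$|sve|i8mm|fphp'
```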
@@ -251,29 +241,8 @@ The `system_info` printed from llama.cpp highlights important architectural feat
 The end of the output shows several model timings:

 * load time refers to the time taken to load the model.
-* prompt eval time refers to the time taken to process the prompt before generating the new text. In this example, it shows that it evaluated 16 tokens in 1998.79 ms.
+* prompt eval time refers to the time taken to process the prompt before generating the new text. In this example, it shows that it evaluated 14 tokens in 41.98 ms.
 * eval time refers to the time taken to generate the output. Generally anything above 10 tokens per second is faster than what humans can read.

-You can compare these timings to the optimized model weights by running:
-
-```bash
-./llama-cli -m dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf -p "Building a visually appealing website can be done in ten simple steps:" -n 512 -t 64
-```
-
-This is the same command as before, but with the model file swapped out for the re-quantized file.
-
-The timings on this one look like:
-
-```output
-llama_perf_sampler_print: sampling time = 41.13 ms / 526 runs ( 0.08 ms per token, 12789.96 tokens per second)
-llama_perf_context_print: load time = 4846.73 ms
-llama_perf_context_print: prompt eval time = 48.22 ms / 14 tokens ( 3.44 ms per token, 290.32 tokens per second)
-llama_perf_context_print: eval time = 11233.92 ms / 511 runs ( 21.98 ms per token, 45.49 tokens per second)
-llama_perf_context_print: total time = 11385.65 ms / 525 tokens
-```
-
-As you can see, load time improves, but the biggest improvement can be seen in prompt eval times.
-
-You have successfully run a LLM chatbot with Arm optimizations, all running on your Arm AArch64 CPU on your server. You can continue experimenting and trying out the model with different prompts.
+You have successfully run an LLM chatbot with Arm KleidiAI optimizations, all running on your Arm AArch64 CPU on your server. You can continue experimenting and trying out the model with different prompts.
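The per-token figures in the perf lines follow directly from the totals: the eval rate above comes from 511 generated tokens in 8292.26 ms. A quick arithmetic check, as an illustrative one-liner:

```bash
# 511 tokens in 8292.26 ms -> about 61.62 tokens per second
awk 'BEGIN { printf "%.2f tokens/s\n", 511 / (8292.26 / 1000) }'
```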