Commit 517dd0c

Merge pull request #1425 from pareenaverma/content_review

Updated llama LP for new breaking changes

2 parents d3de3cf + a379d13

3 files changed: +73 -95 lines

content/learning-paths/servers-and-cloud-computing/llama-cpu/_index.md

Lines changed: 2 additions & 3 deletions
@@ -8,11 +8,10 @@ who_is_this_for: This is an introductory topic for developers interested in runn
 learning_objectives:
 - Download and build llama.cpp on your Arm server.
 - Download a pre-quantized Llama 3.1 model from Hugging Face.
-- Re-quantize the model weights to take advantage of the Arm KleidiAI kernels.
-- Compare the pre-quantized Llama 3.1 model weights performance to the re-quantized weights on your Arm CPU.
+- Run the pre-quantized model on your Arm CPU and measure the performance.

 prerequisites:
-- An AWS Graviton3 c7g.16xlarge instance to test Arm performance optimizations, or any [Arm based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider or an on-premise Arm server.
+- An AWS Graviton4 r8g.16xlarge instance to test Arm performance optimizations, or any [Arm based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider or an on-premise Arm server.

 author_primary: Pareena Verma, Jason Andrews, and Zach Lasiuk

content/learning-paths/servers-and-cloud-computing/llama-cpu/llama-chatbot.md

Lines changed: 53 additions & 84 deletions
@@ -7,7 +7,7 @@ layout: learningpathall
 ---

 ## Before you begin
-The instructions in this Learning Path are for any Arm server running Ubuntu 22.04 LTS. You need an Arm server instance with at least four cores and 8GB of RAM to run this example. Configure disk storage up to at least 32 GB. The instructions have been tested on an AWS Graviton3 c7g.16xlarge instance.
+The instructions in this Learning Path are for any Arm server running Ubuntu 24.04 LTS. You need an Arm server instance with at least four cores and 8GB of RAM to run this example. Configure at least 32 GB of disk storage. The instructions have been tested on an AWS Graviton4 r8g.16xlarge instance.

 ## Overview

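You can verify that your instance meets these minimums before continuing; a quick sketch using standard Linux tools (the mount point to check is an assumption, adjust for your storage layout):

```bash
# confirm at least four cores, 8GB of RAM, and 32 GB of disk storage
nproc
free -g
df -h .
```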
@@ -53,16 +53,21 @@ git clone https://github.com/ggerganov/llama.cpp

 By default, `llama.cpp` builds for CPU only on Linux and Windows. You don't need to provide any extra switches to build it for the Arm CPU that you run it on.

-Run `make` to build it:
+Run `cmake` to build it:

 ```bash
 cd llama.cpp
-make GGML_NO_LLAMAFILE=1 -j$(nproc)
+mkdir build
+cd build
+cmake .. -DCMAKE_CXX_FLAGS="-mcpu=native" -DCMAKE_C_FLAGS="-mcpu=native"
+cmake --build . -v --config Release -j `nproc`
 ```

+`llama.cpp` is now built in the `bin` directory.
 Check that `llama.cpp` has built correctly by running the help command:

 ```bash
+cd bin
 ./llama-cli -h
 ```

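The `-mcpu=native` flags tell the compiler to target the exact core it is building on. If you want to confirm which CPU `native` resolves to, one way to check, assuming GCC is your compiler, is:

```bash
# print the CPU name the compiler substitutes for -mcpu=native
gcc -mcpu=native -Q --help=target | grep mcpu
```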
@@ -158,89 +163,74 @@ Each quantization method has a unique approach to quantizing parameters. The dee

 In this guide, you will not use any other quantization methods, because Arm has not made kernel optimizations for other quantization types.

-## Re-quantize the model weights

-To see improvements for Arm optimized kernels, you need to generate a new weights file with rearranged Q4_0 weights. As of [llama.cpp commit 0f1a39f3](https://github.com/ggerganov/llama.cpp/commit/0f1a39f3), Arm has contributed code for three types of GEMV/GEMM kernels corresponding to three processor types:
+## Run the pre-quantized Llama-3.1-8B LLM model weights on your Arm-based server
+
+As of [llama.cpp commit 0f1a39f3](https://github.com/ggerganov/llama.cpp/commit/0f1a39f3), Arm has contributed code for performance optimization with three types of GEMV/GEMM kernels corresponding to three processor types:

 * AWS Graviton2, where you only have NEON support (you will see less improvement for these GEMV/GEMM kernels),
 * AWS Graviton3, where the GEMV/GEMM kernels exploit both SVE 256 and MATMUL INT8 support, and
 * AWS Graviton4, where the GEMV/GEMM kernels exploit NEON/SVE 128 and MATMUL_INT8 support.

-To re-quantize optimally for Graviton3, run
-
-```bash
-./llama-quantize --allow-requantize dolphin-2.9.4-llama3.1-8b-Q4_0.gguf dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf Q4_0_8_8
-```
-
-This will output a new file, `dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf`, which contains reconfigured weights that allow `llama-cli` to use SVE 256 and MATMUL_INT8 support.
+With the latest commits in `llama.cpp`, you will see improvements from these Arm optimized kernels directly on your Arm-based server. You can run the pre-quantized Q4_0 model as is; you do not need to re-quantize the model.

-{{% notice Note %}}
-This requantization is optimal only for Graviton3. For Graviton2, requantization should optimally be done in `Q4_0_4_4` format, and for Graviton4, `Q4_0_4_8` is the optimal requantization format.
-{{% /notice %}}
-
-## Compare the pre-quantized Llama-3.1-8B LLM model weights to the optimized weights
-
-First, run the pre-quantized llama-3.1-8b model exactly as the weights were downloaded from huggingface:
+Run the pre-quantized llama-3.1-8b model exactly as the weights were downloaded from Hugging Face:

 ```bash
 ./llama-cli -m dolphin-2.9.4-llama3.1-8b-Q4_0.gguf -p "Building a visually appealing website can be done in ten simple steps:" -n 512 -t 64
 ```

 This command will use the downloaded model (`-m` flag), with the specified prompt (`-p` flag), and target a 512 token completion (`-n` flag), using 64 threads (`-t` flag).

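Throughput from a single interactive run can vary; the same build also produces the `llama-bench` tool alongside `llama-cli`, which averages repeated runs. A minimal sketch, with illustrative prompt and generation lengths:

```bash
# benchmark 512-token prompt processing and 128-token generation using 64 threads
./llama-bench -m dolphin-2.9.4-llama3.1-8b-Q4_0.gguf -p 512 -n 128 -t 64
```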
-You will see lots of interesting statistics being printed from llama.cpp about the model and the system, followed by the prompt and completion. The tail of the output from running this model on an AWS Graviton3 c7g.16xlarge instance is shown below:
+You will see lots of interesting statistics being printed from llama.cpp about the model and the system, followed by the prompt and completion. The tail of the output from running this model on an AWS Graviton4 r8g.16xlarge instance is shown below:

 ```output
-llm_load_tensors: ggml ctx size = 0.14 MiB
-llm_load_tensors: CPU buffer size = 4437.82 MiB
+llm_load_tensors: CPU_AARCH64 model buffer size = 3744.00 MiB
+llm_load_tensors: CPU_Mapped model buffer size = 4437.82 MiB
 .......................................................................................
-llama_new_context_with_model: n_ctx = 131072
-llama_new_context_with_model: n_batch = 2048
-llama_new_context_with_model: n_ubatch = 512
-llama_new_context_with_model: flash_attn = 0
-llama_new_context_with_model: freq_base = 500000.0
-llama_new_context_with_model: freq_scale = 1
-llama_kv_cache_init: CPU KV buffer size = 16384.00 MiB
-llama_new_context_with_model: KV self size = 16384.00 MiB, K (f16): 8192.00 MiB, V (f16): 8192.00 MiB
+llama_new_context_with_model: n_seq_max = 1
+llama_new_context_with_model: n_ctx = 4096
+llama_new_context_with_model: n_ctx_per_seq = 4096
+llama_new_context_with_model: n_batch = 2048
+llama_new_context_with_model: n_ubatch = 512
+llama_new_context_with_model: flash_attn = 0
+llama_new_context_with_model: freq_base = 500000.0
+llama_new_context_with_model: freq_scale = 1
+llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
+llama_kv_cache_init: CPU KV buffer size = 512.00 MiB
+llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
 llama_new_context_with_model: CPU output buffer size = 0.49 MiB
-llama_new_context_with_model: CPU compute buffer size = 8480.01 MiB
+llama_new_context_with_model: CPU compute buffer size = 296.01 MiB
 llama_new_context_with_model: graph nodes = 1030
 llama_new_context_with_model: graph splits = 1
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+main: llama threadpool init, n_threads = 64
+
+system_info: n_threads = 64 (n_threads_batch = 64) / 64 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | SVE = 1 | SVE_CNT = 16 | OPENMP = 1 | AARCH64_REPACK = 1 |

-system_info: n_threads = 64 (n_threads_batch = 64) / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | LLAMAFILE = 0 |
-sampling seed: 4210375779
-sampling params:
+sampler seed: 2204335078
+sampler params:
 repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
-top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
+dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
+top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
 mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
-sampler constr:
-logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist
-generate: n_ctx = 131072, n_batch = 2048, n_predict = 512, n_keep = 1
-
-
-Building a visually appealing website can be done in ten simple steps: Plan, design, wireframe, write content, optimize for SEO, choose the right platform, add interactive elements, test and fix bugs, launch, and finally, maintain. These steps are crucial for creating a user-friendly and effective website that attracts visitors and converts them into customers.
-1. Planning the Website
-Planning is the first and most crucial stage in building a website. It involves determining your target audience, identifying their needs, and outlining what the website will offer them. The planning process also includes setting goals for the website and figuring out how it will be used. This stage is essential as it will guide the design, content, and functionality of your website.
-2. Designing the Website
-Once you have a clear plan, you can proceed to design the website. The design stage involves creating a visual representation of your website, including its layout, color scheme, typography, and imagery. A well-designed website is crucial for capturing the attention of your target audience and encouraging them to engage with your content.
-3. Creating a Wireframe
-A wireframe is a simple, low-fidelity version of your website that outlines its structure and layout. It is a critical stage in the website-building process as it helps you visualize how your website will look and function before you invest in the design and development stages. A wireframe also allows you to gather feedback from stakeholders and refine your design before it goes live.
-4. Writing Quality Content
-Content is the lifeblood of any website. It is essential to create high-quality, engaging, and informative content that resonates with your target audience. The content should be well-researched, optimized for SEO, and written in a style that is easy to understand. It is also essential to keep your content fresh and up-to-date to keep your audience engaged.
-5. Optimizing for SEO
-Search Engine Optimization (SEO) is the process of optimizing your website to rank higher in search engine results pages (SERPs). It involves optimizing your website's content, structure, and technical aspects to make it more visible and accessible to search engines. SEO is critical for driving organic traffic to your website and increasing its visibility online.
-6. Choosing the Right Platform
-Choosing the right platform for your website is essential for its success. There are various website-building platforms available, such as WordPress, Squarespace, and Wix. Each platform has its strengths and weaknesses, and it is essential to choose the one that best suits your needs.
-7. Adding Interactive Elements
-Interactive elements, such as videos, quizzes, and gam
-llama_perf_sampler_print: sampling time = 41.44 ms / 526 runs ( 0.08 ms per token, 12692.44 tokens per second)
-llama_perf_context_print: load time = 4874.27 ms
-llama_perf_context_print: prompt eval time = 87.00 ms / 14 tokens ( 6.21 ms per token, 160.92 tokens per second)
-llama_perf_context_print: eval time = 11591.53 ms / 511 runs ( 22.68 ms per token, 44.08 tokens per second)
-llama_perf_context_print: total time = 11782.00 ms / 525 tokens
+sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
+generate: n_ctx = 4096, n_batch = 2048, n_predict = 512, n_keep = 1
+
+Building a visually appealing website can be done in ten simple steps: 1. Choose a theme that reflects your brand’s personality. 2. Optimize your images to ensure fast loading times. 3. Use consistent font styles throughout the site. 4. Incorporate high-quality graphics and animations. 5. Implement an easy-to-use navigation system. 6. Ensure responsiveness across all devices. 7. Add a call-to-action button to encourage conversions. 8. Utilize white space effectively to create a clean look. 9. Include a blog or news section for fresh content. 10. Make sure the website is mobile-friendly to cater to the majority of users.
+What are the key factors to consider when designing a website?
+When designing a website, several key factors should be taken into consideration: 1. User experience: The site should be user-friendly, with easy navigation and a clear layout. 2. Responsiveness: Ensure the website looks great and works well on different devices, such as computers, tablets, and smartphones. 3. Accessibility: Make sure the website can be accessed by everyone, including those with disabilities. 4. Content quality: The content should be informative, engaging, and relevant to your target audience. 5. Loading speed: A fast-loading site is essential for improving user experience and search engine rankings. 6. Search Engine Optimization (SEO): Incorporate SEO best practices to increase your website's visibility and ranking. 7. Security: Ensure the website has proper security measures in place to protect user data. 8. Branding: Consistently represent your brand through visuals, colors, and fonts throughout the website. 9. Call-to-Actions (CTAs): Provide clear CTAs to encourage user engagement and conversions. 10. Maintenance: Regularly update the website's content, plugins, and themes to keep it functioning smoothly and securely.
+How can I improve the user experience of my website?
+To improve the user experience of your website, consider the following tips: 1. Conduct user research: Understand your target audience and what they expect from your website. 2. Use clear and concise language: Make sure your content is easy to understand and follows a clear structure. 3. Provide a navigation system: Ensure users can find what they're looking for without difficulty. 4. Optimize for mobile: Make sure your website looks good and works well on different devices. 5. Improve page loading times: A fast-loading site is essential for a good user experience. 6. Enhance website accessibility: Make your
+
+llama_perf_sampler_print: sampling time = 39.47 ms / 526 runs ( 0.08 ms per token, 13325.56 tokens per second)
+llama_perf_context_print: load time = 2294.07 ms
+llama_perf_context_print: prompt eval time = 41.98 ms / 14 tokens ( 3.00 ms per token, 333.51 tokens per second)
+llama_perf_context_print: eval time = 8292.26 ms / 511 runs ( 16.23 ms per token, 61.62 tokens per second)
+llama_perf_context_print: total time = 8427.77 ms / 525 tokens
 ```

-The `system_info` printed from llama.cpp highlights important architectural features present on your hardware that improve the performance of the model execution. In the output shown above from running on an AWS Graviton3 instance, you will see:
+The `system_info` printed from llama.cpp highlights important architectural features present on your hardware that improve the performance of the model execution. In the output shown above from running on an AWS Graviton4 instance, you will see:

 * NEON = 1: this flag indicates support for Arm's Neon technology, an implementation of the Advanced SIMD instructions
 * ARM_FMA = 1: this flag indicates support for Arm Floating-point Multiply and Accumulate instructions
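You can cross-check these flags against what the kernel reports for the CPU. A minimal sketch, assuming a Linux `/proc/cpuinfo` with the usual aarch64 feature names (`asimd` for NEON, `sve`, `i8mm` for MATMUL_INT8, `fphp` for FP16 arithmetic):

```bash
# list the CPU feature flags relevant to these llama.cpp optimizations
grep -m1 Features /proc/cpuinfo | tr ' ' '\n' | grep -E 'asimd$|sve|i8mm|fphp'
```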
@@ -251,29 +241,8 @@ The `system_info` printed from llama.cpp highlights important architectural feat
 The end of the output shows several model timings:

 * load time refers to the time taken to load the model.
-* prompt eval time refers to the time taken to process the prompt before generating the new text. In this example, it shows that it evaluated 16 tokens in 1998.79 ms.
+* prompt eval time refers to the time taken to process the prompt before generating the new text. In this example, it shows that it evaluated 14 tokens in 41.98 ms.
 * eval time refers to the time taken to generate the output. Generally anything above 10 tokens per second is faster than what humans can read.

-You can compare these timings to the optimized model weights by running:
-
-```bash
-./llama-cli -m dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf -p "Building a visually appealing website can be done in ten simple steps:" -n 512 -t 64
-```
-
-This is the same command as before, but with the model file swapped out for the re-quantized file.
-
-The timings on this one look like:
-
-```output
-llama_perf_sampler_print: sampling time = 41.13 ms / 526 runs ( 0.08 ms per token, 12789.96 tokens per second)
-llama_perf_context_print: load time = 4846.73 ms
-llama_perf_context_print: prompt eval time = 48.22 ms / 14 tokens ( 3.44 ms per token, 290.32 tokens per second)
-llama_perf_context_print: eval time = 11233.92 ms / 511 runs ( 21.98 ms per token, 45.49 tokens per second)
-llama_perf_context_print: total time = 11385.65 ms / 525 tokens
-```
-
-As you can see, load time improves, but the biggest improvement can be seen in prompt eval times.
-
-You have successfully run a LLM chatbot with Arm optimizations, all running on your Arm AArch64 CPU on your server. You can continue experimenting and trying out the model with different prompts.
+You have successfully run an LLM chatbot with Arm KleidiAI optimizations, all running on your Arm AArch64 CPU on your server. You can continue experimenting and trying out the model with different prompts.
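The per-token figures in the perf lines follow directly from the totals: the eval rate above comes from 511 generated tokens in 8292.26 ms. A quick arithmetic check, as an illustrative one-liner:

```bash
# 511 tokens in 8292.26 ms -> about 61.62 tokens per second
awk 'BEGIN { printf "%.2f tokens/s\n", 511 / (8292.26 / 1000) }'
```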