**File:** `content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-gcp/05_downloading_and_optimizing_afm45b.md`
---
title: Download and optimize the AFM-4.5B model for Llama.cpp
weight: 7

### FIXED, DO NOT MODIFY
layout: learningpathall
---

In this step, you’ll download the [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) model from Hugging Face, convert it to the GGUF format for compatibility with Llama.cpp, and generate quantized versions to optimize memory usage and inference speed.

**Note:** If you want to skip model optimization, pre-converted [GGUF versions](https://huggingface.co/arcee-ai/AFM-4.5B-GGUF) are available.

Make sure your Python virtual environment is activated before running commands. These instructions show you how to prepare AFM-4.5B for efficient inference on Google Cloud Axion Arm64 with Llama.cpp.
## Sign up to Hugging Face

To download AFM-4.5B, you need to:

- Sign up for a Hugging Face account at [https://huggingface.co](https://huggingface.co)
- Create a read-only token at [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) (store it securely; it is only shown once)
- Accept the model terms at [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B)
## Install Hugging Face libraries

```bash
pip install huggingface_hub hf_xet --upgrade
```

This command installs the latest versions of:

- `huggingface_hub`: Python client for downloading models and datasets
- `hf_xet`: Xet storage plugin that speeds up downloads of large model files from Hugging Face

These tools include the `hf` command-line interface you'll use next.
## Log in to Hugging Face Hub

```bash
hf auth login
```

When prompted with `Enter your token (input will not be visible):`, enter the token you created above, and answer 'n' to "Add token as git credential? (Y/n)".
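The download, GGUF conversion, and quantization commands are elided from this diff. The following is a sketch of the typical pipeline; the file names match those used later on this page, but the exact flags are assumptions based on standard llama.cpp tooling (`convert_hf_to_gguf.py` and `llama-quantize`):

```shell
# Download the model weights from Hugging Face (requires the login above)
hf download arcee-ai/AFM-4.5B --local-dir models/afm-4-5b

# Convert the Hugging Face checkpoint to a full-precision GGUF file
python convert_hf_to_gguf.py models/afm-4-5b \
  --outfile models/afm-4-5b/afm-4-5B-F16.gguf --outtype f16

# Quantize the F16 model to 4-bit (Q4_0)
bin/llama-quantize models/afm-4-5b/afm-4-5B-F16.gguf \
  models/afm-4-5b/afm-4-5B-Q4_0.gguf Q4_0
```

Run these from your llama.cpp build directory; adjust paths to match your layout.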
This command creates a 4-bit quantized version of the model:

- The quantized model will use less memory and run faster, though with a small reduction in accuracy.
- The output file will be `afm-4-5B-Q4_0.gguf`.
## Arm optimizations for quantized models

Arm has contributed optimized kernels for Q4_0 that use Neoverse V2 instruction sets. These low-level routines accelerate math operations, delivering strong performance on Axion.
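The corresponding 8-bit command is also elided from the diff; a sketch, assuming the same `llama-quantize` tool and file layout as above:

```shell
# Quantize the F16 model to 8-bit (Q8_0)
bin/llama-quantize models/afm-4-5b/afm-4-5B-F16.gguf \
  models/afm-4-5b/afm-4-5B-Q8_0.gguf Q8_0
```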
This command creates an 8-bit quantized version of the model.

Similar to Q4_0, Arm has contributed optimized kernels for Q8_0 quantization that take advantage of Neoverse V2 instruction sets. These optimizations provide excellent performance for 8-bit operations while maintaining higher accuracy compared to 4-bit quantization.
## AFM-4.5B models ready for inference

After completing these steps, you'll have three versions of the AFM-4.5B model in `models/afm-4-5b`:

- `afm-4-5B-F16.gguf` - the original full-precision model (~15GB)
- `afm-4-5B-Q4_0.gguf` - 4-bit quantized version (~4.4GB) for memory-constrained environments
- `afm-4-5B-Q8_0.gguf` - 8-bit quantized version (~8GB) for balanced performance and memory usage

These models are now ready to use with the Llama.cpp inference engine on Google Cloud Axion Arm64.
**File:** `content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-gcp/06_running_inference.md`
---
title: Run inference with AFM-4.5B using Llama.cpp
weight: 8

### FIXED, DO NOT MODIFY
layout: learningpathall
---

Now that you have the [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) models in GGUF format, you can run inference on Google Cloud Axion Arm64 using various Llama.cpp tools. In this step, you’ll generate text, benchmark performance, and interact with the model through both command-line and HTTP APIs.

## Use llama-cli for interactive inference

The `llama-cli` tool provides an interactive command-line interface for text generation. This is useful for quick testing and exploring model behavior.
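The full `llama-cli` invocation is elided from this diff; a plausible sketch, assuming the Q8_0 model built earlier (the model choice is an assumption):

```shell
# Start an interactive chat session with the 8-bit model
bin/llama-cli -m models/afm-4-5b/afm-4-5B-Q8_0.gguf -n 256 --color
```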
This command uses the following options:

- `-m`: specifies the model file to load
- `-n 256`: sets the maximum number of tokens to generate per response
- `--color`: enables colored terminal output

You’ll be prompted to enter text, and the model generates a response.

By default, `llama-cli` uses 16 vCPUs. You can change this with `-t <number>`.
### Example interactive session

```
llama_perf_context_print: total time = 17446.13 ms / 375 tokens
llama_perf_context_print: graphs reused = 0
```

In this example, the 8-bit model running on 16 threads generated 375 tokens at ~37 tokens per second (`eval time`).
## Run a one-time prompt with llama-cli

You can also run `llama-cli` in non-interactive mode with a one-shot prompt:

```bash
bin/llama-cli -m models/afm-4-5b/afm-4-5B-Q4_0.gguf -n 256 --color -no-cnv -p "Give me a brief explanation of the attention mechanism in transformer models."
```
This command:

- Loads the 4-bit model
- Disables conversation mode with `-no-cnv`
- Sends a one-time prompt with `-p`
- Prints the response and exits

On Axion, the 4-bit model generates ~60 tokens per second, showing the speed benefit of more aggressive quantization.
## Use llama-server for API-based inference

The `llama-server` tool runs the model as a web server with an OpenAI-compatible API. This allows you to integrate the model into applications or batch jobs via HTTP requests.
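The launch command itself is elided from the diff; a sketch consistent with the port and context size described in the next lines (the flag spelling is an assumption based on standard `llama-server` options):

```shell
# Serve the 8-bit model on port 8080 with a 4096-token context window
bin/llama-server -m models/afm-4-5b/afm-4-5B-Q8_0.gguf --port 8080 -c 4096
```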
This starts a local server that:

- Accepts connections on port 8080
- Supports a 4096-token context window
### Send an API request

Once the server is running, you can make requests using curl, or any HTTP client.

Open a new terminal on the Google Cloud instance, and run:

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "afm-4-5b",
    "messages": [
      {
        "role": "user",
        "content": "Give me a brief explanation of the attention mechanism in transformer models."
      }
    ]
  }'
```
The response includes the model’s reply and performance metrics.

You’ve now successfully:

- Run [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) in interactive and one-shot modes
- Compared performance with different quantized models on Axion
- Served the model as an OpenAI-compatible API endpoint

You can also use the [OpenAI Python client](https://github.com/openai/openai-python) to send requests programmatically, enabling features like streaming responses.
**File:** `content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-gcp/07_evaluating_the_quantized_models.md`
---
title: Benchmark and evaluate AFM-4.5B quantized models on Axion
weight: 9

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Benchmark AFM-4.5B performance with llama-bench

Use the [`llama-bench`](https://github.com/ggml-org/llama.cpp/tree/master/tools/llama-bench) tool to measure model performance on Google Cloud Axion Arm64, including inference speed and memory usage.

## Benchmark full, 8-bit, and 4-bit models

Benchmark multiple model versions to compare performance:
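The benchmark commands are elided from this diff; a sketch that sweeps thread counts across all three models (the thread list is illustrative, and `-t` accepts a comma-separated list in `llama-bench`):

```shell
# Benchmark each model variant at several thread counts
for M in afm-4-5B-F16 afm-4-5B-Q8_0 afm-4-5B-Q4_0; do
  bin/llama-bench -m models/afm-4-5b/${M}.gguf -t 4,8,16
done
```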
Here’s an example of how performance scales across threads and prompt sizes.

Even with just four threads, the Q4_0 model achieves comfortable generation speeds. On larger instances, you can run multiple concurrent model processes to support parallel workloads.

For batch inference, use [`llama-batched-bench`](https://github.com/ggml-org/llama.cpp/tree/master/tools/batched-bench).
## Evaluate AFM-4.5B quality with llama-perplexity

Perplexity measures how well a language model predicts text. It gives you insight into the model’s confidence and predictive ability, representing the average number of possible next tokens the model considers when predicting each word. Use the `llama-perplexity` tool to measure how well each model predicts the next token in a sequence:
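The actual command is elided from the diff; a sketch, assuming the WikiText-2 test set (`wiki.test.raw`) commonly used in llama.cpp perplexity examples, with the `--chunks` flag described in the notice below:

```shell
# Evaluate the 4-bit model on the first 50 text blocks of the dataset
bin/llama-perplexity -m models/afm-4-5b/afm-4-5B-Q4_0.gguf \
  -f wiki.test.raw --chunks 50
```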
{{< notice note >}}
To reduce runtime, add the `--chunks` flag to evaluate a subset of the data. For example: `--chunks 50` runs the evaluation on the first 50 text blocks.
{{< /notice >}}

## Run perplexity evaluation in the background

Running a full perplexity evaluation on all three models takes about 3 hours. To avoid SSH timeouts and keep the process running after logout, wrap the commands in a shell script and run it in the background.
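One way to sketch this (the script name, log paths, and dataset file are illustrative):

```shell
# Write a script that evaluates all three model variants in sequence
cat > evaluate_models.sh << 'EOF'
#!/bin/bash
for M in afm-4-5B-F16 afm-4-5B-Q8_0 afm-4-5B-Q4_0; do
  bin/llama-perplexity -m models/afm-4-5b/${M}.gguf \
    -f wiki.test.raw > perplexity_${M}.log 2>&1
done
EOF
chmod +x evaluate_models.sh

# nohup keeps the script alive after you log out; it runs in the background
nohup ./evaluate_models.sh &
```

You can log out and later inspect the `perplexity_*.log` files for the results.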
## Review your AFM-4.5B deployment on Google Cloud Axion

Congratulations! You have successfully deployed the [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) foundation model on Google Cloud Axion Arm64.

Here’s a summary of what you built and how to extend it.

Using this Learning Path, you have:
The benchmarking results demonstrate the power of quantization and Arm-based computing:

- **Quality preservation** – the quantized models maintain strong perplexity scores, showing minimal quality loss

## Benefits of Google Cloud Axion Arm64

Google Cloud Axion processors, based on Arm Neoverse V2, provide:

- Better performance per watt than x86 alternatives
- 20–40% cost savings for compute-intensive workloads
- Optimized memory bandwidth and cache hierarchy for ML tasks
- Native Arm64 support for modern machine learning frameworks
## Next steps with AFM-4.5B on Axion

Now that you have a working deployment, you can extend it further.

**Production deployment**:

- Add auto-scaling for high availability
- Implement load balancing for multiple instances
- Enable monitoring and logging with Cloud Monitoring and Cloud Logging
- Secure API endpoints with authentication

**Application development**:

- Build a web app with the `llama-server` API
- Create a chatbot or assistant
- Develop content generation tools
- Integrate AFM-4.5B into existing apps via REST APIs

Together, Arcee AI’s foundation models, Llama.cpp’s efficient runtime, and Google Cloud Axion provide a scalable, cost-efficient platform for AI.

From chatbots and content generation to research tools, this stack delivers a balance of performance, cost, and developer control.

For more information on Arcee AI, and how you can build high-quality, secure, and cost-efficient AI solutions, visit [www.arcee.ai](https://www.arcee.ai).