
Commit 9f48e6b

Merge pull request #2192 from juliensimon/arcee-foundation-model-on-aws
New learning path: Deploy Arcee AFM-4.5B on AWS Graviton4 - final version
2 parents 2ce7d6e + c14c5cd commit 9f48e6b

File tree

5 files changed: +86 additions, −56 deletions

content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/00_overview.md

Lines changed: 2 additions & 2 deletions
@@ -8,9 +8,9 @@ layout: learningpathall

## The AFM-4.5B model

-AFM-4.5B is a 4.5-billion-parameter foundation model designed to balance accuracy, efficiency, and broad language coverage. Trained on nearly 7 trillion tokens of carefully filtered data, it performs well across a wide range of languages, including Arabic, English, French, German, Hindi, Italian, Korean, Mandarin, Portuguese, Russian, and Spanish.
+[AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) is a 4.5-billion-parameter foundation model designed to balance accuracy, efficiency, and broad language coverage. Trained on nearly 8 trillion tokens of carefully filtered data, it performs well across a wide range of languages, including Arabic, English, French, German, Hindi, Italian, Korean, Mandarin, Portuguese, Russian, and Spanish.

-In this Learning Path, you'll deploy AFM-4.5B using [Llama.cpp](https://github.com/ggerganov/llama.cpp) on an Arm-based AWS Graviton4 instance. You’ll walk through the full workflow, from setting up your environment and compiling the runtime, to downloading, quantizing, and running inference on the model. You'll also evaluate model quality using perplexity, a common metric for measuring how well a language model predicts text.
+In this Learning Path, you'll deploy [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) using [Llama.cpp](https://github.com/ggerganov/llama.cpp) on an Arm-based AWS Graviton4 instance. You’ll walk through the full workflow, from setting up your environment and compiling the runtime, to downloading, quantizing, and running inference on the model. You'll also evaluate model quality using perplexity, a common metric for measuring how well a language model predicts text.

This hands-on guide helps developers build cost-efficient, high-performance LLM applications on modern Arm server infrastructure using open-source tools and real-world deployment practices.

content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/05_downloading_and_optimizing_afm45b.md

Lines changed: 31 additions & 5 deletions
@@ -6,10 +6,19 @@ weight: 7
layout: learningpathall
---

-In this step, you’ll download the AFM-4.5B model from Hugging Face, convert it to the GGUF format for compatibility with `llama.cpp`, and generate quantized versions to optimize memory usage and improve inference speed.
+In this step, you’ll download the [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) model from Hugging Face, convert it to the GGUF format for compatibility with `llama.cpp`, and generate quantized versions to optimize memory usage and improve inference speed.
+
+**Note: if you want to skip the model optimization process, [GGUF](https://huggingface.co/arcee-ai/AFM-4.5B-GGUF) versions are available.**

Make sure to activate your virtual environment before running any commands. The instructions below walk you through downloading and preparing the model for efficient use on AWS Graviton4.

+## Signing up to Hugging Face
+
+In order to download AFM-4.5B, you will need:
+- a Hugging Face account: you can sign up at [https://huggingface.co](https://huggingface.co)
+- a read-only Hugging Face token: once logged in, you can create one at [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens). Don't forget to store it, as you will only be able to view it once.
+- to accept the terms of AFM-4.5B at [https://huggingface.co/arcee-ai/AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B)
+
## Install the Hugging Face libraries

```bash
@@ -19,14 +28,31 @@ pip install huggingface_hub hf_xet
This command installs:

- `huggingface_hub`: Python client for downloading models and datasets
-- `hf_xet`: Git extension for fetching large model files stored on Hugging Face
+- `hf_xet`: Git extension for fetching large model files stored on Hugging Face
+
+These tools include the `hf` command-line interface you'll use next.
+
+## Login to the Hugging Face Hub
+
+```bash
+hf auth login
+
+ _| _| _| _| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _|_|_|_| _|_| _|_|_| _|_|_|_|
+ _| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|
+ _|_|_|_| _| _| _| _|_| _| _|_| _| _| _| _| _| _|_| _|_|_| _|_|_|_| _| _|_|_|
+ _| _| _| _| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|
+ _| _| _|_| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _| _| _| _|_|_| _|_|_|_|
+
+To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
+Enter your token (input will not be visible):
+```

-These tools include the `huggingface-cli` command-line interface you'll use next.
+Please enter the token you created above, and answer 'n' to "Add token as git credential? (Y/n)".

## Download the AFM-4.5B model

```bash
-huggingface-cli download arcee-ai/afm-4.5B --local-dir models/afm-4-5b
+hf download arcee-ai/afm-4.5B --local-dir models/afm-4-5b
```

This command downloads the model to the `models/afm-4-5b` directory:

@@ -89,7 +115,7 @@ Similar to Q4_0, Arm has contributed optimized kernels for Q8_0 quantization tha

## Model files ready for inference

-After completing these steps, you'll have three versions of the AFM-4.5B model:
+After completing these steps, you'll have three versions of the AFM-4.5B model in `models/afm-4-5b`:
- `afm-4-5B-F16.gguf` - The original full-precision model (~15GB)
- `afm-4-5B-Q4_0.gguf` - 4-bit quantized version (~4.4GB) for memory-constrained environments
- `afm-4-5B-Q8_0.gguf` - 8-bit quantized version (~8GB) for balanced performance and memory usage
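The quantized sizes can be sanity-checked from llama.cpp's block formats: Q4_0 stores each block of 32 weights as a 16-bit scale plus 16 bytes of packed 4-bit values (18 bytes per block), and Q8_0 as a 16-bit scale plus 32 one-byte values (34 bytes per block). Here is a rough sketch in Python, assuming every tensor is quantized; real GGUF files keep some tensors at higher precision, so actual sizes differ:

```python
# Back-of-the-envelope GGUF size estimates for AFM-4.5B.
# Block layouts match llama.cpp's Q4_0/Q8_0 formats (32 weights per
# block, one fp16 scale each). Treating *every* tensor as quantized
# is a simplification, so real file sizes will differ somewhat.

GIB = 1024**3
PARAMS = 4.62e9  # parameter count reported by llama-bench for AFM-4.5B

def estimate_gib(bytes_per_block: int, weights_per_block: int = 32) -> float:
    """Estimate model size in GiB for a given quantization block layout."""
    blocks = PARAMS / weights_per_block
    return blocks * bytes_per_block / GIB

f16 = PARAMS * 2 / GIB       # 2 bytes per fp16 weight
q8_0 = estimate_gib(2 + 32)  # fp16 scale + 32 one-byte weights
q4_0 = estimate_gib(2 + 16)  # fp16 scale + 32 four-bit weights packed into 16 bytes

print(f"F16  ~ {f16:.1f} GiB")   # ~8.6 GiB
print(f"Q8_0 ~ {q8_0:.1f} GiB")  # ~4.6 GiB
print(f"Q4_0 ~ {q4_0:.1f} GiB")  # ~2.4 GiB
```

The Q4_0 estimate of ~2.4 GiB lines up with the 2.50 GiB size that llama-bench reports for the 4-bit file later in this Learning Path.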

content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/06_running_inference.md

Lines changed: 24 additions & 23 deletions
@@ -6,7 +6,7 @@ weight: 8
layout: learningpathall
---

-Now that you have the AFM-4.5B models in GGUF format, you can run inference using various Llama.cpp tools. In this step, you'll explore how to generate text, benchmark performance, and interact with the model through both command-line and HTTP APIs.
+Now that you have the [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) models in GGUF format, you can run inference using various Llama.cpp tools. In this step, you'll explore how to generate text, benchmark performance, and interact with the model through both command-line and HTTP APIs.

## Use llama-cli for interactive text generation

@@ -55,14 +55,15 @@ To exit the session, type `Ctrl+C` or `/bye`.
You'll then see performance metrics like this:

```bash
-llama_perf_sampler_print: sampling time = 26.66 ms / 356 runs ( 0.07 ms per token, 13352.84 tokens per second)
-llama_perf_context_print: load time = 782.72 ms
-llama_perf_context_print: prompt eval time = 392.40 ms / 24 tokens ( 16.35 ms per token, 61.16 tokens per second)
-llama_perf_context_print: eval time = 13173.66 ms / 331 runs ( 39.80 ms per token, 25.13 tokens per second)
-llama_perf_context_print: total time = 129945.08 ms / 355 tokens
+llama_perf_sampler_print: sampling time = 9.47 ms / 119 runs ( 0.08 ms per token, 12569.98 tokens per second)
+llama_perf_context_print: load time = 616.69 ms
+llama_perf_context_print: prompt eval time = 344.39 ms / 23 tokens ( 14.97 ms per token, 66.79 tokens per second)
+llama_perf_context_print: eval time = 9289.81 ms / 352 runs ( 26.39 ms per token, 37.89 tokens per second)
+llama_perf_context_print: total time = 17446.13 ms / 375 tokens
+llama_perf_context_print: graphs reused = 0
```

-In this example, the 8-bit model running on 16 threads generated 355 tokens, at ~25 tokens per second (`eval time`).
+In this example, the 8-bit model running on 16 threads generated 375 tokens, at ~37 tokens per second (`eval time`).

## Run a non-interactive prompt

@@ -77,7 +78,7 @@ This command:
- Sends a one-time prompt using `-p`
- Prints the generated response and exits

-The 4-bit model delivers faster generation—expect around 40 tokens per second on Graviton4. This shows how a more aggressive quantization recipe helps deliver faster performance.
+The 4-bit model delivers faster generation—expect around 60 tokens per second on Graviton4. This shows how a more aggressive quantization recipe helps deliver faster performance.

## Use llama-server for API access

@@ -130,29 +131,29 @@ The response includes the model’s reply and performance metrics:
      "index": 0,
      "message": {
        "role": "assistant",
-        "content": "Quantum computing uses quantum-mechanical phenomena, such as superposition and entanglement, to perform calculations. It allows for multiple possibilities to exist simultaneously, which can speed up certain processes. Unlike classical computers, quantum computers can solve complex problems and simulate systems more efficiently. Quantum bits (qubits) store information, and quantum gates perform operations. Quantum computing has potential applications in fields like cryptography, optimization, and materials science. Its development is an active area of research, with companies like IBM, Google, and Microsoft investing in quantum computing technology."
+        "content": "Quantum computing uses quantum-mechanical phenomena like superposition and entanglement to solve complex problems much faster than classical computers. Instead of binary bits (0 or 1), quantum bits (qubits) can exist in multiple states simultaneously, allowing for parallel processing of vast combinations of possibilities. This enables quantum computers to perform certain calculations exponentially faster, particularly in areas like cryptography, optimization, and drug discovery. However, quantum systems are fragile and prone to errors, requiring advanced error correction techniques. Current quantum computers are still in early stages but show promise for transformative applications."
      }
    }
  ],
-  "created": 1750929895,
+  "created": 1753876147,
  "model": "afm-4-5b",
-  "system_fingerprint": "b5757-716301d1",
+  "system_fingerprint": "b6030-1e15bfd4",
  "object": "chat.completion",
  "usage": {
-    "completion_tokens": 111,
+    "completion_tokens": 115,
    "prompt_tokens": 20,
-    "total_tokens": 131
+    "total_tokens": 135
  },
-  "id": "chatcmpl-tb93ww9iYCErwLJmsV0YLrIadVvpBk4m",
+  "id": "chatcmpl-0Zwzu03zbu77MFx4ogBsqz8E4IdxHOLU",
  "timings": {
-    "prompt_n": 11,
-    "prompt_ms": 105.651,
-    "prompt_per_token_ms": 9.604636363636363,
-    "prompt_per_second": 104.11638318615064,
-    "predicted_n": 111,
-    "predicted_ms": 2725.982,
-    "predicted_per_token_ms": 24.558396396396397,
-    "predicted_per_second": 40.719271073690145
+    "prompt_n": 20,
+    "prompt_ms": 68.37,
+    "prompt_per_token_ms": 3.4185000000000003,
+    "prompt_per_second": 292.525961679099,
+    "predicted_n": 115,
+    "predicted_ms": 1884.943,
+    "predicted_per_token_ms": 16.390808695652172,
+    "predicted_per_second": 61.00980241842857
  }
}
```
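The throughput figures in the `timings` block are derived directly from the token counts and wall-clock times: tokens per second is simply `n / (ms / 1000)`. A quick check, using the numbers from the response above:

```python
# Recompute the throughput figures from llama-server's "timings" block.
# The server reports token counts and wall-clock milliseconds; the
# *_per_second fields are just n / (ms / 1000).

timings = {
    "prompt_n": 20, "prompt_ms": 68.37,
    "predicted_n": 115, "predicted_ms": 1884.943,
}

def tokens_per_second(n: int, ms: float) -> float:
    return n / ms * 1000.0

prompt_tps = tokens_per_second(timings["prompt_n"], timings["prompt_ms"])
gen_tps = tokens_per_second(timings["predicted_n"], timings["predicted_ms"])

print(f"prompt processing: {prompt_tps:.1f} tokens/s")  # ~292.5, matches prompt_per_second
print(f"text generation:   {gen_tps:.1f} tokens/s")     # ~61.0, matches predicted_per_second
```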
@@ -161,7 +162,7 @@ The response includes the model’s reply and performance metrics:

You’ve now successfully:

-- Run AFM-4.5B in interactive and non-interactive modes
+- Run [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) in interactive and non-interactive modes
- Tested performance with different quantized models
- Served the model as an OpenAI-compatible API endpoint
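Beyond `curl`, any OpenAI-compatible client can talk to `llama-server`. Below is a minimal illustrative Python client, a sketch that assumes the server from this step is listening on `localhost:8080`; the helper names (`build_request`, `chat`) are hypothetical, not part of llama.cpp:

```python
import json
import urllib.request

def build_request(prompt: str, model: str = "afm-4-5b", max_tokens: int = 256) -> dict:
    """Build the JSON payload for llama-server's OpenAI-compatible
    /v1/chat/completions endpoint."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """POST a chat completion request and return the assistant's reply.
    Assumes llama-server is running locally, as set up in this step."""
    payload = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("Explain quantum computing in 100 words.")` should return the same kind of reply shown in the JSON response above.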

content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/07_evaluating_the_quantized_models.md

Lines changed: 25 additions & 22 deletions
@@ -26,9 +26,9 @@ bin/llama-bench -m models/afm-4-5b/afm-4-5B-Q4_0.gguf
```

Typical results on a 16 vCPU instance:
-- **F16 model**: ~15-16 tokens/second, ~15GB memory usage
-- **Q8_0 model**: ~25 tokens/second, ~8GB memory usage
-- **Q4_0 model**: ~40 tokens/second, ~4.4GB memory usage
+- **F16 model**: ~25 tokens/second, ~9GB memory usage
+- **Q8_0 model**: ~40 tokens/second, ~5GB memory usage
+- **Q4_0 model**: ~60 tokens/second, ~3GB memory usage

Your actual results might vary depending on your specific instance configuration and system load.
@@ -40,28 +40,31 @@ Use this command to benchmark performance across prompt sizes and thread counts:
bin/llama-bench -m models/afm-4-5b/afm-4-5B-Q4_0.gguf \
  -p 128,256,512 \
  -n 128 \
-  -t 8,16,24
+  -t 4,8,16
```

This command does the following:
- Loads the 4-bit model and runs inference benchmarks
- `-p`: evaluates prompt lengths of 128, 256, and 512 tokens
- `-n`: generates 128 tokens
-- `-t`: runs inference using 4, 8, and 24 threads
+- `-t`: runs inference using 4, 8, and 16 threads

-Here’s an example of how performance scales across threads and prompt sizes:
+Here’s an example of how performance scales across threads and prompt sizes (pp = prompt processing, tg = text generation):

| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
-| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 4 | pp128 | 62.90 ± 0.08 |
-| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 4 | pp512 | 57.63 ± 0.06 |
-| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 4 | tg128 | 15.18 ± 0.02 |
-| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 8 | pp128 | 116.23 ± 0.04 |
-| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 8 | pp512 | 106.39 ± 0.03 |
-| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 8 | tg128 | 25.29 ± 0.05 |
-| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 16 | pp128 | 206.67 ± 0.10 |
-| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 16 | pp512 | 190.18 ± 0.03 |
-| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 16 | tg128 | 40.99 ± 0.36 |
+| arcee 4B Q4_0 | 2.50 GiB | 4.62 B | CPU | 4 | pp128 | 106.03 ± 0.21 |
+| arcee 4B Q4_0 | 2.50 GiB | 4.62 B | CPU | 4 | pp256 | 102.82 ± 0.05 |
+| arcee 4B Q4_0 | 2.50 GiB | 4.62 B | CPU | 4 | pp512 | 95.41 ± 0.18 |
+| arcee 4B Q4_0 | 2.50 GiB | 4.62 B | CPU | 4 | tg128 | 24.15 ± 0.02 |
+| arcee 4B Q4_0 | 2.50 GiB | 4.62 B | CPU | 8 | pp128 | 196.02 ± 0.42 |
+| arcee 4B Q4_0 | 2.50 GiB | 4.62 B | CPU | 8 | pp256 | 190.23 ± 0.34 |
+| arcee 4B Q4_0 | 2.50 GiB | 4.62 B | CPU | 8 | pp512 | 177.14 ± 0.31 |
+| arcee 4B Q4_0 | 2.50 GiB | 4.62 B | CPU | 8 | tg128 | 40.86 ± 0.11 |
+| arcee 4B Q4_0 | 2.50 GiB | 4.62 B | CPU | 16 | pp128 | 346.08 ± 0.62 |
+| arcee 4B Q4_0 | 2.50 GiB | 4.62 B | CPU | 16 | pp256 | 336.72 ± 1.43 |
+| arcee 4B Q4_0 | 2.50 GiB | 4.62 B | CPU | 16 | pp512 | 315.83 ± 0.22 |
+| arcee 4B Q4_0 | 2.50 GiB | 4.62 B | CPU | 16 | tg128 | 62.39 ± 0.20 |

Even with just four threads, the Q4_0 model achieves comfortable generation speeds. On larger instances, you can run multiple concurrent model processes to support parallel workloads.
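The `tg128` rows in the table show how generation throughput scales with thread count. A small sketch using the table's numbers:

```python
# Scaling check on the tg128 (text generation) results from the
# llama-bench table: throughput rises with threads, but not linearly.

tg128 = {4: 24.15, 8: 40.86, 16: 62.39}  # threads -> tokens/s

base_threads, base_tps = 4, tg128[4]
for threads, tps in tg128.items():
    speedup = tps / base_tps
    efficiency = speedup / (threads / base_threads)
    print(f"{threads:2d} threads: {tps:6.2f} t/s, "
          f"{speedup:.2f}x speedup, {efficiency:.0%} parallel efficiency")
```

Going from 4 to 16 threads yields only about a 2.6x speedup on 4x the threads; token generation is typically memory-bandwidth-bound rather than compute-bound, so scaling tails off as more cores contend for the same bandwidth.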

@@ -102,7 +105,7 @@ To reduce runtime, add the `--chunks` flag to evaluate a subset of the data. For

## Run the evaluation as a background script

-Running a full perplexity evaluation on all three models takes about 5 hours. To avoid SSH timeouts and keep the process running after logout, wrap the commands in a shell script and run it in the background.
+Running a full perplexity evaluation on all three models takes about 3 hours. To avoid SSH timeouts and keep the process running after logout, wrap the commands in a shell script and run it in the background.

Create a script named ppl.sh:

@@ -119,13 +122,13 @@ bin/llama-perplexity -m models/afm-4-5b/afm-4-5B-Q4_0.gguf -f wikitext-2-raw/wik
tail -f ppl.sh.log
```

-Here are the full results.
+| Model | Generation speed (batch size 1, 16 vCPUs) | Memory Usage | Perplexity (Wikitext-2) | Perplexity Increase |
+|:-------:|:----------------------:|:------------:|:----------:|:----------------------:|
+| F16 | ~25 tokens per second | ~9 GB | 8.4612 +/- 0.06112 | 0 (baseline) |
+| Q8_0 | ~40 tokens per second | ~5 GB | 8.4776 +/- 0.06128 | +0.19% |
+| Q4_0 | ~60 tokens per second | ~3 GB | 9.1897 +/- 0.06604 | +8.6% |

-| Model | Generation speed (tokens/s, 16 vCPUs) | Memory Usage | Perplexity (Wikitext-2) |
-|:-------:|:----------------------:|:------------:|:----------:|
-| F16 | ~15–16 | ~15 GB | TODO |
-| Q8_0 | ~25 | ~8 GB | TODO |
-| Q4_0 | ~40 | ~4.4 GB | TODO |
+We can see that 8-bit quantization introduces negligible degradation. The 4-bit model does suffer more, but may still serve its purpose for simpler use cases. As always, you should run your own tests and make up your own mind.
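The "Perplexity Increase" column in the new table is the relative change of each quantized model's perplexity against the F16 baseline (lower perplexity is better). A quick check of the arithmetic:

```python
# Verify the "Perplexity Increase" column: the relative change of each
# quantized model's Wikitext-2 perplexity vs. the F16 baseline.

baseline = 8.4612  # F16 perplexity
quantized = {"Q8_0": 8.4776, "Q4_0": 9.1897}

for name, ppl in quantized.items():
    increase = (ppl - baseline) / baseline * 100
    print(f"{name}: +{increase:.2f}%")  # Q8_0 ~ +0.19%, Q4_0 ~ +8.61%
```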

When you have finished your benchmarking and evaluation, make sure to terminate your AWS EC2 instance in the AWS Management Console to avoid incurring unnecessary charges for unused compute resources.

content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/08_conclusion.md

Lines changed: 4 additions & 4 deletions
@@ -9,7 +9,7 @@ layout: learningpathall

## Wrap up your AFM-4.5B deployment

-Congratulations! You have completed the process of deploying the Arcee AFM-4.5B foundation model on AWS Graviton4.
+Congratulations! You have completed the process of deploying the Arcee [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) foundation model on AWS Graviton4.

Here’s a summary of what you built and how you can take your knowledge forward.

@@ -29,8 +29,8 @@ Using this Learning Path, you have:

The benchmarking results demonstrate the power of quantization and Arm-based computing:

-- **Memory efficiency** – the 4-bit model uses only ~4.4 GB of RAM compared to ~15 GB for the full-precision version
-- **Speed improvements** – inference with Q4_0 is 2–3x faster (40+ tokens/sec vs. 15–16 tokens/sec)
+- **Memory efficiency** – the 4-bit model uses only ~3 GB of RAM compared to ~9 GB for the full-precision version
+- **Speed improvements** – inference with Q4_0 is 2.5x faster (~60+ tokens/sec vs. 25 tokens/sec)
- **Cost optimization** – lower memory needs enable smaller, more affordable instances
- **Quality preservation** – the quantized models maintain strong perplexity scores, showing minimal quality loss

@@ -63,4 +63,4 @@ Together, Arcee AI’s foundation models, Llama.cpp’s efficient runtime, and G

From chatbots and content generation to research tools, this stack strikes a balance between performance, cost, and developer control.

-For more information on Arcee AI, and how you can build high-quality, secure, and cost-efficient AI solutions, please visit www.arcee.ai.
+For more information on Arcee AI, and how you can build high-quality, secure, and cost-efficient AI solutions, please visit [www.arcee.ai](https://www.arcee.ai).
