
Commit 9f48e6b

Merge pull request #2192 from juliensimon/arcee-foundation-model-on-aws
New learning path: Deploy Arcee AFM-4.5B on AWS Graviton4 - final version
2 parents 2ce7d6e + c14c5cd commit 9f48e6b

File tree

5 files changed: +86 additions, −56 deletions

content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/00_overview.md

Lines changed: 2 additions & 2 deletions
@@ -8,9 +8,9 @@ layout: learningpathall

## The AFM-4.5B model

-AFM-4.5B is a 4.5-billion-parameter foundation model designed to balance accuracy, efficiency, and broad language coverage. Trained on nearly 7 trillion tokens of carefully filtered data, it performs well across a wide range of languages, including Arabic, English, French, German, Hindi, Italian, Korean, Mandarin, Portuguese, Russian, and Spanish.
+[AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) is a 4.5-billion-parameter foundation model designed to balance accuracy, efficiency, and broad language coverage. Trained on nearly 8 trillion tokens of carefully filtered data, it performs well across a wide range of languages, including Arabic, English, French, German, Hindi, Italian, Korean, Mandarin, Portuguese, Russian, and Spanish.

-In this Learning Path, you'll deploy AFM-4.5B using [Llama.cpp](https://github.com/ggerganov/llama.cpp) on an Arm-based AWS Graviton4 instance. You’ll walk through the full workflow, from setting up your environment and compiling the runtime, to downloading, quantizing, and running inference on the model. You'll also evaluate model quality using perplexity, a common metric for measuring how well a language model predicts text.
+In this Learning Path, you'll deploy [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) using [Llama.cpp](https://github.com/ggerganov/llama.cpp) on an Arm-based AWS Graviton4 instance. You’ll walk through the full workflow, from setting up your environment and compiling the runtime, to downloading, quantizing, and running inference on the model. You'll also evaluate model quality using perplexity, a common metric for measuring how well a language model predicts text.

This hands-on guide helps developers build cost-efficient, high-performance LLM applications on modern Arm server infrastructure using open-source tools and real-world deployment practices.

content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/05_downloading_and_optimizing_afm45b.md

Lines changed: 31 additions & 5 deletions
@@ -6,10 +6,19 @@ weight: 7
layout: learningpathall
---

-In this step, you’ll download the AFM-4.5B model from Hugging Face, convert it to the GGUF format for compatibility with `llama.cpp`, and generate quantized versions to optimize memory usage and improve inference speed.
+In this step, you’ll download the [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) model from Hugging Face, convert it to the GGUF format for compatibility with `llama.cpp`, and generate quantized versions to optimize memory usage and improve inference speed.
+
+**Note: if you want to skip the model optimization process, [GGUF](https://huggingface.co/arcee-ai/AFM-4.5B-GGUF) versions are available.**

Make sure to activate your virtual environment before running any commands. The instructions below walk you through downloading and preparing the model for efficient use on AWS Graviton4.

+## Signing up to Hugging Face
+
+In order to download AFM-4.5B, you will need:
+- a Hugging Face account: you can sign up at [https://huggingface.co](https://huggingface.co)
+- a read-only Hugging Face token: once logged in, you can create one at [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens). Don't forget to store it, as you will only be able to view it once.
+- to accept the terms of AFM-4.5B at [https://huggingface.co/arcee-ai/AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B)
+
## Install the Hugging Face libraries

```bash
@@ -19,14 +28,31 @@ pip install huggingface_hub hf_xet
This command installs:

- `huggingface_hub`: Python client for downloading models and datasets
-- `hf_xet`: Git extension for fetching large model files stored on Hugging Face
+- `hf_xet`: Git extension for fetching large model files stored on Hugging Face
+
+These tools include the `hf` command-line interface you'll use next.
+
+## Login to the Hugging Face Hub
+
+```bash
+hf auth login
+
+ _| _| _| _| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _|_|_|_| _|_| _|_|_| _|_|_|_|
+ _| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|
+ _|_|_|_| _| _| _| _|_| _| _|_| _| _| _| _| _| _|_| _|_|_| _|_|_|_| _| _|_|_|
+ _| _| _| _| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|
+ _| _| _|_| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _| _| _| _|_|_| _|_|_|_|
+
+To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
+Enter your token (input will not be visible):
+```

-These tools include the `huggingface-cli` command-line interface you'll use next.
+Please enter the token you created above, and answer 'n' to "Add token as git credential? (Y/n)".

## Download the AFM-4.5B model

```bash
-huggingface-cli download arcee-ai/afm-4.5B --local-dir models/afm-4-5b
+hf download arcee-ai/afm-4.5B --local-dir models/afm-4-5b
```

This command downloads the model to the `models/afm-4-5b` directory:

@@ -89,7 +115,7 @@ Similar to Q4_0, Arm has contributed optimized kernels for Q8_0 quantization tha

## Model files ready for inference

-After completing these steps, you'll have three versions of the AFM-4.5B model:
+After completing these steps, you'll have three versions of the AFM-4.5B model in `models/afm-4-5b`:
- `afm-4-5B-F16.gguf` - The original full-precision model (~15GB)
- `afm-4-5B-Q4_0.gguf` - 4-bit quantized version (~4.4GB) for memory-constrained environments
- `afm-4-5B-Q8_0.gguf` - 8-bit quantized version (~8GB) for balanced performance and memory usage
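The quantized sizes can be sanity-checked from llama.cpp's block formats: Q4_0 stores each block of 32 weights as a 16-bit scale plus 16 bytes of packed 4-bit values (18 bytes per block), and Q8_0 as a 16-bit scale plus 32 one-byte values (34 bytes per block). Here is a rough sketch in Python, assuming every tensor is quantized; real GGUF files keep some tensors at higher precision, so actual sizes differ:

```python
# Back-of-the-envelope GGUF size estimates for AFM-4.5B.
# Block layouts match llama.cpp's Q4_0/Q8_0 formats (32 weights per
# block, one fp16 scale each). Treating *every* tensor as quantized
# is a simplification, so real file sizes will differ somewhat.

GIB = 1024**3
PARAMS = 4.62e9  # parameter count reported by llama-bench for AFM-4.5B

def estimate_gib(bytes_per_block: int, weights_per_block: int = 32) -> float:
    """Estimate model size in GiB for a given quantization block layout."""
    blocks = PARAMS / weights_per_block
    return blocks * bytes_per_block / GIB

f16 = PARAMS * 2 / GIB       # 2 bytes per fp16 weight
q8_0 = estimate_gib(2 + 32)  # fp16 scale + 32 one-byte weights
q4_0 = estimate_gib(2 + 16)  # fp16 scale + 32 four-bit weights packed into 16 bytes

print(f"F16  ~ {f16:.1f} GiB")   # ~8.6 GiB
print(f"Q8_0 ~ {q8_0:.1f} GiB")  # ~4.6 GiB
print(f"Q4_0 ~ {q4_0:.1f} GiB")  # ~2.4 GiB
```

The Q4_0 estimate of ~2.4 GiB lines up with the 2.50 GiB size that llama-bench reports for the 4-bit file later in this Learning Path.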

content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/06_running_inference.md

Lines changed: 24 additions & 23 deletions
@@ -6,7 +6,7 @@ weight: 8
layout: learningpathall
---

-Now that you have the AFM-4.5B models in GGUF format, you can run inference using various Llama.cpp tools. In this step, you'll explore how to generate text, benchmark performance, and interact with the model through both command-line and HTTP APIs.
+Now that you have the [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) models in GGUF format, you can run inference using various Llama.cpp tools. In this step, you'll explore how to generate text, benchmark performance, and interact with the model through both command-line and HTTP APIs.

## Use llama-cli for interactive text generation

@@ -55,14 +55,15 @@ To exit the session, type `Ctrl+C` or `/bye`.
You'll then see performance metrics like this:

```bash
-llama_perf_sampler_print: sampling time = 26.66 ms / 356 runs ( 0.07 ms per token, 13352.84 tokens per second)
-llama_perf_context_print: load time = 782.72 ms
-llama_perf_context_print: prompt eval time = 392.40 ms / 24 tokens ( 16.35 ms per token, 61.16 tokens per second)
-llama_perf_context_print: eval time = 13173.66 ms / 331 runs ( 39.80 ms per token, 25.13 tokens per second)
-llama_perf_context_print: total time = 129945.08 ms / 355 tokens
+llama_perf_sampler_print: sampling time = 9.47 ms / 119 runs ( 0.08 ms per token, 12569.98 tokens per second)
+llama_perf_context_print: load time = 616.69 ms
+llama_perf_context_print: prompt eval time = 344.39 ms / 23 tokens ( 14.97 ms per token, 66.79 tokens per second)
+llama_perf_context_print: eval time = 9289.81 ms / 352 runs ( 26.39 ms per token, 37.89 tokens per second)
+llama_perf_context_print: total time = 17446.13 ms / 375 tokens
+llama_perf_context_print: graphs reused = 0
```

-In this example, the 8-bit model running on 16 threads generated 355 tokens, at ~25 tokens per second (`eval time`).
+In this example, the 8-bit model running on 16 threads generated 375 tokens, at ~37 tokens per second (`eval time`).

## Run a non-interactive prompt

@@ -77,7 +78,7 @@ This command:
- Sends a one-time prompt using `-p`
- Prints the generated response and exits

-The 4-bit model delivers faster generation—expect around 40 tokens per second on Graviton4. This shows how a more aggressive quantization recipe helps deliver faster performance.
+The 4-bit model delivers faster generation—expect around 60 tokens per second on Graviton4. This shows how a more aggressive quantization recipe helps deliver faster performance.

## Use llama-server for API access

@@ -130,29 +131,29 @@ The response includes the model’s reply and performance metrics:
      "index": 0,
      "message": {
        "role": "assistant",
-        "content": "Quantum computing uses quantum-mechanical phenomena, such as superposition and entanglement, to perform calculations. It allows for multiple possibilities to exist simultaneously, which can speed up certain processes. Unlike classical computers, quantum computers can solve complex problems and simulate systems more efficiently. Quantum bits (qubits) store information, and quantum gates perform operations. Quantum computing has potential applications in fields like cryptography, optimization, and materials science. Its development is an active area of research, with companies like IBM, Google, and Microsoft investing in quantum computing technology."
+        "content": "Quantum computing uses quantum-mechanical phenomena like superposition and entanglement to solve complex problems much faster than classical computers. Instead of binary bits (0 or 1), quantum bits (qubits) can exist in multiple states simultaneously, allowing for parallel processing of vast combinations of possibilities. This enables quantum computers to perform certain calculations exponentially faster, particularly in areas like cryptography, optimization, and drug discovery. However, quantum systems are fragile and prone to errors, requiring advanced error correction techniques. Current quantum computers are still in early stages but show promise for transformative applications."
      }
    }
  ],
-  "created": 1750929895,
+  "created": 1753876147,
  "model": "afm-4-5b",
-  "system_fingerprint": "b5757-716301d1",
+  "system_fingerprint": "b6030-1e15bfd4",
  "object": "chat.completion",
  "usage": {
-    "completion_tokens": 111,
+    "completion_tokens": 115,
    "prompt_tokens": 20,
-    "total_tokens": 131
+    "total_tokens": 135
  },
-  "id": "chatcmpl-tb93ww9iYCErwLJmsV0YLrIadVvpBk4m",
+  "id": "chatcmpl-0Zwzu03zbu77MFx4ogBsqz8E4IdxHOLU",
  "timings": {
-    "prompt_n": 11,
-    "prompt_ms": 105.651,
-    "prompt_per_token_ms": 9.604636363636363,
-    "prompt_per_second": 104.11638318615064,
-    "predicted_n": 111,
-    "predicted_ms": 2725.982,
-    "predicted_per_token_ms": 24.558396396396397,
-    "predicted_per_second": 40.719271073690145
+    "prompt_n": 20,
+    "prompt_ms": 68.37,
+    "prompt_per_token_ms": 3.4185000000000003,
+    "prompt_per_second": 292.525961679099,
+    "predicted_n": 115,
+    "predicted_ms": 1884.943,
+    "predicted_per_token_ms": 16.390808695652172,
+    "predicted_per_second": 61.00980241842857
  }
}
```
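The throughput figures in the `timings` block are derived directly from the token counts and wall-clock times: tokens per second is simply `n / (ms / 1000)`. A quick check, using the numbers from the response above:

```python
# Recompute the throughput figures from llama-server's "timings" block.
# The server reports token counts and wall-clock milliseconds; the
# *_per_second fields are just n / (ms / 1000).

timings = {
    "prompt_n": 20, "prompt_ms": 68.37,
    "predicted_n": 115, "predicted_ms": 1884.943,
}

def tokens_per_second(n: int, ms: float) -> float:
    return n / ms * 1000.0

prompt_tps = tokens_per_second(timings["prompt_n"], timings["prompt_ms"])
gen_tps = tokens_per_second(timings["predicted_n"], timings["predicted_ms"])

print(f"prompt processing: {prompt_tps:.1f} tokens/s")  # ~292.5, matches prompt_per_second
print(f"text generation:   {gen_tps:.1f} tokens/s")     # ~61.0, matches predicted_per_second
```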
@@ -161,7 +162,7 @@ The response includes the model’s reply and performance metrics:

You’ve now successfully:

-- Run AFM-4.5B in interactive and non-interactive modes
+- Run [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) in interactive and non-interactive modes
- Tested performance with different quantized models
- Served the model as an OpenAI-compatible API endpoint
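Beyond `curl`, any OpenAI-compatible client can talk to `llama-server`. Below is a minimal illustrative Python client, a sketch that assumes the server from this step is listening on `localhost:8080`; the helper names (`build_request`, `chat`) are hypothetical, not part of llama.cpp:

```python
import json
import urllib.request

def build_request(prompt: str, model: str = "afm-4-5b", max_tokens: int = 256) -> dict:
    """Build the JSON payload for llama-server's OpenAI-compatible
    /v1/chat/completions endpoint."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """POST a chat completion request and return the assistant's reply.
    Assumes llama-server is running locally, as set up in this step."""
    payload = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("Explain quantum computing in 100 words.")` should return the same kind of reply shown in the JSON response above.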

content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/07_evaluating_the_quantized_models.md

Lines changed: 25 additions & 22 deletions
@@ -26,9 +26,9 @@ bin/llama-bench -m models/afm-4-5b/afm-4-5B-Q4_0.gguf
```

Typical results on a 16 vCPU instance:
-- **F16 model**: ~15-16 tokens/second, ~15GB memory usage
-- **Q8_0 model**: ~25 tokens/second, ~8GB memory usage
-- **Q4_0 model**: ~40 tokens/second, ~4.4GB memory usage
+- **F16 model**: ~25 tokens/second, ~9GB memory usage
+- **Q8_0 model**: ~40 tokens/second, ~5GB memory usage
+- **Q4_0 model**: ~60 tokens/second, ~3GB memory usage

Your actual results might vary depending on your specific instance configuration and system load.
@@ -40,28 +40,31 @@ Use this command to benchmark performance across prompt sizes and thread counts:
bin/llama-bench -m models/afm-4-5b/afm-4-5B-Q4_0.gguf \
  -p 128,256,512 \
  -n 128 \
-  -t 8,16,24
+  -t 4,8,16
```

This command does the following:
- Loads the 4-bit model and runs inference benchmarks
- `-p`: evaluates prompt lengths of 128, 256, and 512 tokens
- `-n`: generates 128 tokens
-- `-t`: runs inference using 4, 8, and 24 threads
+- `-t`: runs inference using 4, 8, and 16 threads

-Here’s an example of how performance scales across threads and prompt sizes:
+Here’s an example of how performance scales across threads and prompt sizes (pp = prompt processing, tg = text generation):

| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
-| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 4 | pp128 | 62.90 ± 0.08 |
-| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 4 | pp512 | 57.63 ± 0.06 |
-| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 4 | tg128 | 15.18 ± 0.02 |
-| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 8 | pp128 | 116.23 ± 0.04 |
-| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 8 | pp512 | 106.39 ± 0.03 |
-| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 8 | tg128 | 25.29 ± 0.05 |
-| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 16 | pp128 | 206.67 ± 0.10 |
-| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 16 | pp512 | 190.18 ± 0.03 |
-| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 16 | tg128 | 40.99 ± 0.36 |
+| arcee 4B Q4_0 | 2.50 GiB | 4.62 B | CPU | 4 | pp128 | 106.03 ± 0.21 |
+| arcee 4B Q4_0 | 2.50 GiB | 4.62 B | CPU | 4 | pp256 | 102.82 ± 0.05 |
+| arcee 4B Q4_0 | 2.50 GiB | 4.62 B | CPU | 4 | pp512 | 95.41 ± 0.18 |
+| arcee 4B Q4_0 | 2.50 GiB | 4.62 B | CPU | 4 | tg128 | 24.15 ± 0.02 |
+| arcee 4B Q4_0 | 2.50 GiB | 4.62 B | CPU | 8 | pp128 | 196.02 ± 0.42 |
+| arcee 4B Q4_0 | 2.50 GiB | 4.62 B | CPU | 8 | pp256 | 190.23 ± 0.34 |
+| arcee 4B Q4_0 | 2.50 GiB | 4.62 B | CPU | 8 | pp512 | 177.14 ± 0.31 |
+| arcee 4B Q4_0 | 2.50 GiB | 4.62 B | CPU | 8 | tg128 | 40.86 ± 0.11 |
+| arcee 4B Q4_0 | 2.50 GiB | 4.62 B | CPU | 16 | pp128 | 346.08 ± 0.62 |
+| arcee 4B Q4_0 | 2.50 GiB | 4.62 B | CPU | 16 | pp256 | 336.72 ± 1.43 |
+| arcee 4B Q4_0 | 2.50 GiB | 4.62 B | CPU | 16 | pp512 | 315.83 ± 0.22 |
+| arcee 4B Q4_0 | 2.50 GiB | 4.62 B | CPU | 16 | tg128 | 62.39 ± 0.20 |

Even with just four threads, the Q4_0 model achieves comfortable generation speeds. On larger instances, you can run multiple concurrent model processes to support parallel workloads.
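The `tg128` rows in the table show how generation throughput scales with thread count. A small sketch using the table's numbers:

```python
# Scaling check on the tg128 (text generation) results from the
# llama-bench table: throughput rises with threads, but not linearly.

tg128 = {4: 24.15, 8: 40.86, 16: 62.39}  # threads -> tokens/s

base_threads, base_tps = 4, tg128[4]
for threads, tps in tg128.items():
    speedup = tps / base_tps
    efficiency = speedup / (threads / base_threads)
    print(f"{threads:2d} threads: {tps:6.2f} t/s, "
          f"{speedup:.2f}x speedup, {efficiency:.0%} parallel efficiency")
```

Going from 4 to 16 threads yields only about a 2.6x speedup on 4x the threads; token generation is typically memory-bandwidth-bound rather than compute-bound, so scaling tails off as more cores contend for the same bandwidth.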

@@ -102,7 +105,7 @@ To reduce runtime, add the `--chunks` flag to evaluate a subset of the data. For

## Run the evaluation as a background script

-Running a full perplexity evaluation on all three models takes about 5 hours. To avoid SSH timeouts and keep the process running after logout, wrap the commands in a shell script and run it in the background.
+Running a full perplexity evaluation on all three models takes about 3 hours. To avoid SSH timeouts and keep the process running after logout, wrap the commands in a shell script and run it in the background.

Create a script named ppl.sh:

@@ -119,13 +122,13 @@ bin/llama-perplexity -m models/afm-4-5b/afm-4-5B-Q4_0.gguf -f wikitext-2-raw/wik
tail -f ppl.sh.log
```

-Here are the full results.
+| Model | Generation speed (batch size 1, 16 vCPUs) | Memory Usage | Perplexity (Wikitext-2) | Perplexity Increase |
+|:-------:|:----------------------:|:------------:|:----------:|:----------------------:|
+| F16 | ~25 tokens per second | ~9 GB | 8.4612 +/- 0.06112 | 0 (baseline) |
+| Q8_0 | ~40 tokens per second | ~5 GB | 8.4776 +/- 0.06128 | +0.19% |
+| Q4_0 | ~60 tokens per second | ~3 GB | 9.1897 +/- 0.06604 | +8.6% |

-| Model | Generation speed (tokens/s, 16 vCPUs) | Memory Usage | Perplexity (Wikitext-2) |
-|:-------:|:----------------------:|:------------:|:----------:|
-| F16 | ~15–16 | ~15 GB | TODO |
-| Q8_0 | ~25 | ~8 GB | TODO |
-| Q4_0 | ~40 | ~4.4 GB | TODO |
+We can see that 8-bit quantization introduces negligible degradation. The 4-bit model does suffer more, but may still serve its purpose for simpler use cases. As always, you should run your own tests and make up your own mind.
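The "Perplexity Increase" column in the new table is the relative change of each quantized model's perplexity against the F16 baseline (lower perplexity is better). A quick check of the arithmetic:

```python
# Verify the "Perplexity Increase" column: the relative change of each
# quantized model's Wikitext-2 perplexity vs. the F16 baseline.

baseline = 8.4612  # F16 perplexity
quantized = {"Q8_0": 8.4776, "Q4_0": 9.1897}

for name, ppl in quantized.items():
    increase = (ppl - baseline) / baseline * 100
    print(f"{name}: +{increase:.2f}%")  # Q8_0 ~ +0.19%, Q4_0 ~ +8.61%
```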

When you have finished your benchmarking and evaluation, make sure to terminate your AWS EC2 instance in the AWS Management Console to avoid incurring unnecessary charges for unused compute resources.

content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/08_conclusion.md

Lines changed: 4 additions & 4 deletions
@@ -9,7 +9,7 @@ layout: learningpathall

## Wrap up your AFM-4.5B deployment

-Congratulations! You have completed the process of deploying the Arcee AFM-4.5B foundation model on AWS Graviton4.
+Congratulations! You have completed the process of deploying the Arcee [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) foundation model on AWS Graviton4.

Here’s a summary of what you built and how you can take your knowledge forward.

@@ -29,8 +29,8 @@ Using this Learning Path, you have:

The benchmarking results demonstrate the power of quantization and Arm-based computing:

-- **Memory efficiency** – the 4-bit model uses only ~4.4 GB of RAM compared to ~15 GB for the full-precision version
-- **Speed improvements** – inference with Q4_0 is 2–3x faster (40+ tokens/sec vs. 15–16 tokens/sec)
+- **Memory efficiency** – the 4-bit model uses only ~3 GB of RAM compared to ~9 GB for the full-precision version
+- **Speed improvements** – inference with Q4_0 is 2.5x faster (~60+ tokens/sec vs. 25 tokens/sec)
- **Cost optimization** – lower memory needs enable smaller, more affordable instances
- **Quality preservation** – the quantized models maintain strong perplexity scores, showing minimal quality loss

@@ -63,4 +63,4 @@ Together, Arcee AI’s foundation models, Llama.cpp’s efficient runtime, and G

From chatbots and content generation to research tools, this stack strikes a balance between performance, cost, and developer control.

-For more information on Arcee AI, and how you can build high-quality, secure, and cost-efficient AI solutions, please visit www.arcee.ai.
+For more information on Arcee AI, and how you can build high-quality, secure, and cost-efficient AI solutions, please visit [www.arcee.ai](https://www.arcee.ai).
