**`content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/00_overview.md`** (+2 −2)
```diff
@@ -8,9 +8,9 @@ layout: learningpathall
 
 ## The AFM-4.5B model
 
-AFM-4.5B is a 4.5-billion-parameter foundation model designed to balance accuracy, efficiency, and broad language coverage. Trained on nearly 7 trillion tokens of carefully filtered data, it performs well across a wide range of languages, including Arabic, English, French, German, Hindi, Italian, Korean, Mandarin, Portuguese, Russian, and Spanish.
+[AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) is a 4.5-billion-parameter foundation model designed to balance accuracy, efficiency, and broad language coverage. Trained on nearly 8 trillion tokens of carefully filtered data, it performs well across a wide range of languages, including Arabic, English, French, German, Hindi, Italian, Korean, Mandarin, Portuguese, Russian, and Spanish.
 
-In this Learning Path, you'll deploy AFM-4.5B using [Llama.cpp](https://github.com/ggerganov/llama.cpp) on an Arm-based AWS Graviton4 instance. You'll walk through the full workflow, from setting up your environment and compiling the runtime, to downloading, quantizing, and running inference on the model. You'll also evaluate model quality using perplexity, a common metric for measuring how well a language model predicts text.
+In this Learning Path, you'll deploy [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) using [Llama.cpp](https://github.com/ggerganov/llama.cpp) on an Arm-based AWS Graviton4 instance. You'll walk through the full workflow, from setting up your environment and compiling the runtime, to downloading, quantizing, and running inference on the model. You'll also evaluate model quality using perplexity, a common metric for measuring how well a language model predicts text.
 
 This hands-on guide helps developers build cost-efficient, high-performance LLM applications on modern Arm server infrastructure using open-source tools and real-world deployment practices.
```
**`content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/05_downloading_and_optimizing_afm45b.md`** (+31 −5)
```diff
@@ -6,10 +6,19 @@ weight: 7
 layout: learningpathall
 ---
 
-In this step, you'll download the AFM-4.5B model from Hugging Face, convert it to the GGUF format for compatibility with `llama.cpp`, and generate quantized versions to optimize memory usage and improve inference speed.
+In this step, you'll download the [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) model from Hugging Face, convert it to the GGUF format for compatibility with `llama.cpp`, and generate quantized versions to optimize memory usage and improve inference speed.
+
+**Note: if you want to skip the model optimization process, [GGUF](https://huggingface.co/arcee-ai/AFM-4.5B-GGUF) versions are available.**
 
 Make sure to activate your virtual environment before running any commands. The instructions below walk you through downloading and preparing the model for efficient use on AWS Graviton4.
 
+## Signing up to Hugging Face
+
+In order to download AFM-4.5B, you will need:
+
+- a Hugging Face account: you can sign up at [https://huggingface.co](https://huggingface.co)
+- a read-only Hugging Face token: once logged in, you can create one at [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens). Don't forget to store it, as you will only be able to view it once.
+- to accept the terms of AFM-4.5B at [https://huggingface.co/arcee-ai/AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B)
```
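To put the "optimize memory usage" claim in this hunk into numbers, here is a back-of-envelope size estimate. This is a sketch, not part of the Learning Path; the bits-per-weight figures are approximations for llama.cpp quantization formats (Q4_0 stores roughly 4.5 bits per weight and Q8_0 roughly 8.5 once block scales are included).

```python
# Approximate GGUF file sizes for a 4.5B-parameter model under
# different quantization schemes. Bits-per-weight values are rough
# estimates that include per-block scale overhead.
PARAMS = 4.5e9

def gguf_size_gb(bits_per_weight: float, params: float = PARAMS) -> float:
    """Approximate model file size in gigabytes (decimal GB)."""
    return params * bits_per_weight / 8 / 1e9

for name, bpw in [("F16", 16.0), ("Q8_0", 8.5), ("Q4_0", 4.5)]:
    print(f"{name}: ~{gguf_size_gb(bpw):.1f} GB")
```

The estimates line up with the sizes quoted later in the PR: roughly 9 GB for the 16-bit model and around 3 GB for the 4-bit one.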
**`content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/06_running_inference.md`** (+24 −23)
```diff
@@ -6,7 +6,7 @@ weight: 8
 layout: learningpathall
 ---
 
-Now that you have the AFM-4.5B models in GGUF format, you can run inference using various Llama.cpp tools. In this step, you'll explore how to generate text, benchmark performance, and interact with the model through both command-line and HTTP APIs.
+Now that you have the [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) models in GGUF format, you can run inference using various Llama.cpp tools. In this step, you'll explore how to generate text, benchmark performance, and interact with the model through both command-line and HTTP APIs.
 
 ## Use llama-cli for interactive text generation
```
````diff
@@ -55,14 +55,15 @@ To exit the session, type `Ctrl+C` or `/bye`.
 You'll then see performance metrics like this:
 
 ```bash
-llama_perf_sampler_print:    sampling time =      26.66 ms /   356 runs   (    0.07 ms per token, 13352.84 tokens per second)
-llama_perf_context_print:        load time =     782.72 ms
-llama_perf_context_print: prompt eval time =     392.40 ms /    24 tokens (   16.35 ms per token,    61.16 tokens per second)
-llama_perf_context_print:        eval time =   13173.66 ms /   331 runs   (   39.80 ms per token,    25.13 tokens per second)
-llama_perf_context_print:       total time =  129945.08 ms /   355 tokens
+llama_perf_sampler_print:    sampling time =       9.47 ms /   119 runs   (    0.08 ms per token, 12569.98 tokens per second)
+llama_perf_context_print:        load time =     616.69 ms
+llama_perf_context_print: prompt eval time =     344.39 ms /    23 tokens (   14.97 ms per token,    66.79 tokens per second)
+llama_perf_context_print:        eval time =    9289.81 ms /   352 runs   (   26.39 ms per token,    37.89 tokens per second)
+llama_perf_context_print:       total time =   17446.13 ms /   375 tokens
+llama_perf_context_print:    graphs reused = 0
 ```
 
-In this example, the 8-bit model running on 16 threads generated 355 tokens, at ~25 tokens per second (`eval time`).
+In this example, the 8-bit model running on 16 threads generated 375 tokens, at ~37 tokens per second (`eval time`).
 
 ## Run a non-interactive prompt
````
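As a sanity check on the updated numbers in the hunk above, the tokens-per-second figure follows directly from the `eval time` line (a throwaway calculation, not part of the Learning Path):

```python
# Recompute throughput from the llama_perf_context_print eval line:
#   eval time = 9289.81 ms / 352 runs
eval_ms, runs = 9289.81, 352

ms_per_token = eval_ms / runs            # milliseconds per generated token
tokens_per_second = runs / (eval_ms / 1000)

print(f"{ms_per_token:.2f} ms per token, {tokens_per_second:.2f} tokens per second")
# prints: 26.39 ms per token, 37.89 tokens per second
```

This matches the ~37 tokens/second quoted in the updated prose.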
```diff
@@ -77,7 +78,7 @@ This command:
 - Sends a one-time prompt using `-p`
 - Prints the generated response and exits
 
-The 4-bit model delivers faster generation—expect around 40 tokens per second on Graviton4. This shows how a more aggressive quantization recipe helps deliver faster performance.
+The 4-bit model delivers faster generation—expect around 60 tokens per second on Graviton4. This shows how a more aggressive quantization recipe helps deliver faster performance.
 
 ## Use llama-server for API access
```
```diff
@@ -130,29 +131,29 @@ The response includes the model's reply and performance metrics:
     "index": 0,
     "message": {
       "role": "assistant",
-      "content": "Quantum computing uses quantum-mechanical phenomena, such as superposition and entanglement, to perform calculations. It allows for multiple possibilities to exist simultaneously, which can speed up certain processes. Unlike classical computers, quantum computers can solve complex problems and simulate systems more efficiently. Quantum bits (qubits) store information, and quantum gates perform operations. Quantum computing has potential applications in fields like cryptography, optimization, and materials science. Its development is an active area of research, with companies like IBM, Google, and Microsoft investing in quantum computing technology."
+      "content": "Quantum computing uses quantum-mechanical phenomena like superposition and entanglement to solve complex problems much faster than classical computers. Instead of binary bits (0 or 1), quantum bits (qubits) can exist in multiple states simultaneously, allowing for parallel processing of vast combinations of possibilities. This enables quantum computers to perform certain calculations exponentially faster, particularly in areas like cryptography, optimization, and drug discovery. However, quantum systems are fragile and prone to errors, requiring advanced error correction techniques. Current quantum computers are still in early stages but show promise for transformative applications."
```
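The API response quoted above comes from llama-server's OpenAI-compatible endpoint. A minimal client sketch in Python using only the standard library (assumes llama-server is listening on `localhost:8080`; the host, prompt, and `max_tokens` value are illustrative choices, not from the PR):

```python
import json
import urllib.request

def build_chat_request(prompt: str, host: str = "http://localhost:8080") -> urllib.request.Request:
    """Build an OpenAI-compatible chat completion request for llama-server."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Explain quantum computing in simple terms.")

# Uncomment to send the request to a running llama-server instance:
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```

Because the endpoint follows the OpenAI chat-completions shape, any OpenAI-compatible SDK can be pointed at the same URL.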
**`content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/07_evaluating_the_quantized_models.md`**

```diff
 Even with just four threads, the Q4_0 model achieves comfortable generation speeds. On larger instances, you can run multiple concurrent model processes to support parallel workloads.
@@ -102,7 +105,7 @@ To reduce runtime, add the `--chunks` flag to evaluate a subset of the data. For
 
 ## Run the evaluation as a background script
 
-Running a full perplexity evaluation on all three models takes about 5 hours. To avoid SSH timeouts and keep the process running after logout, wrap the commands in a shell script and run it in the background.
+Running a full perplexity evaluation on all three models takes about 3 hours. To avoid SSH timeouts and keep the process running after logout, wrap the commands in a shell script and run it in the background.
 
 We can see that 8-bit quantization introduces negligible degradation. The 4-bit model does suffer more, but may still serve its purpose for simpler use cases. As always, you should run your own tests and make up your own mind.
 
 When you have finished your benchmarking and evaluation, make sure to terminate your AWS EC2 instance in the AWS Management Console to avoid incurring unnecessary charges for unused compute resources.
```
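For context on the perplexity metric this file evaluates: perplexity is the exponential of the average negative log-likelihood per token, so lower is better. A hand-rolled sketch (the token log-probabilities below are made-up illustrative values, not model output):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Illustrative (made-up) natural-log token probabilities:
logprobs = [-0.9, -1.6, -0.3, -2.1, -1.1]
print(f"perplexity = {perplexity(logprobs):.2f}")
# prints: perplexity = 3.32
```

A model that assigned probability 1 to every token would score a perplexity of exactly 1, the theoretical floor; quantization nudges real scores upward, which is the degradation the evaluation measures.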
**`content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/08_conclusion.md`** (+4 −4)
```diff
@@ -9,7 +9,7 @@ layout: learningpathall
 
 ## Wrap up your AFM-4.5B deployment
 
-Congratulations! You have completed the process of deploying the Arcee AFM-4.5B foundation model on AWS Graviton4.
+Congratulations! You have completed the process of deploying the Arcee [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) foundation model on AWS Graviton4.
 
 Here's a summary of what you built and how you can take your knowledge forward.
@@ -29,8 +29,8 @@ Using this Learning Path, you have:
 
 The benchmarking results demonstrate the power of quantization and Arm-based computing:
 
-- **Memory efficiency** – the 4-bit model uses only ~4.4 GB of RAM compared to ~15 GB for the full-precision version
-- **Speed improvements** – inference with Q4_0 is 2–3x faster (40+ tokens/sec vs. 15–16 tokens/sec)
+- **Memory efficiency** – the 4-bit model uses only ~3 GB of RAM compared to ~9 GB for the full-precision version
+- **Speed improvements** – inference with Q4_0 is 2.5x faster (~60+ tokens/sec vs. 25 tokens/sec)
 - **Quality preservation** – the quantized models maintain strong perplexity scores, showing minimal quality loss
@@ -63,4 +63,4 @@ Together, Arcee AI's foundation models, Llama.cpp's efficient runtime, and G
 
 From chatbots and content generation to research tools, this stack strikes a balance between performance, cost, and developer control.
 
-For more information on Arcee AI, and how you can build high-quality, secure, and cost-efficient AI solutions, please visit www.arcee.ai.
+For more information on Arcee AI, and how you can build high-quality, secure, and cost-efficient AI solutions, please visit [www.arcee.ai](https://www.arcee.ai).
```