**File:** `content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-gcp/05_downloading_and_optimizing_afm45b.md`
---
title: Download and optimize the AFM-4.5B model for Llama.cpp
weight: 7

### FIXED, DO NOT MODIFY
layout: learningpathall
---

In this step, you’ll download the [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) model from Hugging Face, convert it to the GGUF format for compatibility with Llama.cpp, and generate quantized versions to optimize memory usage and inference speed.

**Note:** If you want to skip model optimization, pre-converted [GGUF versions](https://huggingface.co/arcee-ai/AFM-4.5B-GGUF) are available.

Make sure your Python virtual environment is activated before running commands. These instructions show you how to prepare AFM-4.5B for efficient inference on Google Cloud Axion Arm64 with Llama.cpp.
## Sign up to Hugging Face

To download AFM-4.5B, you need to:

- Sign up for a Hugging Face account at [https://huggingface.co](https://huggingface.co)
- Create a read-only token at [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) (store it securely; it is only shown once)
- Accept the model terms at [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B)
## Install Hugging Face libraries

```bash
pip install huggingface_hub hf_xet --upgrade
```

This command installs the latest versions of:

- `huggingface_hub`: Python client for downloading models and datasets
- `hf_xet`: Xet storage plugin that speeds up downloads of large model files from Hugging Face

These tools include the `hf` command-line interface you'll use next.
## Log in to Hugging Face Hub

```bash
hf auth login
```

When prompted with `Enter your token (input will not be visible):`, enter the token you created above, and answer 'n' to "Add token as git credential? (Y/n)".
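The download, GGUF conversion, and quantization commands are elided from this diff. The following is a sketch of the typical pipeline; the file names match those used later on this page, but the exact flags are assumptions based on standard llama.cpp tooling (`convert_hf_to_gguf.py` and `llama-quantize`):

```shell
# Download the model weights from Hugging Face (requires the login above)
hf download arcee-ai/AFM-4.5B --local-dir models/afm-4-5b

# Convert the Hugging Face checkpoint to a full-precision GGUF file
python convert_hf_to_gguf.py models/afm-4-5b \
  --outfile models/afm-4-5b/afm-4-5B-F16.gguf --outtype f16

# Quantize the F16 model to 4-bit (Q4_0)
bin/llama-quantize models/afm-4-5b/afm-4-5B-F16.gguf \
  models/afm-4-5b/afm-4-5B-Q4_0.gguf Q4_0
```

Run these from your llama.cpp build directory; adjust paths to match your layout.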
This command creates a 4-bit quantized version of the model:

- The quantized model will use less memory and run faster, though with a small reduction in accuracy.
- The output file will be `afm-4-5B-Q4_0.gguf`.
## Arm optimizations for quantized models

Arm has contributed optimized kernels for Q4_0 that use Neoverse V2 instruction sets. These low-level routines accelerate math operations, delivering strong performance on Axion.
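The corresponding 8-bit command is also elided from the diff; a sketch, assuming the same `llama-quantize` tool and file layout as above:

```shell
# Quantize the F16 model to 8-bit (Q8_0)
bin/llama-quantize models/afm-4-5b/afm-4-5B-F16.gguf \
  models/afm-4-5b/afm-4-5B-Q8_0.gguf Q8_0
```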
This command creates an 8-bit quantized version of the model.

Similar to Q4_0, Arm has contributed optimized kernels for Q8_0 quantization that take advantage of Neoverse V2 instruction sets. These optimizations provide excellent performance for 8-bit operations while maintaining higher accuracy compared to 4-bit quantization.
## AFM-4.5B models ready for inference

After completing these steps, you'll have three versions of the AFM-4.5B model in `models/afm-4-5b`:

- `afm-4-5B-F16.gguf` - the original full-precision model (~15GB)
- `afm-4-5B-Q4_0.gguf` - 4-bit quantized version (~4.4GB) for memory-constrained environments
- `afm-4-5B-Q8_0.gguf` - 8-bit quantized version (~8GB) for balanced performance and memory usage

These models are now ready to use with the Llama.cpp inference engine on Google Cloud Axion Arm64.
**File:** `content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-gcp/06_running_inference.md`
---
title: Run inference with AFM-4.5B using Llama.cpp
weight: 8

### FIXED, DO NOT MODIFY
layout: learningpathall
---

Now that you have the [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) models in GGUF format, you can run inference on Google Cloud Axion Arm64 using various Llama.cpp tools. In this step, you’ll generate text, benchmark performance, and interact with the model through both command-line and HTTP APIs.

## Use llama-cli for interactive inference

The `llama-cli` tool provides an interactive command-line interface for text generation. This is useful for quick testing and exploring model behavior.
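The full `llama-cli` invocation is elided from this diff; a plausible sketch, assuming the Q8_0 model built earlier (the model choice is an assumption):

```shell
# Start an interactive chat session with the 8-bit model
bin/llama-cli -m models/afm-4-5b/afm-4-5B-Q8_0.gguf -n 256 --color
```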
This command uses the following options:

- `-m`: specifies the model file to load
- `-n 256`: sets the maximum number of tokens to generate per response
- `--color`: enables colored terminal output

You’ll be prompted to enter text, and the model generates a response.

By default, `llama-cli` uses 16 vCPUs. You can change this with `-t <number>`.
### Example interactive session

```
llama_perf_context_print: total time = 17446.13 ms / 375 tokens
llama_perf_context_print: graphs reused = 0
```

In this example, the 8-bit model running on 16 threads generated 375 tokens at ~37 tokens per second (`eval time`).
## Run a one-time prompt with llama-cli

You can also run `llama-cli` in non-interactive mode with a one-shot prompt:

```bash
bin/llama-cli -m models/afm-4-5b/afm-4-5B-Q4_0.gguf -n 256 --color -no-cnv -p "Give me a brief explanation of the attention mechanism in transformer models."
```
This command:

- Loads the 4-bit model
- Disables conversation mode with `-no-cnv`
- Sends a one-time prompt with `-p`
- Prints the response and exits

On Axion, the 4-bit model generates ~60 tokens per second, showing the speed benefit of more aggressive quantization.
## Use llama-server for API-based inference

The `llama-server` tool runs the model as a web server with an OpenAI-compatible API. This allows you to integrate the model into applications or batch jobs via HTTP requests.
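The launch command itself is elided from the diff; a sketch consistent with the port and context size described in the next lines (the flag spelling is an assumption based on standard `llama-server` options):

```shell
# Serve the 8-bit model on port 8080 with a 4096-token context window
bin/llama-server -m models/afm-4-5b/afm-4-5B-Q8_0.gguf --port 8080 -c 4096
```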
This starts a local server that:

- Accepts connections on port 8080
- Supports a 4096-token context window
### Send an API request

Once the server is running, you can make requests using curl, or any HTTP client.

Open a new terminal on the Google Cloud instance, and run:

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "afm-4-5b",
    "messages": [
      {
        "role": "user",
        "content": "Give me a brief explanation of the attention mechanism in transformer models."
      }
    ]
  }'
```
The response includes the model’s reply and performance metrics.

You’ve now successfully:

- Run [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) in interactive and one-shot modes
- Compared performance with different quantized models on Axion
- Served the model as an OpenAI-compatible API endpoint

You can also use the [OpenAI Python client](https://github.com/openai/openai-python) to send requests programmatically, enabling features like streaming responses.
**File:** `content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-gcp/07_evaluating_the_quantized_models.md`
---
title: Benchmark and evaluate AFM-4.5B quantized models on Axion
weight: 9

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Benchmark AFM-4.5B performance with llama-bench

Use the [`llama-bench`](https://github.com/ggml-org/llama.cpp/tree/master/tools/llama-bench) tool to measure model performance on Google Cloud Axion Arm64, including inference speed and memory usage.

## Benchmark full, 8-bit, and 4-bit models

Benchmark multiple model versions to compare performance:
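The benchmark commands are elided from this diff; a sketch that sweeps thread counts across all three models (the thread list is illustrative, and `-t` accepts a comma-separated list in `llama-bench`):

```shell
# Benchmark each model variant at several thread counts
for M in afm-4-5B-F16 afm-4-5B-Q8_0 afm-4-5B-Q4_0; do
  bin/llama-bench -m models/afm-4-5b/${M}.gguf -t 4,8,16
done
```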
Here’s an example of how performance scales across threads and prompt sizes.

Even with just four threads, the Q4_0 model achieves comfortable generation speeds. On larger instances, you can run multiple concurrent model processes to support parallel workloads.

For batch inference, use [`llama-batched-bench`](https://github.com/ggml-org/llama.cpp/tree/master/tools/batched-bench).
## Evaluate AFM-4.5B quality with llama-perplexity

Perplexity measures how well a language model predicts text. It gives you insight into the model’s confidence and predictive ability, representing the average number of possible next tokens the model considers when predicting each word. Use the `llama-perplexity` tool to measure how well each model predicts the next token in a sequence:
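The actual command is elided from the diff; a sketch, assuming the WikiText-2 test set (`wiki.test.raw`) commonly used in llama.cpp perplexity examples, with the `--chunks` flag described in the notice below:

```shell
# Evaluate the 4-bit model on the first 50 text blocks of the dataset
bin/llama-perplexity -m models/afm-4-5b/afm-4-5B-Q4_0.gguf \
  -f wiki.test.raw --chunks 50
```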
{{< notice note >}}
To reduce runtime, add the `--chunks` flag to evaluate a subset of the data. For example: `--chunks 50` runs the evaluation on the first 50 text blocks.
{{< /notice >}}

## Run perplexity evaluation in the background

Running a full perplexity evaluation on all three models takes about 3 hours. To avoid SSH timeouts and keep the process running after logout, wrap the commands in a shell script and run it in the background.
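One way to sketch this (the script name, log paths, and dataset file are illustrative):

```shell
# Write a script that evaluates all three model variants in sequence
cat > evaluate_models.sh << 'EOF'
#!/bin/bash
for M in afm-4-5B-F16 afm-4-5B-Q8_0 afm-4-5B-Q4_0; do
  bin/llama-perplexity -m models/afm-4-5b/${M}.gguf \
    -f wiki.test.raw > perplexity_${M}.log 2>&1
done
EOF
chmod +x evaluate_models.sh

# nohup keeps the script alive after you log out; it runs in the background
nohup ./evaluate_models.sh &
```

You can log out and later inspect the `perplexity_*.log` files for the results.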
## Review your AFM-4.5B deployment on Google Cloud Axion

Congratulations! You have successfully deployed the [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) foundation model on Google Cloud Axion Arm64.

Here’s a summary of what you built and how to extend it.

Using this Learning Path, you have:
The benchmarking results demonstrate the power of quantization and Arm-based computing:

- **Quality preservation** – the quantized models maintain strong perplexity scores, showing minimal quality loss

## Benefits of Google Cloud Axion Arm64

Google Cloud Axion processors, based on Arm Neoverse V2, provide:

- Better performance per watt than x86 alternatives
- 20–40% cost savings for compute-intensive workloads
- Optimized memory bandwidth and cache hierarchy for ML tasks
- Native Arm64 support for modern machine learning frameworks
## Next steps with AFM-4.5B on Axion

Now that you have a working deployment, you can extend it further.

**Production deployment**:

- Add auto-scaling for high availability
- Implement load balancing for multiple instances
- Enable monitoring and logging with Cloud Monitoring and Cloud Logging
- Secure API endpoints with authentication

**Application development**:

- Build a web app with the `llama-server` API
- Create a chatbot or assistant
- Develop content generation tools
- Integrate AFM-4.5B into existing apps via REST APIs

Together, Arcee AI’s foundation models, Llama.cpp’s efficient runtime, and Google Cloud Axion provide a scalable, cost-efficient platform for AI.

From chatbots and content generation to research tools, this stack delivers a balance of performance, cost, and developer control.

For more information on Arcee AI, and how you can build high-quality, secure, and cost-efficient AI solutions, visit [www.arcee.ai](https://www.arcee.ai).