Content dev
1 parent b36a68b commit dc6c7c4

File tree

4 files changed: +98 -93 lines changed

content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-gcp/05_downloading_and_optimizing_afm45b.md

Lines changed: 22 additions & 21 deletions
@@ -1,38 +1,39 @@
 ---
-title: Download and optimize the AFM-4.5B model
+title: Download and optimize the AFM-4.5B model for Llama.cpp
 weight: 7

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---

-In this step, you’ll download the [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) model from Hugging Face, convert it to the GGUF format for compatibility with `llama.cpp`, and generate quantized versions to optimize memory usage and improve inference speed.
+In this step, you’ll download the [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) model from Hugging Face, convert it to the GGUF format for compatibility with Llama.cpp, and generate quantized versions to optimize memory usage and inference speed.

-**Note: if you want to skip the model optimization process, [GGUF](https://huggingface.co/arcee-ai/AFM-4.5B-GGUF) versions are available.**
+**Note:** If you want to skip model optimization, pre-converted [GGUF versions](https://huggingface.co/arcee-ai/AFM-4.5B-GGUF) are available.

-Make sure to activate your virtual environment before running any commands. The instructions below walk you through downloading and preparing the model for efficient use on Google Axion.
+Make sure your Python virtual environment is activated before running commands. These instructions show you how to prepare AFM-4.5B for efficient inference on Google Cloud Axion Arm64 with Llama.cpp.

-## Signing up to Hugging Face
+## Sign up to Hugging Face

-In order to download AFM-4.5B, you will need to:
-- sign up for a Hugging Face account at [https://huggingface.co](https://huggingface.co)
-- create a read-only Hugging Face token at [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens). Don't forget to store it, as you will only be able to view it once.
-- accept the terms of AFM-4.5B at [https://huggingface.co/arcee-ai/AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B)
+To download AFM-4.5B, you need to:

-## Install the Hugging Face libraries
+- Sign up for a Hugging Face account at [https://huggingface.co](https://huggingface.co)
+- Create a read-only token at [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) (store it securely; it is only shown once)
+- Accept the model terms at [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B)
+
+## Install Hugging Face libraries

 ```bash
 pip install huggingface_hub hf_xet --upgrade
 ```

-This command installs the most up to date versions of:
+This installs:

-- `huggingface_hub`: Python client for downloading models and datasets
-- `hf_xet`: Git extension for fetching large model files stored on Hugging Face
+- `huggingface_hub`: Python client for downloading models and datasets
+- `hf_xet`: Git extension for fetching large model files from Hugging Face

-These tools include the `hf` command-line interface you'll use next.
+These tools include the `hf` CLI.

-## Login to the Hugging Face Hub
+## Log in to Hugging Face Hub

 ```bash
 hf auth login
@@ -49,7 +50,7 @@ Enter your token (input will not be visible):

 Please enter the token you created above, and answer 'n' to "Add token as git credential? (Y/n)".

-## Download the AFM-4.5B model
+## Download AFM-4.5B from Hugging Face

 ```bash
 hf download arcee-ai/afm-4.5B --local-dir models/afm-4-5b/
@@ -60,7 +61,7 @@ This command downloads the model to the `models/afm-4-5b` directory:
 - The download includes the model weights, configuration files, and tokenizer data.
 - This is a 4.5 billion parameter model, so the download can take several minutes depending on your internet connection.

-## Convert to GGUF format
+## Convert AFM-4.5B to GGUF format

 ```bash
 python3 convert_hf_to_gguf.py models/afm-4-5b
@@ -76,7 +77,7 @@ This command converts the downloaded Hugging Face model to GGUF (GGML Universal

 Next, deactivate the Python virtual environment as future commands won't require it.

-## Create Q4_0 Quantized Version
+## Create a Q4_0 quantized version

 ```bash
 bin/llama-quantize models/afm-4-5b/afm-4-5B-F16.gguf models/afm-4-5b/afm-4-5B-Q4_0.gguf Q4_0
@@ -90,7 +91,7 @@ This command creates a 4-bit quantized version of the model:
 - The quantized model will use less memory and run faster, though with a small reduction in accuracy.
 - The output file will be `afm-4-5B-Q4_0.gguf`.

-## Arm optimization
+## Arm optimizations for quantized models

 Arm has contributed optimized kernels for Q4_0 that use Neoverse V2 instruction sets. These low-level routines accelerate math operations, delivering strong performance on Axion.

@@ -113,11 +114,11 @@ This command creates an 8-bit quantized version of the model:

 Similar to Q4_0, Arm has contributed optimized kernels for Q8_0 quantization that take advantage of Neoverse V2 instruction sets. These optimizations provide excellent performance for 8-bit operations while maintaining higher accuracy compared to 4-bit quantization.

-## Model files ready for inference
+## AFM-4.5B models ready for inference

 After completing these steps, you'll have three versions of the AFM-4.5B model in `models/afm-4-5b`:
 - `afm-4-5B-F16.gguf` - The original full-precision model (~15GB)
 - `afm-4-5B-Q4_0.gguf` - 4-bit quantized version (~4.4GB) for memory-constrained environments
 - `afm-4-5B-Q8_0.gguf` - 8-bit quantized version (~8GB) for balanced performance and memory usage

-These models are now ready to be used with the `llama.cpp` inference engine for text generation and other language model tasks.
+These models are now ready to use with the Llama.cpp inference engine on Google Cloud Axion Arm64.

content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-gcp/06_running_inference.md

Lines changed: 29 additions & 28 deletions
@@ -1,15 +1,16 @@
 ---
-title: Run inference with AFM-4.5B
+title: Run inference with AFM-4.5B using Llama.cpp
 weight: 8

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---

-Now that you have the [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) models in GGUF format, you can run inference using various Llama.cpp tools. In this step, you'll explore how to generate text, benchmark performance, and interact with the model through both command-line and HTTP APIs.
+Now that you have the [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) models in GGUF format, you can run inference on Google Cloud Axion Arm64 using various Llama.cpp tools. In this step, you’ll generate text, benchmark performance, and interact with the model through both command-line and HTTP APIs.

+## Use llama-cli for interactive inference

-## Use llama-cli for interactive text generation
+The `llama-cli` tool provides an interactive command-line interface for text generation. This is useful for quick testing and exploring model behavior.

 The `llama-cli` tool provides an interactive command-line interface for text generation. This is ideal for quick testing and hands-on exploration of the model's behavior.

@@ -19,14 +20,14 @@ The `llama-cli` tool provides an interactive command-line interface for text gen
 bin/llama-cli -m models/afm-4-5b/afm-4-5B-Q8_0.gguf -n 256 --color
 ```

-This command starts an interactive session:
+This starts an interactive session:

-- `-m` (model file path) specifies the model file to load
-- `-n 256` sets the maximum number of tokens to generate per response
-- `--color` enables colored terminal output
-- The tool will prompt you to enter text, and the model will generate a response
+- `-m`: specifies the model file to load
+- `-n 256`: sets the maximum tokens per response
+- `--color`: enables colored terminal output
+- You’ll be prompted to enter text, and the model generates a response

-In this example, `llama-cli` uses 16 vCPUs. You can try different values with `-t <number>`.
+By default, `llama-cli` uses 16 vCPUs. You can change this with `-t <number>`.

 ### Example interactive session

@@ -63,28 +64,30 @@ llama_perf_context_print: total time = 17446.13 ms / 375 tokens
 llama_perf_context_print: graphs reused = 0
 ```

-In this example, the 8-bit model running on 16 threads generated 375 tokens, at ~37 tokens per second (`eval time`).
+Here, the 8-bit model on 16 threads produced ~37 tokens per second.

-## Run a non-interactive prompt
+## Run a one-time prompt with llama-cli

-You can also use `llama-cli` in one-shot mode with a prompt:
+You can run `llama-cli` in non-interactive mode:

 ```bash
 bin/llama-cli -m models/afm-4-5b/afm-4-5B-Q4_0.gguf -n 256 --color -no-cnv -p "Give me a brief explanation of the attention mechanism in transformer models."
 ```
+
 This command:
-- Loads the 4-bit model
-- Disables conversation mode using `-no-cnv`
-- Sends a one-time prompt using `-p`
-- Prints the generated response and exits

-The 4-bit model delivers faster generation—expect around 60 tokens per second on Axion. This shows how a more aggressive quantization recipe helps deliver faster performance.
+- Loads the 4-bit model
+- Disables conversation mode with `-no-cnv`
+- Sends a one-time prompt with `-p`
+- Prints the response and exits
+
+On Axion, the 4-bit model generates ~60 tokens per second, showing the speed benefit of aggressive quantization.

-## Use llama-server for API access
+## Use llama-server for API-based inference

-The `llama-server` tool runs the model as a web server compatible with the OpenAI API format, allowing you to make HTTP requests for text generation. This is useful for integrating the model into applications or for batch processing.
+The `llama-server` tool runs the model as a web server with an OpenAI-compatible API. This allows integration with applications or batch jobs via HTTP requests.

-## Start the server
+### Start llama-server

 ```bash
 bin/llama-server -m models/afm-4-5b/afm-4-5B-Q4_0.gguf \
@@ -99,16 +102,14 @@ This starts a local server that:
 - Accepts connections on port 8080
 - Supports a 4096-token context window

-### Make an API request
+### Send an API request

 Once the server is running, you can make requests using curl, or any HTTP client.

 Open a new terminal on the Google Cloud instance, and run:

 ```bash
-curl -X POST http://localhost:8080/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
+curl -X POST http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
   "model": "afm-4-5b",
   "messages": [
     {
@@ -162,8 +163,8 @@ The response includes the model’s reply and performance metrics:

 You’ve now successfully:

-- Run [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) in interactive and non-interactive modes
-- Tested performance with different quantized models
-- Served the model as an OpenAI-compatible API endpoint
+- Run [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) in interactive and one-shot modes
+- Compared performance with different quantized models on Axion
+- Served the model as an OpenAI-compatible API endpoint

-You can also interact with the server using Python with the [OpenAI client library](https://github.com/openai/openai-python), enabling streaming responses, and other features.
+You can also use the [OpenAI Python client](https://github.com/openai/openai-python) to send requests programmatically, enabling features like streaming responses.
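Editor's note: to complement the curl example in this file, here is a dependency-free Python sketch that posts the same OpenAI-style payload to `llama-server` using only the standard library. The `build_chat_request` and `chat` helpers are illustrative additions, and the server must already be running on port 8080:

```python
import json
import urllib.request

def build_chat_request(prompt: str, model: str = "afm-4-5b") -> dict:
    # Same payload shape as the curl example: OpenAI-style chat completion
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt: str, url: str = "http://localhost:8080/v1/chat/completions") -> str:
    payload = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # OpenAI-compatible servers return the reply in choices[0].message.content
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Give me a brief explanation of the attention mechanism."))
```

For streaming and richer options, the OpenAI client linked above is the better fit; this sketch only shows the request and response shape.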

content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-gcp/07_evaluating_the_quantized_models.md

Lines changed: 23 additions & 19 deletions
@@ -1,18 +1,18 @@
 ---
-title: Benchmark and evaluate the quantized models
+title: Benchmark and evaluate AFM-4.5B quantized models on Axion
 weight: 9

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---

-## Benchmark performance using llama-bench
+## Benchmark AFM-4.5B performance with llama-bench

-Use the [`llama-bench`](https://github.com/ggml-org/llama.cpp/tree/master/tools/llama-bench) tool to measure model performance, including inference speed and memory usage.
+Use the [`llama-bench`](https://github.com/ggml-org/llama.cpp/tree/master/tools/llama-bench) tool to measure model performance on Google Cloud Axion Arm64, including inference speed and memory usage.

-## Run basic benchmarks
+## Benchmark full, 8-bit, and 4-bit models

-Benchmark multiple model versions to compare performance:
+Run benchmarks on multiple versions of AFM-4.5B:

 ```bash
 # Benchmark the full-precision model
@@ -25,16 +25,17 @@ bin/llama-bench -m models/afm-4-5b/afm-4-5B-Q8_0.gguf
 bin/llama-bench -m models/afm-4-5b/afm-4-5B-Q4_0.gguf
 ```

-Typical results on a 16 vCPU instance:
-- **F16 model**: ~25 tokens/second, ~9GB memory usage
-- **Q8_0 model**: ~40 tokens/second, ~5GB memory usage
-- **Q4_0 model**: ~60 tokens/second, ~3GB memory usage
+Typical results on a 16 vCPU Axion instance:

-Your actual results might vary depending on your specific instance configuration and system load.
+- **F16 model**: ~25 tokens/sec, ~9GB memory
+- **Q8_0 model**: ~40 tokens/sec, ~5GB memory
+- **Q4_0 model**: ~60 tokens/sec, ~3GB memory

-## Run advanced benchmarks
+Results vary depending on system configuration and load.

-Use this command to benchmark performance across prompt sizes and thread counts:
+## Run advanced benchmarks with threads and prompts
+
+Benchmark across prompt sizes and thread counts:

 ```bash
 bin/llama-bench -m models/afm-4-5b/afm-4-5B-Q4_0.gguf \
@@ -68,10 +69,11 @@ Here’s an example of how performance scales across threads and prompt sizes (p

 Even with just four threads, the Q4_0 model achieves comfortable generation speeds. On larger instances, you can run multiple concurrent model processes to support parallel workloads.

-To benchmark batch inference, use [`llama-batched-bench`](https://github.com/ggml-org/llama.cpp/tree/master/tools/batched-bench).
+For batch inference, use [`llama-batched-bench`](https://github.com/ggml-org/llama.cpp/tree/master/tools/batched-bench).

+## Evaluate AFM-4.5B quality with llama-perplexity

-## Evaluate model quality using llama-perplexity
+Perplexity measures how well a model predicts text:

 Use the llama-perplexity tool to measure how well each model predicts the next token in a sequence. Perplexity is a measure of how well a language model predicts text. It gives you insight into the model’s confidence and predictive ability, representing the average number of possible next tokens the model considers when predicting each word:

@@ -103,24 +105,26 @@ bin/llama-perplexity -m models/afm-4-5b/afm-4-5B-Q4_0.gguf -f wikitext-2-raw/wik
 To reduce runtime, add the `--chunks` flag to evaluate a subset of the data. For example: `--chunks 50` runs the evaluation on the first 50 text blocks.
 {{< /notice >}}

-## Run the evaluation as a background script
+## Run perplexity evaluation in the background

 Running a full perplexity evaluation on all three models takes about 3 hours. To avoid SSH timeouts and keep the process running after logout, wrap the commands in a shell script and run it in the background.

 Create a script named ppl.sh:

-For example:
 ```bash
 #!/bin/bash
 # ppl.sh
 bin/llama-perplexity -m models/afm-4-5b/afm-4-5B-F16.gguf -f wikitext-2-raw/wiki.test.raw
 bin/llama-perplexity -m models/afm-4-5b/afm-4-5B-Q8_0.gguf -f wikitext-2-raw/wiki.test.raw
 bin/llama-perplexity -m models/afm-4-5b/afm-4-5B-Q4_0.gguf -f wikitext-2-raw/wiki.test.raw
 ```
+
+Run it:
+
 ```bash
-nohup sh ppl.sh >& ppl.sh.log &
-tail -f ppl.sh.log
-```
+nohup sh ppl.sh >& ppl.sh.log &
+tail -f ppl.sh.log
+```

 | Model | Generation speed (batch size 1, 16 vCPUs) | Memory Usage | Perplexity (Wikitext-2) | Perplexity Increase |
 |:-------:|:----------------------:|:------------:|:----------:|:----------------------:|
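Editor's note: the perplexity figures reported by `llama-perplexity` follow the standard definition used in this file, the exponential of the average negative log-likelihood per token. A small Python sketch with made-up token probabilities (illustrative only, not output from the models above):

```python
import math

def perplexity(token_logprobs: list) -> float:
    """exp of the mean negative log-likelihood: roughly the number of
    next-token options the model is choosing between (lower is better)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical natural-log probabilities for a 4-token sequence
logprobs = [math.log(p) for p in (0.5, 0.25, 0.5, 0.25)]
print(round(perplexity(logprobs), 3))  # 2.828, i.e. ~2.8 effective choices per token
```

This is why a small perplexity increase after quantization (as in the table above) indicates minimal quality loss.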
Lines changed: 24 additions & 25 deletions
@@ -1,17 +1,16 @@
 ---
-title: Review what you built
+title: Review your AFM-4.5B deployment on Axion
 weight: 10

-
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---

-## Wrap up your AFM-4.5B deployment
+## Review your AFM-4.5B deployment on Google Cloud Axion

-Congratulations! You have completed the process of deploying the Arcee [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) foundation model on Google Axion.
+Congratulations! You have successfully deployed the [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) foundation model on Google Cloud Axion Arm64.

-Here’s a summary of what you built and how you can take your knowledge forward.
+Here’s a summary of what you built and how to extend it.

 Using this Learning Path, you have:

@@ -34,33 +33,33 @@ The benchmarking results demonstrate the power of quantization and Arm-based com
 - **Cost optimization** – lower memory needs enable smaller, more affordable instances
 - **Quality preservation** – the quantized models maintain strong perplexity scores, showing minimal quality loss

-## The Google Axion advantage
+## Benefits of Google Cloud Axion Arm64

-Google Axion processors, built on the Arm Neoverse V2 architecture, provide:
+Google Cloud Axion processors, based on Arm Neoverse V2, provide:

-- Superior performance per watt compared to x86 alternatives
-- Cost savings of 20–40% for compute-intensive workloads
-- Optimized memory bandwidth and cache hierarchy for AI/ML workloads
-- Native Arm64 support for modern machine learning frameworks
+- Better performance per watt than x86 alternatives
+- 20–40% cost savings for compute-intensive workloads
+- Optimized memory bandwidth and cache hierarchy for ML tasks
+- Native Arm64 support for modern machine learning frameworks

-## Next steps for deploying AFM-4.5B on Arm
+## Next steps with AFM-4.5B on Axion

-Now that you have a fully functional AFM-4.5B deployment, here are some ways to extend your learning:
+Now that you have a working deployment, you can extend it further.

-**Production deployment**:
-- Set up auto-scaling groups for high availability
-- Implement load balancing for multiple model instances
-- Add monitoring and logging with CloudWatch
-- Secure your API endpoints with proper authentication
+**Production deployment**:
+- Add auto-scaling for high availability
+- Implement load balancing for multiple instances
+- Enable monitoring and logging with CloudWatch
+- Secure API endpoints with authentication

-**Application development**:
-- Build a web application using the `llama-server` API
-- Create a chatbot or virtual assistant
-- Develop content generation tools
-- Integrate with existing applications via REST APIs
+**Application development**:
+- Build a web app with the `llama-server` API
+- Create a chatbot or assistant
+- Develop content generation tools
+- Integrate AFM-4.5B into existing apps via REST APIs

-Together, Arcee AI’s foundation models, Llama.cpp’s efficient runtime, and Axion's compute capabilities give you everything you need to build scalable, production-grade AI applications.
+Together, Arcee AI’s foundation models, Llama.cpp’s efficient runtime, and Google Cloud Axion provide a scalable, cost-efficient platform for AI.

-From chatbots and content generation to research tools, this stack strikes a balance between performance, cost, and developer control.
+From chatbots and content generation to research tools, this stack delivers a balance of performance, cost, and developer control.

 For more information on Arcee AI, and how you can build high-quality, secure, and cost-efficient AI solutions, please visit [www.arcee.ai](https://www.arcee.ai).
