
Commit 360e37f

committed: First tech review of Arcee Learning Path
1 parent e332fe3 commit 360e37f

File tree: 9 files changed, +64 −68 lines


content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/01_launching_a_graviton4_instance.md

Lines changed: 12 additions & 11 deletions
@@ -6,17 +6,13 @@ weight: 2
 layout: learningpathall
 ---
 
-## System Requirements
+## Requirements
 
 - An AWS account
 
-- Quota for c8g instances in your preferred region
+- Access to launch an EC2 instance of type `c8g.4xlarge` (or larger) with at least 128 GB of storage
 
-- A Linux or MacOS host
-
-- A c8g instance (4xlarge or larger)
-
-- At least 128GB of storage
+For more information about creating an EC2 instance on AWS, refer to [Getting Started with AWS](/learning-paths/servers-and-cloud-computing/csp/aws/).
 
 ## AWS Console Steps

@@ -49,12 +45,14 @@ Follow these steps to launch your EC2 instance using the AWS Management Console:
 3. **Secure the Key File**
 
 - Move the downloaded `.pem` file to the SSH configuration directory
+
 ```bash
 mkdir -p ~/.ssh
 mv arcee-graviton4-key.pem ~/.ssh
 ```
 
-- Set proper permissions (on Mac/Linux):
+- Set proper permissions on macOS or Linux:
+
 ```bash
 chmod 400 ~/.ssh/arcee-graviton4-key.pem
 ```
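The key-file steps above lead naturally to connecting over SSH. The sketch below assumes the default `ubuntu` login for Ubuntu AMIs and uses a placeholder hostname, since the connection step falls outside this hunk:

```shell
# Secure the downloaded key pair, then connect to the instance.
# The ubuntu login and placeholder hostname are assumptions; substitute
# your instance's public DNS name from the EC2 console.
mkdir -p ~/.ssh
mv arcee-graviton4-key.pem ~/.ssh
chmod 400 ~/.ssh/arcee-graviton4-key.pem
ssh -i ~/.ssh/arcee-graviton4-key.pem ubuntu@<public-dns-name>
```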
@@ -105,9 +103,12 @@ Follow these steps to launch your EC2 instance using the AWS Management Console:
 - In the dropdown list, select "My IP".
 
-Note 1: you will only be able to connect to the instance from your current host, which is the safest setting. We don't recommend selecting "Anywhere", which would allow anyone on the Internet to attempt to connect. Use at your own risk.
-
-Note 2: although this demonstration only requires SSH access, feel free to use one of your existing security groups as long as it allows SSH traffic.
+{{% notice Notes %}}
+You will only be able to connect to the instance from your current host, which is the safest setting. Selecting "Anywhere" allows anyone on the Internet to attempt to connect; use at your own risk.
+
+Although this demonstration only requires SSH access, you can use one of your existing security groups as long as it allows SSH traffic.
+{{% /notice %}}
 
 5. **Configure Storage**
@@ -161,7 +162,7 @@ Follow these steps to launch your EC2 instance using the AWS Management Console:
 - **AMI Selection**: The Ubuntu 24.04 LTS AMI must be ARM64 compatible for Graviton processors
 
-- **Security**: please think twice about allowing SSH from anywhere (0.0.0.0/0). We strongly recommend restricting access to your IP address
+- **Security**: Think twice about allowing SSH from anywhere (0.0.0.0/0). It is strongly recommended to restrict access to your IP address.
 
 - **Storage**: The 128GB EBS volume is sufficient for the Arcee model and dependencies

content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/02_setting_up_the_instance.md

Lines changed: 4 additions & 4 deletions
@@ -6,7 +6,7 @@ weight: 3
 layout: learningpathall
 ---
 
-In this step, we'll set up the Graviton4 instance with all the necessary tools and dependencies required to build and run the Arcee Foundation Model. This includes installing the build tools and Python environment.
+In this step, you'll set up the Graviton4 instance with all the tools and dependencies required to build and run the Arcee Foundation Model. This includes installing the build tools and the Python environment.
 
 ## Step 1: Update Package List

@@ -29,7 +29,7 @@ sudo apt-get install cmake gcc g++ git python3 python3-pip python3-virtualenv li
 
 This command installs all the essential development tools and dependencies:
 
-- **cmake**: Cross-platform build system generator that we'll use to compile Llama.cpp
+- **cmake**: Cross-platform build system generator used to compile Llama.cpp
 - **gcc & g++**: GNU C and C++ compilers for building native code
 - **git**: Version control system for cloning repositories
 - **python3**: Python interpreter for running Python-based tools and scripts
@@ -39,9 +39,9 @@ This command installs all the essential development tools and dependencies:
 
 The `-y` flag automatically answers "yes" to prompts, making the installation non-interactive.
 
-## What's Ready Now
+## What's Ready Now?
 
-After completing these steps, your Graviton4 instance will have:
+After completing these steps, your Graviton4 instance has:
 
 - A complete C/C++ development environment for building Llama.cpp
 - Python 3 with pip for managing Python packages
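The package setup described in this file can be sketched as one sequence. The install line below is abridged to the package names fully visible in this diff (the hunk context truncates the full list); see the Learning Path itself for the complete command:

```shell
# Refresh the package index, then install the build toolchain and the
# Python environment non-interactively (-y answers "yes" to prompts).
# The package list here is abridged; the diff truncates the full list.
sudo apt-get update
sudo apt-get install -y cmake gcc g++ git python3 python3-pip python3-virtualenv
```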

content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/03_building_llama_cpp.md

Lines changed: 8 additions & 8 deletions
@@ -6,11 +6,9 @@ weight: 4
 layout: learningpathall
 ---
 
-In this step, we'll build Llama.cpp from source. Llama.cpp is a high-performance C++ implementation of the LLaMA model that's optimized for inference on various hardware platforms, including ARM-based processors like Graviton4.
+In this step, you'll build Llama.cpp from source. Llama.cpp is a high-performance C++ implementation of the LLaMA model that's optimized for inference on various hardware platforms, including Arm-based processors like Graviton4.
 
-Even though AFM-4.5B has a custom model architecture, we're able to use the vanilla version of llama.cpp as the Arcee AI team has contributed the appropriate modeling code.
-
-Here are all the steps.
+Even though AFM-4.5B has a custom model architecture, you can use the vanilla version of Llama.cpp because the Arcee AI team has contributed the appropriate modeling code.
 
 ## Step 1: Clone the Repository

@@ -26,7 +24,7 @@ This command clones the Llama.cpp repository from GitHub to your local machine.
 cd llama.cpp
 ```
 
-Change into the llama.cpp directory where we'll perform the build process. This directory contains the CMakeLists.txt file and source code structure.
+Change into the llama.cpp directory to run the build process. This directory contains the `CMakeLists.txt` file and the source code.
 
 ## Step 3: Configure the Build with CMake

@@ -35,13 +33,15 @@ cmake -B .
 ```
 
 This command uses CMake to configure the build system:
+
 - `-B .` specifies that the build files should be generated in the current directory
 - CMake will detect your system's compiler, libraries, and hardware capabilities
 - It will generate the appropriate build files (Makefiles on Linux) based on your system configuration
 
-Note: The cmake output should include the information below, indicating that the build process will leverage the Neoverse V2 architecture's specialized instruction sets designed for AI/ML workloads. These optimizations are crucial for achieving optimal performance on Graviton4:
-
-```bash
+The CMake output should include the information below, indicating that the build process will leverage the Neoverse V2 architecture's specialized instruction sets designed for AI/ML workloads. These optimizations are crucial for achieving optimal performance on Graviton4:
+
+```output
 -- ARM feature DOTPROD enabled
 -- ARM feature SVE enabled
 -- ARM feature MATMUL_INT8 enabled
@@ -69,7 +69,7 @@ This command compiles the Llama.cpp project:
 
 The build process will compile the C++ source code into executable binaries optimized for your ARM64 architecture. This should only take a minute.
 
-## What Gets Built
+## What is built?
 
 After successful compilation, you'll have several key command-line executables in the `bin` directory:
 - `llama-cli` - The main inference executable for running LLaMA models
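Taken together, the clone, configure, and build steps in this file can be sketched as follows. The parallel Release build flags are an assumption, since the exact build command falls outside these hunks:

```shell
# Clone Llama.cpp, configure an in-source build with CMake, and compile.
# The --config Release and -j flags are assumptions; adjust to match the
# exact command shown in the Learning Path.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B .
cmake --build . --config Release -j"$(nproc)"
```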

content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/04_install_python_dependencies_for_llama_cpp.md

Lines changed: 5 additions & 7 deletions
@@ -6,9 +6,7 @@ weight: 5
 layout: learningpathall
 ---
 
-In this step, we'll set up a Python virtual environment and install the required dependencies for working with Llama.cpp. This ensures we have a clean, isolated Python environment with all the necessary packages for model optimization.
-
-Here are all the steps.
+In this step, you'll set up a Python virtual environment and install the required dependencies for working with Llama.cpp. This ensures you have a clean, isolated Python environment with all the necessary packages for model optimization.
 
 ## Step 1: Create a Python Virtual Environment

@@ -30,14 +28,14 @@ source env-llama-cpp/bin/activate
 
 This command activates the virtual environment:
 - The `source` command executes the activation script, which modifies your current shell environment
-- Depending on you sheel, your command prompt may change to show `(env-llama-cpp)` at the beginning, indicating the active environment. We will reflect this in the following commands.
+- Depending on your shell, your command prompt may change to show `(env-llama-cpp)` at the beginning, indicating the active environment.
 - All subsequent `pip` commands will install packages into this isolated environment
 - The `PATH` environment variable is updated to prioritize the virtual environment's Python interpreter
 
 ## Step 3: Upgrade pip to the Latest Version
 
 ```bash
-(env-llama-cpp) pip install --upgrade pip
+pip install --upgrade pip
 ```
 
 This command ensures you have the latest version of pip:
@@ -49,7 +47,7 @@ This command ensures you have the latest version of pip:
 ## Step 4: Install Project Dependencies
 
 ```bash
-(env-llama-cpp) pip install -r requirements.txt
+pip install -r requirements.txt
 ```
 
 This command installs all the Python packages specified in the requirements.txt file:
@@ -58,7 +56,7 @@ This command installs all the Python packages specified in the requirements.txt
 - This ensures everyone working on the project uses the same package versions
 - The installation will include packages needed for model loading, inference, and any Python bindings for Llama.cpp
 
-## What Gets Installed
+## What is installed?
 
 After successful installation, your virtual environment will contain:
 - **NumPy**: For numerical computations and array operations
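The virtual-environment workflow in this file can be sketched end to end. The `virtualenv env-llama-cpp` creation command is an assumption, since only the activation step appears in these hunks:

```shell
# Create and activate an isolated Python environment, then install the
# project dependencies. The creation command is an assumption based on
# the environment name used elsewhere in this file.
virtualenv env-llama-cpp
source env-llama-cpp/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```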

content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/05_downloading_and_optimizing_afm45b.md

Lines changed: 9 additions & 9 deletions
@@ -6,24 +6,24 @@ weight: 6
 layout: learningpathall
 ---
 
-In this step, we'll download the AFM-4.5B model from Hugging Face, convert it to the GGUF format for use with Llama.cpp, and create quantized versions to optimize memory usage and inference speed.
+In this step, you'll download the AFM-4.5B model from Hugging Face, convert it to the GGUF format for use with Llama.cpp, and create quantized versions to optimize memory usage and inference speed.
 
 The first release of the [Arcee Foundation Model](https://www.arcee.ai/blog/announcing-the-arcee-foundation-model-family) family, [AFM-4.5B](https://www.arcee.ai/blog/deep-dive-afm-4-5b-the-first-arcee-foundational-model) is a 4.5-billion-parameter frontier model that delivers excellent accuracy, strict compliance, and very high cost-efficiency. It was trained on almost 7 trillion tokens of clean, rigorously filtered data, and has been tested across a wide range of languages, including Arabic, English, French, German, Hindi, Italian, Korean, Mandarin, Portuguese, Russian, and Spanish.
 
-Here are all the steps to download and optimize the model for AWS Graviton4. Make sure to run them in the virtual environment you created at the previous step.
+Here are the steps to download and optimize the model for AWS Graviton4. Make sure to run them in the virtual environment you created in the previous step.
 
 ## Step 1: Install the Hugging Face libraries
 
 ```bash
-(env-llama-cpp) pip install huggingface_hub hf_xet
+pip install huggingface_hub hf_xet
 ```
 
-This command installs the Hugging Face Hub Python library, which provides tools for downloading models and datasets from the Hugging Face platform. The library includes the `huggingface-cli` command-line interface that we'll use to download the AFM-4.5B model. The `hf_xet` library provides additional functionality for efficient data transfer and caching when downloading large models from Hugging Face Hub.
+This command installs the Hugging Face Hub Python library, which provides tools for downloading models and datasets from the Hugging Face platform. The library includes the `huggingface-cli` command-line interface that you can use to download the AFM-4.5B model.
 
 ## Step 2: Download the AFM-4.5B Model
 
 ```bash
-(env-llama-cpp) huggingface-cli download arcee-ai/afm-4.5B --local-dir models/afm-4-5b
+huggingface-cli download arcee-ai/afm-4.5B --local-dir models/afm-4-5b
 ```
 
 This command downloads the AFM-4.5B model from the Hugging Face Hub:
@@ -35,8 +35,8 @@ This command downloads the AFM-4.5B model from the Hugging Face Hub:
 ## Step 3: Convert to GGUF Format
 
 ```bash
-(env-llama-cpp) python3 convert_hf_to_gguf.py models/afm-4-5b
-(env-llama-cpp) deactivate
+python3 convert_hf_to_gguf.py models/afm-4-5b
+deactivate
 ```
 
 The first command converts the downloaded Hugging Face model to the GGUF (GGML U
@@ -46,7 +46,7 @@ The first command converts the downloaded Hugging Face model to the GGUF (GGML U
 - It outputs a single `afm-4-5B-F16.gguf` ~15GB file in the `models/afm-4-5b/` directory
 - GGUF is the native format used by Llama.cpp and provides efficient loading and inference
 
-Then, we deactivate the Python virtual environment as future commands won't require it.
+Next, deactivate the Python virtual environment, as future commands won't require it.
 
 ## Step 4: Create Q4_0 Quantized Version

@@ -81,7 +81,7 @@ This command creates an 8-bit quantized version of the model:
 
 **ARM Optimization**: Similar to Q4_0, ARM has contributed optimized kernels for Q8_0 quantization that take advantage of Neoverse V2 instruction sets. These optimizations provide excellent performance for 8-bit operations while maintaining higher accuracy compared to 4-bit quantization.
 
-## What You'll Have
+## What is available now?
 
 After completing these steps, you'll have three versions of the AFM-4.5B model:
 - `afm-4-5B-F16.gguf` - The original full-precision model (~15GB)
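The quantization commands themselves fall outside these hunks. Assuming they follow the standard `llama-quantize` usage from Llama.cpp, the two steps can be sketched as:

```shell
# Create 4-bit and 8-bit quantized versions of the F16 GGUF model.
# These command lines are an assumption based on the standard
# llama-quantize invocation; the exact commands are not shown in this diff.
bin/llama-quantize models/afm-4-5b/afm-4-5B-F16.gguf models/afm-4-5b/afm-4-5B-Q4_0.gguf Q4_0
bin/llama-quantize models/afm-4-5b/afm-4-5B-F16.gguf models/afm-4-5b/afm-4-5B-Q8_0.gguf Q8_0
```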

content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/06_running_inference.md

Lines changed: 7 additions & 5 deletions
@@ -6,7 +6,7 @@ weight: 7
 layout: learningpathall
 ---
 
-Now that we have our AFM-4.5B models in GGUF format, we can run inference using various Llama.cpp tools. In this step, we'll explore different ways to interact with the model for text generation, benchmarking, and evaluation.
+Now that you have the AFM-4.5B models in GGUF format, you can run inference using various Llama.cpp tools. In this step, you'll explore different ways to interact with the model for text generation, benchmarking, and evaluation.
 
 ## Using llama-cli for Interactive Text Generation

@@ -19,6 +19,7 @@ bin/llama-cli -m models/afm-4-5b/afm-4-5B-Q8_0.gguf -n 256 --color
 ```
 
 This command starts an interactive session with the model:
+
 - `-m models/afm-4-5b/afm-4-5B-Q8_0.gguf` specifies the model file to load
 - `-n 256` sets the maximum number of tokens to generate per response
 - The tool will prompt you to enter text, and the model will generate a response
@@ -29,7 +30,7 @@ In this example, `llama-cli` uses 16 vCPUs. You can try different values with `-
 
 Once you start the interactive session, you can have conversations like this:
 
-```
+```console
 > Give me a brief explanation of the attention mechanism in transformer models.
 In transformer models, the attention mechanism allows the model to focus on specific parts of the input sequence when computing the output. Here's a simplified explanation:

@@ -50,6 +51,7 @@ The attention mechanism allows transformer models to selectively focus on specif
 To exit the interactive session, type `Ctrl+C` or `/bye`.
 
 This will display performance statistics:
+
 ```bash
 llama_perf_sampler_print: sampling time = 26.66 ms / 356 runs ( 0.07 ms per token, 13352.84 tokens per second)
 llama_perf_context_print: load time = 782.72 ms
@@ -62,7 +64,7 @@ In this example, our 8-bit model running on 16 threads generated 355 tokens, at
 
 ### Example Non-Interactive Session
 
-Now, let's try the 4-bit model in non-interactive mode:
+Now, try the 4-bit model in non-interactive mode:
 
 ```bash
 bin/llama-cli -m models/afm-4-5b/afm-4-5B-Q4_0.gguf -n 256 --color -no-cnv -p "Give me a brief explanation of the attention mechanism in transformer models."
@@ -116,7 +118,7 @@ curl -X POST http://localhost:8080/v1/chat/completions \
 }'
 ```
 
-You should get an answer similar to this one:
+You will see an answer similar to this one:
 
 ```json
 {
@@ -153,4 +155,4 @@ You should get an answer similar to this one:
 }
 ```
 
-You could also interact with the server using Python with the [OpenAI client library](https://github.com/openai/openai-python), enabling streaming responses, and other features.
+You can also interact with the server using the Python [OpenAI client library](https://github.com/openai/openai-python), enabling streaming responses and other features.
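The server workflow described in this file can be sketched as follows. The `llama-server` launch flags below are assumptions, since the exact launch command falls outside these hunks:

```shell
# Launch the OpenAI-compatible HTTP server with an AFM-4.5B GGUF model,
# then query it from a second terminal. The launch flags are assumptions;
# the exact command is not shown in this diff.
bin/llama-server -m models/afm-4-5b/afm-4-5B-Q8_0.gguf --port 8080
# In another terminal:
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'
```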

content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/07_evaluating_the_quantized_models.md

Lines changed: 5 additions & 7 deletions
@@ -63,7 +63,7 @@ The results should look like this:
 
 It's pretty amazing to see that with only 4 threads, the 4-bit model can still generate at the very comfortable speed of 15 tokens per second. We could definitely run several copies of the model on the same instance to serve concurrent users or applications.
 
-You could also try [`llama-batched-bench`](https://github.com/ggml-org/llama.cpp/tree/master/tools/batched-bench) to benchmark performance on batch sizes larger than 1.
+You can also try [`llama-batched-bench`](https://github.com/ggml-org/llama.cpp/tree/master/tools/batched-bench) to benchmark performance on batch sizes larger than 1.
 
 ## Using llama-perplexity for Model Evaluation
@@ -74,15 +74,16 @@ The `llama-perplexity` tool evaluates the model's quality on text datasets by ca
 
 ### Downloading a Test Dataset
 
-First, let's download the Wikitest-2 test dataset.
+First, download the WikiText-2 test dataset.
 
 ```bash
 sh scripts/get-wikitext-2.sh
 ```
 
 ### Running Perplexity Evaluation
 
-Now, let's measure perplexity on the test dataset
+Next, measure perplexity on the test dataset.
+
 ```bash
 bin/llama-perplexity -m models/afm-4-5b/afm-4-5B-F16.gguf -f wikitext-2-raw/wiki.test.raw
 bin/llama-perplexity -m models/afm-4-5b/afm-4-5B-Q8_0.gguf -f wikitext-2-raw/wiki.test.raw
@@ -106,16 +107,13 @@ bin/llama-perplexity -m models/afm-4-5b/afm-4-5B-Q4_0.gguf -f wikitext-2-raw/wik
 tail -f ppl.sh.log
 ```
 
-
 Here are the full results.
 
-
 | Model | Generation Speed (tokens/s, 16 vCPUs) | Memory Usage | Perplexity (WikiText-2) |
 |:-------:|:----------------------:|:------------:|:----------:|
 | F16 | ~15–16 | ~15 GB | TODO |
 | Q8_0 | ~25 | ~8 GB | TODO |
 | Q4_0 | ~40 | ~4.4 GB | TODO |
 
-
-*Please remember to terminate the instance in the AWS console when you're done testing*
+When you have finished benchmarking and evaluation, terminate your AWS EC2 instance in the AWS Management Console to avoid incurring charges for unused compute resources.