
Commit 3989113

Merge pull request #2042 from nobelchowdary/pytorch-llama

fixed the pytorch-llama learning path - torch, torchao and torchchat fixes

2 parents: 021aa6f + a460bde

File tree

  • content/learning-paths/servers-and-cloud-computing/pytorch-llama

1 file changed: +12 −21 lines


content/learning-paths/servers-and-cloud-computing/pytorch-llama/pytorch-llama.md

Lines changed: 12 additions & 21 deletions
@@ -7,7 +7,7 @@ layout: learningpathall
---

## Before you begin
-The instructions in this Learning Path are for any Arm server running Ubuntu 22.04 LTS. You need an Arm server instance with at least 16 cores and 64GB of RAM to run this example. Configure disk storage up to at least 50 GB. The instructions have been tested on an AWS Graviton4 r8g.4xlarge instance.
+The instructions in this Learning Path are for any Arm server running Ubuntu 24.04 LTS. You need an Arm server instance with at least 16 cores and 64GB of RAM to run this example. Configure disk storage up to at least 50 GB. The instructions have been tested on an AWS Graviton4 r8g.4xlarge instance.

## Overview
Arm CPUs are widely used in traditional ML and AI use cases. In this Learning Path, you learn how to run generative AI inference-based use cases like a LLM chatbot using PyTorch on Arm-based CPUs. PyTorch is a popular deep learning framework for AI applications.
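Since this change moves the tested base image from Ubuntu 22.04 to 24.04 LTS, a quick pre-flight check can confirm an instance matches the documented setup. A minimal sketch using standard Ubuntu tooling; the expected values come from the paragraph above:

```sh
# Confirm the OS release, CPU architecture, and core count up front
lsb_release -ds   # expect: Ubuntu 24.04.x LTS
uname -m          # expect: aarch64 on an AWS Graviton4 instance
nproc             # expect: 16 or more cores
```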
@@ -31,16 +31,17 @@ source torch_env/bin/activate
```

### Install PyTorch and optimized libraries
-Torchchat is a library developed by the PyTorch team that facilitates running large language models (LLMs) seamlessly on a variety of devices. TorchAO (Torch Architecture Optimization) is a PyTorch library designed for enhancing the performance of ML models through different quantization and sparsity methods.
+Torchchat is a library developed by the PyTorch team that facilitates running large language models (LLMs) seamlessly on a variety of devices. TorchAO (Torch Architecture Optimization) is a PyTorch library designed for enhancing the performance of ML models through different quantization and sparsity methods.

-Start by cloning the torchao and torchchat repositories and then applying the Arm specific patches:
+Start by installing pytorch and cloning the torchao and torchchat repositories:

```sh
+pip install torch
git clone --recursive https://github.com/pytorch/ao.git
cd ao
-git checkout 174e630af2be8cd18bc47c5e530765a82e97f45b
-wget https://raw.githubusercontent.com/ArmDeveloperEcosystem/PyTorch-arm-patches/main/0001-Feat-Add-support-for-kleidiai-quantization-schemes.patch
-git apply --whitespace=nowarn 0001-Feat-Add-support-for-kleidiai-quantization-schemes.patch
+git checkout e1cb44ab84eee0a3573bb161d65c18661dc4a307
+curl -L https://github.com/pytorch/ao/commit/738d7f2c5a48367822f2bf9d538160d19f02341e.patch | git apply
+python3 setup.py install
cd ../

git clone --recursive https://github.com/pytorch/torchchat.git
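With this change, PyTorch now comes from the release wheel on PyPI and torchao is built from a pinned commit with one upstream patch applied. A minimal sanity check before continuing with the torchchat setup, assuming the torch_env virtual environment shown in the hunk header is still active:

```sh
# Both imports should succeed before setting up torchchat
python3 -c "import torch; print(torch.__version__)"
python3 -c "import torchao; print('torchao imported OK')"
```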
@@ -50,18 +51,9 @@ wget https://raw.githubusercontent.com/ArmDeveloperEcosystem/PyTorch-arm-patches
wget https://raw.githubusercontent.com/ArmDeveloperEcosystem/PyTorch-arm-patches/main/0001-Feat-Enable-int4-quantized-models-to-work-with-pytor.patch
git apply 0001-Feat-Enable-int4-quantized-models-to-work-with-pytor.patch
git apply --whitespace=nowarn 0001-modified-generate.py-for-cli-and-browser.patch
+sed -i 's/"groupsize": 0/"groupsize": 32/' config/data/aarch64_cpu_channelwise.json
pip install -r requirements.txt
```
-{{% notice Note %}} You will need Python version 3.10 to apply these patches. This is the default version of Python installed on an Ubuntu 22.04 Linux machine. {{% /notice %}}
-
-You will now override the installed PyTorch version with a specific version of PyTorch required to take advantage of Arm KleidiAI optimizations:
-
-```
-wget https://github.com/ArmDeveloperEcosystem/PyTorch-arm-patches/raw/main/torch-2.5.0.dev20240828+cpu-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
-pip install --force-reinstall torch-2.5.0.dev20240828+cpu-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
-cd ..
-pip uninstall torchao && cd ao/ && rm -rf build && python setup.py install
-```

### Login to Hugging Face
You can now download the LLM.
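The sed line added in the hunk above replaces the channelwise setting ("groupsize": 0) with 32-element groups in the quantization config that the export step below consumes. A quick check that the edit landed, assuming the torchchat checkout is the working directory:

```sh
# The config should now contain "groupsize": 32
grep -n '"groupsize"' config/data/aarch64_cpu_channelwise.json
```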
@@ -73,7 +65,7 @@ pip install -U "huggingface_hub[cli]"

[Generate an Access Token](https://huggingface.co/settings/tokens) to authenticate your identity with Hugging Face Hub. A token with read-only access is sufficient.

-Log in to the Hugging Face repository and enter your Access Token key from Hugging face.
+Log in to the Hugging Face repository and enter your Access Token key from Hugging face.

```sh
huggingface-cli login
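huggingface-cli login prompts for the token interactively. For scripted setups the CLI can also take the token as a flag; a sketch, where <your-access-token> is a placeholder you replace with the token generated above:

```sh
# Non-interactive variant of the login step; the token value is a placeholder
huggingface-cli login --token "<your-access-token>"
```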
@@ -86,8 +78,7 @@ In this step, you will download the [Meta Llama3.1 8B Instruct model](https://hu


```sh
-cd ../torchchat
-python torchchat.py export llama3.1 --output-dso-path exportedModels/llama3.1.so --quantize config/data/aarch64_cpu_channelwise.json --device cpu --max-seq-length 1024
+python3 torchchat.py export llama3.1 --output-dso-path exportedModels/llama3.1.so --quantize config/data/aarch64_cpu_channelwise.json --device cpu --max-seq-length 1024
```
The output from this command should look like:

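The export command compiles the quantized model to a shared object under exportedModels/. A quick confirmation that the artifact exists and has a plausible size, using the output path from the command above:

```sh
# The exported, quantized model is an ordinary shared object on disk
ls -lh exportedModels/llama3.1.so
```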
@@ -108,7 +99,7 @@ You can now run the LLM on the Arm CPU on your server.
To run the model inference:

```sh
-LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc.so.4 TORCHINDUCTOR_CPP_WRAPPER=1 TORCHINDUCTOR_FREEZING=1 OMP_NUM_THREADS=16 python torchchat.py generate llama3.1 --dso-path exportedModels/llama3.1.so --device cpu --max-new-tokens 32 --chat
+LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc.so.4 TORCHINDUCTOR_CPP_WRAPPER=1 TORCHINDUCTOR_FREEZING=1 OMP_NUM_THREADS=16 python3 torchchat.py generate llama3.1 --dso-path exportedModels/llama3.1.so --device cpu --max-new-tokens 32 --chat
```
The output from running the inference will look like:

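The generate command preloads tcmalloc and turns on TorchInductor's C++ wrapper and freezing, with OMP_NUM_THREADS matched to the instance's 16 cores. If the preload path is missing, tcmalloc ships with gperftools; a hedged sketch, since the exact Ubuntu package name and .so version suffix can vary between releases:

```sh
# Install tcmalloc only if it is not already at the preload path
ls /usr/lib/aarch64-linux-gnu/libtcmalloc.so.4 2>/dev/null \
  || sudo apt-get install -y libgoogle-perftools-dev
```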
@@ -140,4 +131,4 @@ Bandwidth achieved: 254.17 GB/s
*** This first iteration will include cold start effects for dynamic import, hardware caches. ***
```

-You have successfully run the Llama3.1 8B Instruct Model on your Arm-based server. In the next section, you will walk through the steps to run the same chatbot in your browser.
+You have successfully run the Llama3.1 8B Instruct Model on your Arm-based server. In the next section, you will walk through the steps to run the same chatbot in your browser.
