
Commit 3989113

Merge pull request #2042 from nobelchowdary/pytorch-llama

fixed the pytorch-llama learning path - torch, torchao and torchchat fixes

2 parents: 021aa6f + a460bde

File tree

  • content/learning-paths/servers-and-cloud-computing/pytorch-llama

1 file changed: +12 −21 lines


content/learning-paths/servers-and-cloud-computing/pytorch-llama/pytorch-llama.md

Lines changed: 12 additions & 21 deletions
@@ -7,7 +7,7 @@ layout: learningpathall
---

## Before you begin
-The instructions in this Learning Path are for any Arm server running Ubuntu 22.04 LTS. You need an Arm server instance with at least 16 cores and 64GB of RAM to run this example. Configure disk storage up to at least 50 GB. The instructions have been tested on an AWS Graviton4 r8g.4xlarge instance.
+The instructions in this Learning Path are for any Arm server running Ubuntu 24.04 LTS. You need an Arm server instance with at least 16 cores and 64GB of RAM to run this example. Configure disk storage up to at least 50 GB. The instructions have been tested on an AWS Graviton4 r8g.4xlarge instance.

## Overview
Arm CPUs are widely used in traditional ML and AI use cases. In this Learning Path, you learn how to run generative AI inference-based use cases like a LLM chatbot using PyTorch on Arm-based CPUs. PyTorch is a popular deep learning framework for AI applications.
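Since this change moves the tested base image from Ubuntu 22.04 to 24.04 LTS, a quick pre-flight check can confirm an instance matches the documented setup. A minimal sketch using standard Ubuntu tooling; the expected values come from the paragraph above:

```sh
# Confirm the OS release, CPU architecture, and core count up front
lsb_release -ds   # expect: Ubuntu 24.04.x LTS
uname -m          # expect: aarch64 on an AWS Graviton4 instance
nproc             # expect: 16 or more cores
```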
@@ -31,16 +31,17 @@ source torch_env/bin/activate
```

### Install PyTorch and optimized libraries
-Torchchat is a library developed by the PyTorch team that facilitates running large language models (LLMs) seamlessly on a variety of devices. TorchAO (Torch Architecture Optimization) is a PyTorch library designed for enhancing the performance of ML models through different quantization and sparsity methods.
+Torchchat is a library developed by the PyTorch team that facilitates running large language models (LLMs) seamlessly on a variety of devices. TorchAO (Torch Architecture Optimization) is a PyTorch library designed for enhancing the performance of ML models through different quantization and sparsity methods.

-Start by cloning the torchao and torchchat repositories and then applying the Arm specific patches:
+Start by installing pytorch and cloning the torchao and torchchat repositories:

```sh
+pip install torch
git clone --recursive https://github.com/pytorch/ao.git
cd ao
-git checkout 174e630af2be8cd18bc47c5e530765a82e97f45b
-wget https://raw.githubusercontent.com/ArmDeveloperEcosystem/PyTorch-arm-patches/main/0001-Feat-Add-support-for-kleidiai-quantization-schemes.patch
-git apply --whitespace=nowarn 0001-Feat-Add-support-for-kleidiai-quantization-schemes.patch
+git checkout e1cb44ab84eee0a3573bb161d65c18661dc4a307
+curl -L https://github.com/pytorch/ao/commit/738d7f2c5a48367822f2bf9d538160d19f02341e.patch | git apply
+python3 setup.py install
cd ../

git clone --recursive https://github.com/pytorch/torchchat.git
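With this change, PyTorch now comes from the release wheel on PyPI and torchao is built from a pinned commit with one upstream patch applied. A minimal sanity check before continuing with the torchchat setup, assuming the torch_env virtual environment shown in the hunk header is still active:

```sh
# Both imports should succeed before setting up torchchat
python3 -c "import torch; print(torch.__version__)"
python3 -c "import torchao; print('torchao imported OK')"
```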
@@ -50,18 +51,9 @@ wget https://raw.githubusercontent.com/ArmDeveloperEcosystem/PyTorch-arm-patches
wget https://raw.githubusercontent.com/ArmDeveloperEcosystem/PyTorch-arm-patches/main/0001-Feat-Enable-int4-quantized-models-to-work-with-pytor.patch
git apply 0001-Feat-Enable-int4-quantized-models-to-work-with-pytor.patch
git apply --whitespace=nowarn 0001-modified-generate.py-for-cli-and-browser.patch
+sed -i 's/"groupsize": 0/"groupsize": 32/' config/data/aarch64_cpu_channelwise.json
pip install -r requirements.txt
```
-{{% notice Note %}} You will need Python version 3.10 to apply these patches. This is the default version of Python installed on an Ubuntu 22.04 Linux machine. {{% /notice %}}
-
-You will now override the installed PyTorch version with a specific version of PyTorch required to take advantage of Arm KleidiAI optimizations:
-
-```
-wget https://github.com/ArmDeveloperEcosystem/PyTorch-arm-patches/raw/main/torch-2.5.0.dev20240828+cpu-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
-pip install --force-reinstall torch-2.5.0.dev20240828+cpu-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
-cd ..
-pip uninstall torchao && cd ao/ && rm -rf build && python setup.py install
-```

### Login to Hugging Face
You can now download the LLM.
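The sed line added in the hunk above replaces the channelwise setting ("groupsize": 0) with 32-element groups in the quantization config that the export step below consumes. A quick check that the edit landed, assuming the torchchat checkout is the working directory:

```sh
# The config should now contain "groupsize": 32
grep -n '"groupsize"' config/data/aarch64_cpu_channelwise.json
```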
@@ -73,7 +65,7 @@ pip install -U "huggingface_hub[cli]"

[Generate an Access Token](https://huggingface.co/settings/tokens) to authenticate your identity with Hugging Face Hub. A token with read-only access is sufficient.

-Log in to the Hugging Face repository and enter your Access Token key from Hugging face.
+Log in to the Hugging Face repository and enter your Access Token key from Hugging face.

```sh
huggingface-cli login
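huggingface-cli login prompts for the token interactively. For scripted setups the CLI can also take the token as a flag; a sketch, where <your-access-token> is a placeholder you replace with the token generated above:

```sh
# Non-interactive variant of the login step; the token value is a placeholder
huggingface-cli login --token "<your-access-token>"
```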
@@ -86,8 +78,7 @@ In this step, you will download the [Meta Llama3.1 8B Instruct model](https://hu


```sh
-cd ../torchchat
-python torchchat.py export llama3.1 --output-dso-path exportedModels/llama3.1.so --quantize config/data/aarch64_cpu_channelwise.json --device cpu --max-seq-length 1024
+python3 torchchat.py export llama3.1 --output-dso-path exportedModels/llama3.1.so --quantize config/data/aarch64_cpu_channelwise.json --device cpu --max-seq-length 1024
```
The output from this command should look like:

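The export command compiles the quantized model to a shared object under exportedModels/. A quick confirmation that the artifact exists and has a plausible size, using the output path from the command above:

```sh
# The exported, quantized model is an ordinary shared object on disk
ls -lh exportedModels/llama3.1.so
```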
@@ -108,7 +99,7 @@ You can now run the LLM on the Arm CPU on your server.
To run the model inference:

```sh
-LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc.so.4 TORCHINDUCTOR_CPP_WRAPPER=1 TORCHINDUCTOR_FREEZING=1 OMP_NUM_THREADS=16 python torchchat.py generate llama3.1 --dso-path exportedModels/llama3.1.so --device cpu --max-new-tokens 32 --chat
+LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc.so.4 TORCHINDUCTOR_CPP_WRAPPER=1 TORCHINDUCTOR_FREEZING=1 OMP_NUM_THREADS=16 python3 torchchat.py generate llama3.1 --dso-path exportedModels/llama3.1.so --device cpu --max-new-tokens 32 --chat
```
The output from running the inference will look like:

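The generate command preloads tcmalloc and turns on TorchInductor's C++ wrapper and freezing, with OMP_NUM_THREADS matched to the instance's 16 cores. If the preload path is missing, tcmalloc ships with gperftools; a hedged sketch, since the exact Ubuntu package name and .so version suffix can vary between releases:

```sh
# Install tcmalloc only if it is not already at the preload path
ls /usr/lib/aarch64-linux-gnu/libtcmalloc.so.4 2>/dev/null \
  || sudo apt-get install -y libgoogle-perftools-dev
```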
@@ -140,4 +131,4 @@ Bandwidth achieved: 254.17 GB/s
*** This first iteration will include cold start effects for dynamic import, hardware caches. ***
```

-You have successfully run the Llama3.1 8B Instruct Model on your Arm-based server. In the next section, you will walk through the steps to run the same chatbot in your browser.
+You have successfully run the Llama3.1 8B Instruct Model on your Arm-based server. In the next section, you will walk through the steps to run the same chatbot in your browser.
