Commit b08a6ad (parent 5d9b2f5)

Create reusable assets to train Llama 2 on OKE.

This tutorial demonstrates how to train an LLM on OKE using bare metal (BM) shapes with optimal performance via RDMA. A future improvement would be instructions for training on smaller shapes.

12 files changed, +1268 -0 lines changed
Copyright (c) 2025 Oracle and/or its affiliates.

The Universal Permissive License (UPL), Version 1.0

Subject to the condition set forth below, permission is hereby granted to any
person obtaining a copy of this software, associated documentation and/or data
(collectively the "Software"), free of charge and under any and all copyright
rights in the Software, and any and all patent rights owned or freely
licensable by each licensor hereunder covering either (i) the unmodified
Software as contributed to or provided by such licensor, or (ii) the Larger
Works (as defined below), to deal in both

(a) the Software, and
(b) any piece of software and/or hardware listed in the lrgrwrks.txt file if
one is included with the Software (each a "Larger Work" to which the Software
is contributed by such licensors),

without restriction, including without limitation the rights to copy, create
derivative works of, display, perform, and distribute the Software and make,
use, sell, offer for sale, import, export, have made, and have sold the
Software and the Larger Work(s), and to sublicense the foregoing rights on
either these or other terms.

This license is subject to the following condition:
The above copyright notice and either this complete permission notice or at
a minimum a reference to the UPL must be included in all copies or
substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# Training LLMs with NVIDIA NeMo using Oracle Container Engine for Kubernetes

This repository demonstrates how to train LLMs using
[NVIDIA NeMo](https://www.nvidia.com/en-gb/ai-data-science/products/nemo/)
on the Oracle Container Engine for Kubernetes (OKE) using
[NVIDIA Megatron](https://developer.nvidia.com/megatron-core).

Reference results from NVIDIA for training Llama 2 can be found in the
[NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dgxc-benchmarking/resources/llama2-dgxc-benchmarking).

Reviewed: dd.mm.yyyy

# When to use this asset?

* If you want to get started with training LLMs such as Llama 2 on Kubernetes using OCI.

# How to use this asset?

## Prerequisites

* You have access to an Oracle Cloud tenancy.
* You have access to shapes with NVIDIA GPUs, such as the A100.
* You have a HuggingFace account and access to `meta-llama/Llama-2-70b-hf`.

This guide is loosely based on the
[NVIDIA NeMo Framework Launcher guide for Kubernetes](https://docs.nvidia.com/nemo-framework/user-guide/24.07/playbooks/kubernetes.html).

## Infrastructure Setup

1. Create an OKE cluster according
[to the instructions](https://github.com/oracle-quickstart/oci-hpc-oke/tree/main#instructions-for-deploying-an-oke-cluster-with-gpus-and-rdma-connectivity),
importing one of the images and creating a GPU partition with BM.GPU.H100.8 nodes.

The configuration here assumes a minimum of 16 BM.GPU.H100.8 nodes.

- Ensure that the following setting is selected under the "OKE Cluster" section:

> Disable OKE GPU device plugin

as this tutorial will install the GPU operator later.
2. Create a new File System for NFS, and modify the [persistent volume configuration in `pv.yaml`](./files/pv.yaml) to match.
Optimally, this will utilize High Performance Mount Targets (HPMT) as described in the following two whitepapers:
* [Scale Out OCI File Storage Performance for AI/ML and Data-Intensive Workloads](https://docs.oracle.com/en-us/iaas/Content/Resources/Assets/whitepapers/scale-out-oci-file-storage-performance-for-data-intensive-workloads.pdf)
* [File Storage Performance Guide](https://docs.oracle.com/en-us/iaas/Content/Resources/Assets/whitepapers/file-storage-performance-guide.pdf)
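For orientation, an NFS-backed persistent volume configuration typically pairs a `PersistentVolume` with a matching claim along the lines of the sketch below; the server IP, export path, capacity, and resource names are placeholders for illustration, not the values shipped in this repository's `pv.yaml`:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-data
spec:
  capacity:
    storage: 500Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    # Placeholders: use the mount target IP and export path of your File System
    server: 10.0.2.5
    path: /data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-data
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  volumeName: nfs-data
  resources:
    requests:
      storage: 500Gi
```

Using `ReadWriteMany` allows all worker pods to mount the same file system concurrently, which the MPI-style jobs below rely on.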
3. Install Helm, the NVIDIA GPU Operator, and the Volcano scheduler according to the
[NVIDIA NeMo Framework Launcher guide for Kubernetes](https://docs.nvidia.com/nemo-framework/user-guide/24.07/playbooks/kubernetes.html).
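As a rough sketch, the installation from that guide boils down to commands along these lines; the namespaces are assumptions, and the NVIDIA guide may pin specific chart versions that you should prefer:

```sh
# Add the NVIDIA and Volcano Helm repositories
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm repo update

# Install the GPU operator into its own namespace
helm install --wait gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace

# Install the Volcano batch scheduler
helm install volcano volcano-sh/volcano \
  --namespace volcano-system --create-namespace
```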
4. Apply the persistent volume configuration and MPI parameter configuration map:
```sh
kubectl apply -f mpi.yaml
kubectl apply -f pv.yaml
```
5. Mount the file system on the Kubernetes operator node. In the following, the mount location is assumed to be `/mnt/data/`.
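Assuming a mount target at `10.0.2.5` with export path `/data` (both placeholders for your File System's values), the mount step might look like:

```sh
sudo mkdir -p /mnt/data
# Placeholders: substitute the mount target IP and export path of your File System
sudo mount -t nfs 10.0.2.5:/data /mnt/data
```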
6. Copy the node sorting script and LLM configuration into the file system:
```sh
cp -R config utils /mnt/data
```
## Data Preparation and Training

1. Download the tokenizer model from HuggingFace:
```sh
mkdir -p /mnt/data/tokenizer
huggingface-cli login
huggingface-cli download meta-llama/Llama-2-70b-hf tokenizer.model --local-dir /mnt/data/tokenizer
huggingface-cli download meta-llama/Llama-2-70b-hf config.json --local-dir /mnt/data/tokenizer
```
2. Apply the preprocessing job that will download and tokenize parts of the Pile dataset:
```sh
kubectl apply -f preprocessing.yaml
```

The progress can then be monitored with:
```sh
kubectl logs -f nemo-megatron-preprocessing-mpimaster-0
```
3. Following successful preprocessing, the training can be started with:
```sh
kubectl apply -f training_70b.yaml
```

The progress can then be monitored with:
```sh
kubectl logs -f nemo-megatron-training-mpimaster-0
```
4. Calculate training throughput. For this, the following data is required from the training output:
```
[NeMo I 2025-03-10 16:24:43 perf_metrics_utils:42] train_step_timing in s: [7.13, 7.12, 7.12, 7.13, 7.13, 7.13, 7.12, 7.13, 7.14, 7.13, 7.14, 7.26, 7.13, 7.13, 7.13, 7.13, 7.15, 7.14, 7.14, 7.13, 7.14, 7.14, 7.14, 7.14, 7.13, 7.14, 7.14, 7.14, 7.14, 7.14]
```
This log can be saved into a file with:
```sh
kubectl logs nemo-megatron-training-mpimaster-0 > training.log
```
and the performance analyzed with:
```sh
python3 utils/performance.py training.log
```
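As a sketch of the kind of quantity such an analysis yields (the formula and the batch/sequence parameters below are illustrative assumptions, not necessarily what `utils/performance.py` implements): tokens per second per GPU is the tokens processed per step divided by step time and GPU count.

```python
# Illustrative throughput calculation; batch and sequence parameters are assumptions.
step_times = [7.13, 7.12, 7.12, 7.13, 7.14]  # seconds, taken from train_step_timing

global_batch_size = 128   # sequences per training step (assumption)
sequence_length = 4096    # tokens per sequence (assumption)
num_gpus = 16 * 8         # 16 BM.GPU.H100.8 nodes with 8 GPUs each

mean_step = sum(step_times) / len(step_times)
tokens_per_second = global_batch_size * sequence_length / mean_step
tokens_per_second_per_gpu = tokens_per_second / num_gpus

print(round(tokens_per_second_per_gpu, 1))
```

Outliers such as the occasional slower step (e.g. the 7.26 s entry above) are usually averaged out or trimmed before reporting.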
## Changing the Configuration

* **Increase the training file count**

To increase the amount of training data, edit the file count by modifying all
occurrences of the `file_numbers=0-0` range in
[`preprocessing.yaml`](./files/preprocessing.yaml) and re-run the
preprocessing step.
E.g., change this setting to `file_numbers=0-9` to process 10 files.

Then modify the file list in the training configuration, e.g.,
[`config_7b.yaml`](./files/config/config_7b.yaml),
to match the file count:
```yaml
data_prefix:
- 1
- /mnt/data/pile/my-gpt3_00_text_document
...
- 1
- /mnt/data/pile/my-gpt3_09_text_document
```
* **Vary the node count for training**

Changing the node count will require modifications to both the training
configuration and the Volcano job. E.g., to double the node count for the 7B
example:

* Modify [`training_7b.yaml`](./files/training_7b.yaml) to have twice the
replica count for the `mpiworker` definition.
* Double the `num_nodes` and `global_batch_size` keys in
[`config_7b.yaml`](./files/config/config_7b.yaml). In the optimal case,
this should give constant performance in terms of token throughput per
second per GPU.
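The weak-scaling intent behind doubling both keys can be sanity-checked with a little arithmetic (the numbers are illustrative assumptions, not measurements): if `global_batch_size` grows in proportion to `num_nodes`, the tokens processed per step per GPU stay constant, so per-GPU throughput should too.

```python
# Illustrative weak-scaling check; all numbers are assumptions, not measurements.
sequence_length = 4096  # tokens per sequence

def tokens_per_step_per_gpu(num_nodes, global_batch_size, gpus_per_node=8):
    """Tokens each GPU processes per training step."""
    return global_batch_size * sequence_length / (num_nodes * gpus_per_node)

base = tokens_per_step_per_gpu(num_nodes=2, global_batch_size=128)
doubled = tokens_per_step_per_gpu(num_nodes=4, global_batch_size=256)

# Doubling nodes and batch size together keeps per-GPU work constant.
print(base, doubled)
```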
## Potential Issues

* **PyTorch can't resolve hostnames via c10d**

If the rendezvous backend for PyTorch fails to connect to an OCI-style
hostname of the Kubernetes cluster, one can work around this resolution
failure by augmenting `/etc/hosts` of every pod.

For convenience, this is facilitated by extending `mpi.yaml` via
```sh
./utils/host_list.sh >> mpi.yaml
kubectl apply -f mpi.yaml
```
and afterwards restarting the training job.
# Acknowledgments

- **Author** - Matthias Wolf (GPU Solution Specialist)

# License

Copyright (c) 2025 Oracle and/or its affiliates.

Licensed under the Universal Permissive License (UPL), Version 1.0.

See [LICENSE](https://github.com/oracle-devrel/technology-engineering/blob/main/LICENSE) for more details.
