Commit 026b217

Reorganize as a Helm chart.
This should make it a lot simpler to get started.
1 parent: 6d19e72

File tree: 14 files changed (+436, -425 lines)

cloud-infrastructure/ai-infra-gpu/ai-infrastructure/nemo-megatron-training-oke/README.md

Lines changed: 15 additions & 53 deletions
@@ -48,47 +48,45 @@ Data-Intensive Workloads](https://docs.oracle.com/en-us/iaas/Content/Resources/A
 3. Install Helm, the NVIDIA GPU Operator, and the Volcano scheduler according to
 [NVIDIA NeMo Framework Launcher guide for Kubernetes](https://docs.nvidia.com/nemo-framework/user-guide/24.07/playbooks/kubernetes.html).
 
-4. Apply the persistenv volume configuration and MPI parameter configuration map:
+4. Copy the [files in this repository](./files) to the Kubernetes operator node.
+You can download them from this repository via:
 ```sh
-kubectl apply -f mpi.yaml
-kucectl apply -f pv.yaml
+BRANCH=main
+curl -L https://github.com/oracle-devrel/technology-engineering/archive/refs/heads/${BRANCH}.tar.gz|tar xzf - --strip-components=6 technology-engineering-${BRANCH}/cloud-infrastructure/ai-infra-gpu/ai-infrastructure/nemo-megatron-training-oke/files
 ```
+
+Then modify the values in [`training/values.yaml`](./files/training/values.yaml) to match the storage server and export path.
 
 5. Mount the file system on the Kubernetes operator node. In the following, the mount location is assumed to be `/mnt/data/`.
 
-6. Copy the node sorting script and LLM configuration into the file system:
-```sh
-cp -R config utils /mnt/data
-```
-
 ## Data Preparation and Training
 
 1. Download the tokenizer model from HuggingFace:
 ```sh
 mkdir -p /mnt/data/tokenizer
 huggingface-cli login
-huggingface-cli download meta-llama/Llama-2-70b-hf tokenizer.model --local-dir /mnt/data/tokenizer
-huggingface-cli download meta-llama/Llama-2-70b-hf config.json --local-dir /mnt/data/tokenizer
+huggingface-cli download meta-llama/Llama-3.1-8B-Instruct tokenizer_config.json --local-dir /mnt/data/tokenizer
+huggingface-cli download meta-llama/Llama-3.1-8B-Instruct tokenizer.json --local-dir /mnt/data/tokenizer
 ```
 
 2. Apply the preprocessing job that will download and tokenize parts of the Pile dataset:
 ```sh
-kubectl apply -f preprocessing.yaml
+helm install --set num_nodes=1 --set download_data=true "my-preprocessing" ./training
 ```
 
 The progress can then be monitored by
 ```sh
-kubectl logs -f nemo-megatron-preprocessing-mpimaster-0
+kubectl logs -f megatron-prep-my-preprocessing-mpimaster-0
 ```
 
 3. Following successful preprocessing, the training can be started with:
 ```sh
-kubectl apply -f training_70b.yaml
+helm install --set num_nodes=1 "my-training-v0" ./training
 ```
 
 The progress can then be monitored by
 ```sh
-kubectl logs -f nemo-megatron-training-mpimaster-0
+kubectl logs -f megatron-train-my-training-v0-mpimaster-0
 ```
 
 4. Calculate training throughput. For this, the following data is required from the training output:
@@ -97,48 +95,13 @@ Data-Intensive Workloads](https://docs.oracle.com/en-us/iaas/Content/Resources/A
 ```
 This log can be saved into a file with:
 ```sh
-kubectl logs nemo-megatron-training-mpimaster-0 > training.log
+kubectl logs megatron-train-my-training-v0-mpimaster-0 > training.log
 ```
 and the performance analyzed with
 ```sh
 python3 utils/performance.py training.log
 ```
 
-## Changing the Configuration
-
-* **Increase the training file count**
-
-To increase the amount of training data, edit the file count by modifying all
-occurences of the `file_numbers=0-0` range in
-[`preprocessing.yaml`](./files/preprocessing.yaml) and re-run the
-preprocessing step.
-E.g. change this setting to `file_numbers=0-9` to process 10 files.
-
-Then modify the file list in the training configuration, e.g.,
-[`config_7b.yaml`](./files/config/config_7b.yaml)
-to match the file count:
-```yaml
-data_prefix:
-- 1
-- /mnt/data/pile/my-gpt3_00_text_document
-...
-- 1
-- /mnt/data/pile/my-gpt3_09_text_document
-```
-
-* **Vary the node count for training**
-
-Changing the node count will require modifications to both the training
-configuration and the Volcano job. E.g. to double the node count for the 7B
-example, modify
-
-* [`training_7b.yaml`](./files/training_7b.yaml) to have twice the replica
-count for the `mpiworker` definition
-* Double the `num_nodes` and `global_batch_size` keys in
-[`config_7b.yaml`](./files/config/config_7b.yaml). In the optimal case,
-this should give constant performance in terms of token throughput per
-second per GPU.
-
 ## Potential Issues
 
 * **PyTorch can't resolve hostnames via c10d**
@@ -149,10 +112,9 @@ Data-Intensive Workloads](https://docs.oracle.com/en-us/iaas/Content/Resources/A
 
 For convenience, this is facilitated by enhancing `mpi.yaml` via
 ```sh
-./utils/host_list.sh >> mpi.yaml
-kubectl apply -f mpi.yaml
+./utils/host_list.sh >> ./training/files/mpi.yaml
 ```
-and afterwards restarting the training job.
+and afterwards reinstalling the training job via Helm.
 
 # Acknowledgments
 
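A note on the workflow shown in this README diff: run parameters are now Helm values rather than fields edited inside static manifests. Only `num_nodes` and `download_data` appear in the commands above; the other keys in `training/values.yaml` (storage server, export path) are referenced but not spelled out in this diff. The following is a minimal sketch of driving a run from an override file instead of repeated `--set` flags; the file name and the values chosen here are arbitrary examples.

```sh
# Sketch only: num_nodes and download_data are the two keys visible in the
# README's helm install commands; further entries in training/values.yaml
# (storage server, export path) are not shown in this commit excerpt.
cat > my-values.yaml <<'EOF'
num_nodes: 2
download_data: false
EOF

# Install a training release from the override file instead of repeated --set flags.
helm install -f my-values.yaml my-training-v1 ./training

# Inspect and clean up releases.
helm list
helm uninstall my-training-v1
```

Keeping the overrides in a small file makes a particular run easy to reproduce or compare against a later one.
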
.helmignore (new file)

@@ -0,0 +1,23 @@
+# Patterns to ignore when building packages.
+# This supports shell glob matching, relative path matching, and
+# negation (prefixed with !). Only one pattern per line.
+.DS_Store
+# Common VCS dirs
+.git/
+.gitignore
+.bzr/
+.bzrignore
+.hg/
+.hgignore
+.svn/
+# Common backup files
+*.swp
+*.bak
+*.tmp
+*.orig
+*~
+# Various IDEs
+.project
+.idea/
+*.tmproj
+.vscode/
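
These are Helm's stock `.helmignore` patterns; they control what is left out when the chart is packaged from its directory. A short sketch, assuming the chart sits at the `./training` path used by the README's commands:

```sh
# Package the chart directory referenced by the README's helm install commands;
# anything matching the .helmignore patterns above (VCS metadata, editor backups,
# IDE folders) is excluded from the archive.
helm package ./training

# The archive name comes from the name and version fields in Chart.yaml (next file).
tar tzf training-*.tgz | head
```
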
Chart.yaml (new file)

@@ -0,0 +1,6 @@
+apiVersion: v2
+name: training
+description: A Helm chart to train LLM on Kubernetes using NVIDIA NeMo and Megatron
+type: application
+version: 0.1.0
+appVersion: "1.16.0"
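
With `Chart.yaml` in place the directory is a complete chart, so it can be linted and rendered locally before anything is installed on the cluster. A sketch using an arbitrary release name and the two values that appear in the README diff:

```sh
# Lint the chart and render its templates locally; the release name
# "render-test" and the value overrides are arbitrary examples.
helm lint ./training
helm template render-test ./training \
  --set num_nodes=1 --set download_data=true > rendered.yaml

# Optionally validate the rendered manifests without creating anything.
kubectl apply --dry-run=client -f rendered.yaml
```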
