3. Install Helm, the NVIDIA GPU Operator, and the Volcano scheduler according to
   the [NVIDIA NeMo Framework Launcher guide for Kubernetes](https://docs.nvidia.com/nemo-framework/user-guide/24.07/playbooks/kubernetes.html).
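
   Before moving on, it can help to verify that the operator and scheduler came up cleanly. A quick check, assuming the charts were installed into their usual `gpu-operator` and `volcano-system` namespaces:
   ```sh
   # List Helm releases and confirm the GPU Operator and Volcano pods are running
   helm list -A
   kubectl get pods -n gpu-operator
   kubectl get pods -n volcano-system
   ```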

4. Copy the [files in this repository](./files) to the Kubernetes operator node.
   You can download them from this repository via:
   ```sh
   BRANCH=main
   curl -L https://github.com/oracle-devrel/technology-engineering/archive/refs/heads/${BRANCH}.tar.gz | tar xzf - --strip-components=6 technology-engineering-${BRANCH}/cloud-infrastructure/ai-infra-gpu/ai-infrastructure/nemo-megatron-training-oke/files
   ```

   Then modify the values in [`training/values.yaml`](./files/training/values.yaml) to match the storage server and export path.
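
   As a rough sketch, the file might contain entries along these lines; the storage-related key names are hypothetical placeholders here, so check `values.yaml` itself for the actual ones (only `num_nodes` and `download_data` are confirmed by the commands below):
   ```yaml
   # Hypothetical example -- adapt key names to the actual values.yaml
   num_nodes: 1
   download_data: false
   storage_server: 10.0.0.5    # placeholder: address of the storage server
   export_path: /export/data   # placeholder: export path on that server
   ```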

5. Mount the file system on the Kubernetes operator node. In the following, the mount location is assumed to be `/mnt/data/`.
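
   For an NFS export, the mount might look like the following, with the server address and export path as placeholders for the values configured above:
   ```sh
   sudo mkdir -p /mnt/data
   # Replace 10.0.0.5:/export/data with your storage server and export path
   sudo mount -t nfs 10.0.0.5:/export/data /mnt/data
   ```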

## Data Preparation and Training

1. Download the tokenizer files from HuggingFace:
   ```sh
   mkdir -p /mnt/data/tokenizer
   huggingface-cli login
   huggingface-cli download meta-llama/Llama-3.1-8B-Instruct tokenizer_config.json --local-dir /mnt/data/tokenizer
   huggingface-cli download meta-llama/Llama-3.1-8B-Instruct tokenizer.json --local-dir /mnt/data/tokenizer
   ```
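
   A quick sanity check, assuming the mount location from the previous section, is to list the downloaded files:
   ```sh
   ls -l /mnt/data/tokenizer   # should show tokenizer_config.json and tokenizer.json
   ```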

2. Launch the preprocessing job that will download and tokenize parts of the Pile dataset:
   ```sh
   helm install --set num_nodes=1 --set download_data=true "my-preprocessing" ./training
   ```

   The progress can then be monitored by
   ```sh
   kubectl logs -f megatron-prep-my-preprocessing-mpimaster-0
   ```
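
   Beyond the logs, the job's pods can be inspected directly; they carry the Helm release name, matching the pattern above:
   ```sh
   kubectl get pods | grep my-preprocessing
   ```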

3. Following successful preprocessing, the training can be started with:
   ```sh
   helm install --set num_nodes=1 "my-training-v0" ./training
   ```

   The progress can then be monitored by
   ```sh
   kubectl logs -f megatron-train-my-training-v0-mpimaster-0
   ```
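
   Further runs can be installed as separate releases, e.g. with a higher node count. A hypothetical two-node run (release names are arbitrary; depending on the chart, batch-size settings may need to scale with the node count):
   ```sh
   helm install --set num_nodes=2 "my-training-v1" ./training
   ```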

4. Calculate training throughput. For this, the following data is required from the training output:
   ```
   ...
   ```
   This log can be saved into a file with:
   ```sh
   kubectl logs megatron-train-my-training-v0-mpimaster-0 > training.log
   ```
   and the performance analyzed with
   ```sh
   python3 utils/performance.py training.log
   ```
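
   Throughput in tokens per second per GPU is commonly computed as global batch size × sequence length / (step time × GPU count); whether `performance.py` uses exactly this formula is not shown here. A sketch with placeholder numbers, not measured results:
   ```sh
   # Placeholder values for illustration only
   GBS=128 SEQ_LEN=2048 STEP_TIME=2.0 NUM_GPUS=8
   awk -v g=$GBS -v s=$SEQ_LEN -v t=$STEP_TIME -v n=$NUM_GPUS \
       'BEGIN { printf "%.0f tokens/s/GPU\n", g * s / (t * n) }'
   ```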

## Potential Issues

* **PyTorch can't resolve hostnames via c10d**

  For convenience, this is facilitated by enhancing `mpi.yaml` via
  ```sh
  ./utils/host_list.sh >> ./training/files/mpi.yaml
  ```
  and afterwards reinstalling the training job via Helm.
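
  One possible reinstall sequence, reusing the release name from the training step above:
  ```sh
  helm uninstall "my-training-v0"
  helm install --set num_nodes=1 "my-training-v0" ./training
  ```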

# Acknowledgments