@@ -48,47 +48,45 @@ Data-Intensive Workloads](https://docs.oracle.com/en-us/iaas/Content/Resources/A
3. Install Helm, the NVIDIA GPU Operator, and the Volcano scheduler according to
the [NVIDIA NeMo Framework Launcher guide for Kubernetes](https://docs.nvidia.com/nemo-framework/user-guide/24.07/playbooks/kubernetes.html).

- 4. Apply the persistent volume configuration and MPI parameter configuration map:
+ 4. Copy the [files in this repository](./files) to the Kubernetes operator node.
+ You can download them from this repository via:

```sh
- kubectl apply -f mpi.yaml
- kubectl apply -f pv.yaml
+ BRANCH=main
+ curl -L https://github.com/oracle-devrel/technology-engineering/archive/refs/heads/${BRANCH}.tar.gz | tar xzf - --strip-components=6 technology-engineering-${BRANCH}/cloud-infrastructure/ai-infra-gpu/ai-infrastructure/nemo-megatron-training-oke/files
```
+
+ Then modify the values in [`training/values.yaml`](./files/training/values.yaml) to match the storage server and export path.

5. Mount the file system on the Kubernetes operator node. In the following, the mount location is assumed to be `/mnt/data/`.

- 6. Copy the node sorting script and LLM configuration into the file system:
- ```sh
- cp -R config utils /mnt/data
- ```
-

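As a quick sanity check after step 4, the chart values file and helper scripts referenced later in this guide should now be present on the operator node. A minimal sketch (paths are taken from references elsewhere in this document; adjust if your copy is laid out differently):

```sh
# Optional: confirm that the Helm chart values and the helper scripts
# copied in step 4 are in place before continuing.
ls training/values.yaml utils/host_list.sh utils/performance.py
```
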
## Data Preparation and Training

1. Download the tokenizer model from HuggingFace:
```sh
mkdir -p /mnt/data/tokenizer
huggingface-cli login
- huggingface-cli download meta-llama/Llama-2-70b-hf tokenizer.model --local-dir /mnt/data/tokenizer
- huggingface-cli download meta-llama/Llama-2-70b-hf config.json --local-dir /mnt/data/tokenizer
+ huggingface-cli download meta-llama/Llama-3.1-8B-Instruct tokenizer_config.json --local-dir /mnt/data/tokenizer
+ huggingface-cli download meta-llama/Llama-3.1-8B-Instruct tokenizer.json --local-dir /mnt/data/tokenizer
```

2. Apply the preprocessing job that will download and tokenize parts of the Pile dataset:
```sh
- kubectl apply -f preprocessing.yaml
+ helm install --set num_nodes=1 --set download_data=true "my-preprocessing" ./training
```

The progress can then be monitored by
```sh
- kubectl logs -f nemo-megatron-preprocessing-mpimaster-0
+ kubectl logs -f megatron-prep-my-preprocessing-mpimaster-0
```

3. Following successful preprocessing, the training can be started with:
```sh
- kubectl apply -f training_70b.yaml
+ helm install --set num_nodes=1 "my-training-v0" ./training
```

The progress can then be monitored by
```sh
- kubectl logs -f nemo-megatron-training-mpimaster-0
+ kubectl logs -f megatron-train-my-training-v0-mpimaster-0
```

4. Calculate training throughput. For this, the following data is required from the training output:
@@ -97,48 +95,13 @@ Data-Intensive Workloads](https://docs.oracle.com/en-us/iaas/Content/Resources/A
```
This log can be saved into a file with:
```sh
- kubectl logs nemo-megatron-training-mpimaster-0 > training.log
+ kubectl logs megatron-train-my-training-v0-mpimaster-0 > training.log
```
and the performance analyzed with
```sh
python3 utils/performance.py training.log
```
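For orientation, the reported numbers can be cross-checked by hand from the logged step time using the usual estimate tokens/sec/GPU = global batch size * sequence length / (step time * number of GPUs). A minimal sketch; the values below are placeholders rather than settings from this repository, and the calculation is not necessarily identical to what `utils/performance.py` computes:

```sh
# Back-of-the-envelope throughput estimate (placeholder values).
GLOBAL_BATCH_SIZE=128   # global batch size of your run
SEQ_LENGTH=8192         # sequence length of your run
STEP_TIME=10.5          # average train step time in seconds, from training.log
NUM_GPUS=8              # total GPUs across all nodes
python3 -c "print($GLOBAL_BATCH_SIZE * $SEQ_LENGTH / ($STEP_TIME * $NUM_GPUS))"
```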

- ## Changing the Configuration
-
- * **Increase the training file count**
-
- To increase the amount of training data, edit the file count by modifying all
- occurrences of the `file_numbers=0-0` range in
- [`preprocessing.yaml`](./files/preprocessing.yaml) and re-run the
- preprocessing step.
- E.g. change this setting to `file_numbers=0-9` to process 10 files.
-
- Then modify the file list in the training configuration, e.g.,
- [`config_7b.yaml`](./files/config/config_7b.yaml)
- to match the file count:
- ```yaml
- data_prefix:
- - 1
- - /mnt/data/pile/my-gpt3_00_text_document
- ...
- - 1
- - /mnt/data/pile/my-gpt3_09_text_document
- ```
-
- * **Vary the node count for training**
-
- Changing the node count will require modifications to both the training
- configuration and the Volcano job. E.g. to double the node count for the 7B
- example, modify
-
- * [`training_7b.yaml`](./files/training_7b.yaml) to have twice the replica
- count for the `mpiworker` definition
- * Double the `num_nodes` and `global_batch_size` keys in
- [`config_7b.yaml`](./files/config/config_7b.yaml). In the optimal case,
- this should give constant performance in terms of token throughput per
- second per GPU.
-
## Potential Issues

* **PyTorch can't resolve hostnames via c10d**
@@ -149,10 +112,9 @@ Data-Intensive Workloads](https://docs.oracle.com/en-us/iaas/Content/Resources/A

For convenience, this is facilitated by enhancing `mpi.yaml` via
```sh
- ./utils/host_list.sh >> mpi.yaml
- kubectl apply -f mpi.yaml
+ ./utils/host_list.sh >> ./training/files/mpi.yaml
```
- and afterwards restarting the training job.
+ and afterwards reinstalling the training job via Helm.
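A reinstall could, for example, look like the following, reusing the release name and node count from the training step above (shown as one plausible sequence; substitute whatever release name you chose):

```sh
# Remove the existing release and install it again so the updated mpi.yaml is picked up.
helm uninstall "my-training-v0"
helm install --set num_nodes=1 "my-training-v0" ./training
```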
# Acknowledgments