|
# Training LLMs with NVIDIA NeMo using Oracle Container Engine for Kubernetes

This repository demonstrates how to train LLMs with
[NVIDIA NeMo](https://www.nvidia.com/en-gb/ai-data-science/products/nemo/)
on Oracle Container Engine for Kubernetes (OKE) using
[NVIDIA Megatron](https://developer.nvidia.com/megatron-core).

Reference results from NVIDIA for training Llama 2 can be found in the
[NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dgxc-benchmarking/resources/llama2-dgxc-benchmarking).

Reviewed: dd.mm.yyyy

# When to use this asset?

* If you want to get started with training LLMs like Llama 2 on Kubernetes using OCI.

# How to use this asset?

## Prerequisites

* You have access to an Oracle Cloud tenancy.
* You have access to shapes with NVIDIA GPUs, such as the A100 or H100.
* You have a HuggingFace account and access to `meta-llama/Llama-2-70b-hf`.

This guide is loosely based on the
[NVIDIA NeMo Framework Launcher guide for Kubernetes](https://docs.nvidia.com/nemo-framework/user-guide/24.07/playbooks/kubernetes.html).

## Infrastructure Setup

1. Create an OKE cluster according to
   [the instructions](https://github.com/oracle-quickstart/oci-hpc-oke/tree/main#instructions-for-deploying-an-oke-cluster-with-gpus-and-rdma-connectivity),
   importing one of the images and creating a GPU partition with BM.GPU.H100.8 nodes.

   The configuration here assumes a minimum of 16 BM.GPU.H100.8 nodes.

   - Ensure that the following setting is selected under the "OKE Cluster" section:

     > Disable OKE GPU device plugin

     as this tutorial will install the GPU operator later.
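
   Once the nodes are up, a quick sanity check is to list the GPU worker nodes and their shapes. This sketch assumes that OKE populates the standard `node.kubernetes.io/instance-type` label with the shape name:
   ```sh
   # Show all nodes with their shape; expect 16 entries of BM.GPU.H100.8.
   kubectl get nodes -L node.kubernetes.io/instance-type
   ```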

2. Create a new File System for NFS and modify the [persistent volume configuration in `pv.yaml`](./files/pv.yaml) to match (a sketch of the relevant fields is shown below).
   Ideally, this will utilize High Performance Mount Targets (HPMT) as described in the following two whitepapers:
   * [Scale Out OCI File Storage Performance for AI/ML and Data-Intensive Workloads](https://docs.oracle.com/en-us/iaas/Content/Resources/Assets/whitepapers/scale-out-oci-file-storage-performance-for-data-intensive-workloads.pdf)
   * [File Storage Performance Guide](https://docs.oracle.com/en-us/iaas/Content/Resources/Assets/whitepapers/file-storage-performance-guide.pdf)
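
   As a minimal sketch of an NFS-backed PersistentVolume (the server address and export path below are placeholders to be replaced with the values of your mount target and file system; the actual `pv.yaml` in this repository may carry additional fields):
   ```yaml
   apiVersion: v1
   kind: PersistentVolume
   metadata:
     name: nfs-data            # keep the name expected by the existing configuration
   spec:
     capacity:
       storage: 10Ti
     accessModes:
       - ReadWriteMany
     nfs:
       server: 10.0.0.10       # placeholder: private IP of the mount target
       path: /data             # placeholder: export path of the file system
   ```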

3. Install Helm, the NVIDIA GPU Operator, and the Volcano scheduler according to the
   [NVIDIA NeMo Framework Launcher guide for Kubernetes](https://docs.nvidia.com/nemo-framework/user-guide/24.07/playbooks/kubernetes.html).
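
   For reference, a typical installation looks roughly like the following sketch; consult the linked guide for the chart versions and values recommended for NeMo:
   ```sh
   # NVIDIA GPU Operator; driver.enabled=false assumes the node image
   # already ships the NVIDIA driver, as the OKE GPU images do.
   helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
   helm repo update
   helm install --wait gpu-operator nvidia/gpu-operator \
     --namespace gpu-operator --create-namespace \
     --set driver.enabled=false

   # Volcano scheduler
   helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
   helm install volcano volcano-sh/volcano \
     --namespace volcano-system --create-namespace
   ```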

4. Apply the persistent volume configuration and the MPI parameter configuration map:
   ```sh
   kubectl apply -f mpi.yaml
   kubectl apply -f pv.yaml
   ```

5. Mount the file system on the Kubernetes operator node. In the following, the mount location is assumed to be `/mnt/data/`.
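
   For example (the mount target IP address and export path are placeholders for the values of your File System):
   ```sh
   sudo mkdir -p /mnt/data
   sudo mount -t nfs 10.0.0.10:/data /mnt/data
   ```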

6. Copy the node sorting script and LLM configuration into the file system:
   ```sh
   cp -R config utils /mnt/data
   ```

## Data Preparation and Training

1. Download the tokenizer model from HuggingFace:
   ```sh
   mkdir -p /mnt/data/tokenizer
   huggingface-cli login
   huggingface-cli download meta-llama/Llama-2-70b-hf tokenizer.model --local-dir /mnt/data/tokenizer
   huggingface-cli download meta-llama/Llama-2-70b-hf config.json --local-dir /mnt/data/tokenizer
   ```

2. Apply the preprocessing job that will download and tokenize parts of the Pile dataset:
   ```sh
   kubectl apply -f preprocessing.yaml
   ```

   The progress can then be monitored with:
   ```sh
   kubectl logs -f nemo-megatron-preprocessing-mpimaster-0
   ```

3. Following successful preprocessing, the training can be started with:
   ```sh
   kubectl apply -f training_70b.yaml
   ```

   The progress can then be monitored with:
   ```sh
   kubectl logs -f nemo-megatron-training-mpimaster-0
   ```

4. Calculate the training throughput. For this, the following data is required from the training output:
   ```
   [NeMo I 2025-03-10 16:24:43 perf_metrics_utils:42] train_step_timing in s: [7.13, 7.12, 7.12, 7.13, 7.13, 7.13, 7.12, 7.13, 7.14, 7.13, 7.14, 7.26, 7.13, 7.13, 7.13, 7.13, 7.15, 7.14, 7.14, 7.13, 7.14, 7.14, 7.14, 7.14, 7.13, 7.14, 7.14, 7.14, 7.14, 7.14]
   ```
   This log can be saved into a file with:
   ```sh
   kubectl logs nemo-megatron-training-mpimaster-0 > training.log
   ```
   and the performance analyzed with:
   ```sh
   python3 utils/performance.py training.log
   ```
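
   For a rough manual estimate, the arithmetic is simply tokens processed per step divided by step time and GPU count. The values below are illustrative assumptions; take the batch size and sequence length from your training configuration and the step time from the log above:
   ```sh
   STEP_TIME=7.13          # mean train_step_timing in seconds, from the log above
   GLOBAL_BATCH_SIZE=128   # assumption: take from the training configuration
   SEQ_LENGTH=4096         # assumption: Llama 2 sequence length
   NUM_GPUS=$((16 * 8))    # assumption: 16 BM.GPU.H100.8 nodes with 8 GPUs each
   python3 -c "print($GLOBAL_BATCH_SIZE * $SEQ_LENGTH / $STEP_TIME / $NUM_GPUS, 'tokens/s/GPU')"
   ```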

## Changing the Configuration

* **Increase the training file count**

  To increase the amount of training data, edit the file count by modifying all
  occurrences of the `file_numbers=0-0` range in
  [`preprocessing.yaml`](./files/preprocessing.yaml) and re-run the
  preprocessing step.
  For example, change this setting to `file_numbers=0-9` to process 10 files.

  Then modify the file list in the training configuration, e.g.,
  [`config_7b.yaml`](./files/config/config_7b.yaml),
  to match the file count:
  ```yaml
  data_prefix:
  - 1
  - /mnt/data/pile/my-gpt3_00_text_document
  ...
  - 1
  - /mnt/data/pile/my-gpt3_09_text_document
  ```
|
* **Vary the node count for training**

  Changing the node count requires modifications to both the training
  configuration and the Volcano job. For example, to double the node count for
  the 7B example (a sketch of the changed fields follows the list):

  * Double the replica count of the `mpiworker` definition in
    [`training_7b.yaml`](./files/training_7b.yaml).
  * Double the `num_nodes` and `global_batch_size` keys in
    [`config_7b.yaml`](./files/config/config_7b.yaml). In the optimal case,
    this should give constant performance in terms of token throughput per
    second per GPU.
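
  As an illustrative sketch only (the concrete numbers are assumptions; double whatever values the repository's files currently contain), the edits touch fields like these:
  ```yaml
  # training_7b.yaml (Volcano job): double the mpiworker replicas, e.g. 2 -> 4
  - replicas: 4
    name: mpiworker

  # config_7b.yaml: scale num_nodes and global_batch_size by the same factor
  num_nodes: 4
  global_batch_size: 256
  ```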

## Potential Issues

* **PyTorch can't resolve hostnames via c10d**

  If the rendezvous backend for PyTorch fails to connect to an OCI-style
  hostname on Kubernetes clusters, one can work around this resolution failure by
  augmenting `/etc/hosts` for every pod.

  For convenience, this can be done by appending the generated host entries to `mpi.yaml` via
  ```sh
  ./utils/host_list.sh >> mpi.yaml
  kubectl apply -f mpi.yaml
  ```
  and afterwards restarting the training job.

# Acknowledgments

- **Author** - Matthias Wolf (GPU Solution Specialist)

# License

Copyright (c) 2025 Oracle and/or its affiliates.

Licensed under the Universal Permissive License (UPL), Version 1.0.

See [LICENSE](https://github.com/oracle-devrel/technology-engineering/blob/main/LICENSE) for more details.