# Training LLMs with NVIDIA NeMo using Oracle Container Engine for Kubernetes

This repository demonstrates how to train LLMs using
[NVIDIA NeMo](https://www.nvidia.com/en-gb/ai-data-science/products/nemo/)
on Oracle Container Engine for Kubernetes (OKE) with
[NVIDIA Megatron](https://developer.nvidia.com/megatron-core).

Reference results from NVIDIA for training Llama 3 can be found in the
[NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dgxc-benchmarking/resources/llama3-dgxc-benchmarking).

Reviewed: 18.03.2025

# When to use this asset?

* If you want to get started with training LLMs such as Llama 3 on Kubernetes using OCI.

# How to use this asset?

## Prerequisites

* You have access to an Oracle Cloud tenancy.
* You have access to shapes with NVIDIA GPUs, such as H100.
* You have a Hugging Face account and access to `meta-llama/Llama-3.1-8B-Instruct`.

This guide is loosely based on the
[NVIDIA NeMo Framework Launcher guide for Kubernetes](https://docs.nvidia.com/nemo-framework/user-guide/24.07/playbooks/kubernetes.html).

## Infrastructure Setup

1. Create an OKE cluster according to
   [the instructions](https://github.com/oracle-quickstart/oci-hpc-oke/tree/main#instructions-for-deploying-an-oke-cluster-with-gpus-and-rdma-connectivity),
   importing one of the images and creating a GPU partition with BM.GPU.H100.8 nodes.

   The configuration here assumes a minimum of 16 BM.GPU.H100.8 nodes.

   - Ensure that the following setting is selected under the "OKE Cluster" section:

     > Disable OKE GPU device plugin

     as this tutorial installs the GPU operator later.

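   After the node pool is active, you can optionally confirm that all GPU worker nodes have joined the cluster and report `Ready`:

   ```sh
   # List worker nodes; every BM.GPU.H100.8 node should appear with status Ready
   kubectl get nodes -o wide
   ```
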
2. Create a new File System for NFS, and modify the [persistent volume configuration in `pv.yaml`](./files/pv.yaml) to match.
   Ideally, this will use High Performance Mount Targets (HPMT), as described in the following two whitepapers:
   * [Scale Out OCI File Storage Performance for AI/ML and Data-Intensive Workloads](https://docs.oracle.com/en-us/iaas/Content/Resources/Assets/whitepapers/scale-out-oci-file-storage-performance-for-data-intensive-workloads.pdf)
   * [File Storage Performance Guide](https://docs.oracle.com/en-us/iaas/Content/Resources/Assets/whitepapers/file-storage-performance-guide.pdf)
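
   Before editing `pv.yaml`, you can optionally confirm that the export is reachable from a node with the NFS client utilities installed (the mount target IP below is a placeholder for your own value):

   ```sh
   # List the exports offered by the File Storage mount target
   showmount -e <mount-target-ip>
   ```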

3. Install the NVIDIA GPU Operator according to the
   [NVIDIA NeMo Framework Launcher guide for Kubernetes](https://docs.nvidia.com/nemo-framework/user-guide/24.07/playbooks/kubernetes.html), then install the [Volcano scheduler](https://github.com/volcano-sh/volcano) with:
   ```sh
   kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml
   ```
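
   Both installations can be checked before moving on; the namespaces below assume the GPU Operator was installed into `gpu-operator` and that the default Volcano manifest was used:

   ```sh
   # GPU Operator pods should be Running or Completed
   kubectl get pods -n gpu-operator
   # Volcano scheduler, controller, and admission pods should be Running
   kubectl get pods -n volcano-system
   ```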

4. Copy the [files in this repository](./files) to the Kubernetes operator node.
   You can download them from this repository via:
   ```sh
   BRANCH=main
   curl -L https://github.com/oracle-devrel/technology-engineering/archive/refs/heads/${BRANCH}.tar.gz | tar xzf - --strip-components=6 technology-engineering-${BRANCH}/cloud-infrastructure/ai-infra-gpu/ai-infrastructure/nemo-megatron-training-oke/files
   ```

   Then modify the values in [`training/values.yaml`](./files/training/values.yaml) to match the storage server and export path.

5. Mount the file system on the Kubernetes operator node. In the following, the mount location is assumed to be `/mnt/data/`.
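
   A minimal mount sketch, assuming a plain NFS export from the File Storage service (replace the mount target IP and export path with your own values, and add an `/etc/fstab` entry if the mount should persist across reboots):

   ```sh
   sudo mkdir -p /mnt/data
   sudo mount -t nfs <mount-target-ip>:/<export-path> /mnt/data
   ```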

## Data Preparation and Training

1. Download the tokenizer files from Hugging Face:
   ```sh
   mkdir -p /mnt/data/tokenizer
   huggingface-cli login
   huggingface-cli download meta-llama/Llama-3.1-8B-Instruct tokenizer_config.json --local-dir /mnt/data/tokenizer
   huggingface-cli download meta-llama/Llama-3.1-8B-Instruct tokenizer.json --local-dir /mnt/data/tokenizer
   ```
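
   You can optionally verify that both tokenizer files landed in the shared location:

   ```sh
   ls -l /mnt/data/tokenizer
   ```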

2. Apply the preprocessing job, which downloads and tokenizes parts of the Pile dataset:
   ```sh
   helm install --set num_nodes=1 --set download_data=true "my-preprocessing" ./training
   ```

   The progress can then be monitored with:
   ```sh
   kubectl logs -f megatron-prep-my-preprocessing-mpimaster-0
   ```
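
   Once the preprocessing pods have completed, the finished release can be removed before starting the training run (optional; the `grep` pattern simply matches the release name used above):

   ```sh
   kubectl get pods | grep my-preprocessing
   helm uninstall my-preprocessing
   ```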

3. Following successful preprocessing, the training can be started with:
   ```sh
   helm install --set num_nodes=1 "my-training-v0" ./training
   ```

   The progress can then be monitored with:
   ```sh
   kubectl logs -f megatron-train-my-training-v0-mpimaster-0
   ```
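
   To scale out, raise `num_nodes` to match the available GPU nodes; for example, with the 16 BM.GPU.H100.8 nodes assumed above (the release name is arbitrary):

   ```sh
   helm install --set num_nodes=16 "my-training-16n" ./training
   ```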

4. Calculate training throughput. For this, the following data is required from the training output:
   ```
   [NeMo I 2025-03-10 16:24:43 perf_metrics_utils:42] train_step_timing in s: [7.13, 7.12, 7.12, 7.13, 7.13, 7.13, 7.12, 7.13, 7.14, 7.13, 7.14, 7.26, 7.13, 7.13, 7.13, 7.13, 7.15, 7.14, 7.14, 7.13, 7.14, 7.14, 7.14, 7.14, 7.13, 7.14, 7.14, 7.14, 7.14, 7.14]
   ```
   This log can be saved into a file with:
   ```sh
   kubectl logs megatron-train-my-training-v0-mpimaster-0 > training.log
   ```
   and the performance analyzed with:
   ```sh
   python3 utils/performance.py training.log
   ```
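
   To inspect the raw numbers directly, the relevant lines can also be pulled out of the log with a simple filter:

   ```sh
   # Extract the reported per-step timings (seconds per training step)
   grep "train_step_timing in s" training.log
   ```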

## Potential Issues

* **PyTorch can't resolve hostnames via c10d**

  If the rendezvous backend for PyTorch fails to connect to an OCI-style
  hostname of the Kubernetes cluster, one can work around this resolution failure by
  augmenting `/etc/hosts` for every pod.

  For convenience, this can be done by appending the generated host list to `mpi.yaml` via
  ```sh
  ./utils/host_list.sh >> ./training/files/mpi.yaml
  ```
  and afterwards reinstalling the training job via Helm.
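
  A minimal sketch of the reinstall, reusing the release name from the training step above:

  ```sh
  helm uninstall my-training-v0
  helm install --set num_nodes=1 "my-training-v0" ./training
  ```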

# Acknowledgments

- **Author** - Matthias Wolf (GPU Solution Specialist)

# License

Copyright (c) 2025 Oracle and/or its affiliates.

Licensed under the Universal Permissive License (UPL), Version 1.0.

See [LICENSE](https://github.com/oracle-devrel/technology-engineering/blob/main/LICENSE) for more details.