|
1 | | -# Training LLMs with NVIDIA NeMo using Oracle Container Engine for Kubernetes |
| 1 | +# Overview |
2 | 2 |
|
3 | | -This repository demonstrates how to train LLM using |
4 | | -[NVIDIA NeMo](https://www.nvidia.com/en-gb/ai-data-science/products/nemo/) |
5 | | -on the Oracle Container Engine for Kubernetes (OKE) using |
6 | | -[NVIDIA Megatron](https://developer.nvidia.com/megatron-core). |
| 3 | +This repository provides a step-by-step deployment of DeepSpeed training for Large Language Models (LLMs) on Oracle Cloud Infrastructure (OCI), using H100 GPU clusters with RDMA and SLURM. |
7 | 4 |
|
8 | | -Reference results from NVIDIA to train Llama 3 can be found on the |
9 | | -[NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dgxc-benchmarking/resources/llama3-dgxc-benchmarking). |
10 | | - |
11 | | -Reviewed: 18.03.2025 |
| 5 | +This setup includes a tuned DeepSpeed configuration (`tuned_ds_config.json`) that provides up to **13% performance improvement** over standard configurations. |
12 | 6 |
|
| 7 | +Reviewed: 06.06.2025 |
13 | 8 | # When to use this asset? |
14 | 9 |
|
15 | | -* If you want to get started with training LLM like Llama 3 on Kubernetes using OCI. |
| 10 | +Use this asset when you need to: |
| 11 | +- Train large-scale language models on OCI with H100 hardware |
| 12 | +- Utilize RDMA-enabled SLURM clusters for distributed multi-node DeepSpeed training |
| 13 | +- Achieve improved throughput via custom-tuned DeepSpeed JSON configs |
16 | 14 |
|
17 | 15 | # How to use this asset? |
18 | 16 |
|
19 | | -## Prerequisites |
20 | | - |
21 | | -* You have access to an Orcale Cloud Tenancy. |
22 | | -* You have access to shapes with NVIDIA GPUs such as H100. |
23 | | -* You have a HuggingFace account and access to `meta-llama/Llama-3.1-8B-Instruct`. |
24 | | - |
25 | | -This guide is loosely based on the |
26 | | -[NVIDIA NeMo Framework Launcher guide for Kubernetes](https://docs.nvidia.com/nemo-framework/user-guide/24.07/playbooks/kubernetes.html). |
27 | | - |
28 | | -## Infrastructure Setup |
29 | | - |
30 | | -1. Create an OKE cluster according |
31 | | - [to the instructions](https://github.com/oracle-quickstart/oci-hpc-oke/tree/main#instructions-for-deploying-an-oke-cluster-with-gpus-and-rdma-connectivity), |
32 | | - importing one of the images and creating a GPU partition with BM.GPU.H100.8 nodes. |
33 | | - |
34 | | - The configuration here assumes a minimum of 16 BM.GPU.H100.8 nodes. |
35 | | - |
36 | | - - Ensure that the follwing setting is selected under the "OKE Cluster" section: |
37 | | - |
38 | | - > Disable OKE GPU device plugin |
39 | | -
|
40 | | - as this tutorial will install the GPU operator later. |
41 | | - |
42 | | -2. Create a new File System for NFS, and modify the [persistent volume configuration in `pv.yaml`](./files/pv.yaml) to match. |
43 | | - Optimally, this will utilize High Performance Mount Targets (HMPT) as described in the following two whitepapers: |
44 | | - * [Scale Out OCI File Storage Performance for AI/ML and |
45 | | -Data-Intensive Workloads](https://docs.oracle.com/en-us/iaas/Content/Resources/Assets/whitepapers/scale-out-oci-file-storage-performance-for-data-intensive-workloads.pdf) |
46 | | - * [File Storage Performance Guide](https://docs.oracle.com/en-us/iaas/Content/Resources/Assets/whitepapers/file-storage-performance-guide.pdf) |
47 | | - |
48 | | -3. Install the NVIDIA GPU Operator according to |
49 | | - [NVIDIA NeMo Framework Launcher guide for Kubernetes](https://docs.nvidia.com/nemo-framework/user-guide/24.07/playbooks/kubernetes.html), then install the [Volcano scheduler](https://github.com/volcano-sh/volcano) with: |
50 | | - ```sh |
51 | | - kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml |
52 | | - ``` |
53 | | - |
54 | | -4. Copy the [files in this repository](./files) to the Kubernetes operator node. |
55 | | - You can download them from this repository via: |
56 | | - ```sh |
57 | | - BRANCH=main |
58 | | - curl -L https://github.com/oracle-devrel/technology-engineering/archive/refs/heads/${BRANCH}.tar.gz|tar xzf - --strip-components=6 technology-engineering-${BRANCH}/cloud-infrastructure/ai-infra-gpu/ai-infrastructure/nemo-megatron-training-oke/files |
59 | | - ``` |
60 | | - |
61 | | - Then modify the values in [`training/values.yaml`](./files/training/values.yaml) to match the storage server and export path. |
62 | | - |
63 | | -5. Mount the file system on the Kubernetes operator node. In the following, the mount location is assumed to be `/mnt/data/`. |
64 | | - |
65 | | -## Data Preparation and Training |
66 | | - |
67 | | -1. Download the tokenizer model from HuggingFace: |
68 | | - ```sh |
69 | | - mkdir -p /mnt/data/tokenizer |
70 | | - huggingface-cli login |
71 | | - huggingface-cli download meta-llama/Llama-3.1-8B-Instruct tokenizer_config.json --local-dir /mnt/data/tokenizer |
72 | | - huggingface-cli download meta-llama/Llama-3.1-8B-Instruct tokenizer.json --local-dir /mnt/data/tokenizer |
73 | | - ``` |
74 | | - |
75 | | -2. Apply the preprocessing job that will download and tokenize parts of the Pile dataset: |
76 | | - ```sh |
77 | | - helm install --set num_nodes=1 --set download_data=true "my-preprocessing" ./training |
78 | | - ``` |
79 | | - |
80 | | - The progress can then be monitored by |
81 | | - ```sh |
82 | | - kubectl logs -f megatron-prep-my-preprocessing-mpimaster-0 |
83 | | - ``` |
84 | | - |
85 | | -3. Following successful preprocessing, the training can be started with: |
86 | | - ```sh |
87 | | - helm install --set num_nodes=1 "my-training-v0" ./training |
88 | | - ``` |
89 | | - |
90 | | - The progress can then be monitored by |
91 | | - ```sh |
92 | | - kubectl logs -f megatron-train-my-training-v0-mpimaster-0 |
93 | | - ``` |
94 | | - |
95 | | -4. Calculate training throughput. For this, the following data is required from the training output: |
96 | | - ``` |
97 | | - [NeMo I 2025-03-10 16:24:43 perf_metrics_utils:42] train_step_timing in s: [7.13, 7.12, 7.12, 7.13, 7.13, 7.13, 7.12, 7.13, 7.14, 7.13, 7.14, 7.26, 7.13, 7.13, 7.13, 7.13, 7.15, 7.14, 7.14, 7.13, 7.14, 7.14, 7.14, 7.14, 7.13, 7.14, 7.14, 7.14, 7.14, 7.14] |
98 | | - ``` |
99 | | - This log can be saved into a file with: |
100 | | - ```sh |
101 | | - kubectl logs megatron-train-my-training-v0-mpimaster-0 > training.log |
102 | | - ``` |
103 | | - and the performance analyzed with |
104 | | - ```sh |
105 | | - python3 utils/performance.py training.log |
106 | | - ``` |
107 | | - |
108 | | -## Potential Issues |
109 | | - |
110 | | -* **PyTorch can't resolve hostnames via c10d** |
111 | | - |
112 | | - If the rendezvous backend for PyTorch fails to connect to an OCI style |
113 | | - hostname for Kubernetes clusters, one work around this resolution failure by |
114 | | - augmenting `/etc/hosts` for every pod. |
115 | | - |
116 | | - For convenience, this is facilitated by enhancing `mpi.yaml` via |
117 | | - ```sh |
118 | | - ./utils/host_list.sh >> ./training/files/mpi.yaml |
119 | | - ``` |
120 | | - and afterwards reinstalling the training job via Helm. |
| 17 | +## Prerequisites & Docs |
| 18 | + |
| 19 | +### Prerequisites |
| 20 | + |
| 21 | +* An OCI tenancy with H100 GPU quota (shape: BM.GPU.H100.8). |
| 22 | +* A [Hugging Face](https://huggingface.co/) account with a valid access token. |
| 23 | +* SSH access to the deployed head node of your SLURM cluster. |
| 24 | + |
| 25 | +### Documentation & Resources |
| 26 | + |
| 27 | +* [DeepSpeed Documentation](https://www.deepspeed.ai/docs/) |
| 28 | +* [TinyLlama Model (HF)](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0) |
| 29 | +* [Mistral LLMs](https://mistral.ai/technology/#models) |
| 30 | + |
| 31 | +## Model Training Workflow |
| 32 | + |
| 33 | +### Instance Configuration |
| 34 | + |
| 35 | +The deployment uses a cluster of `BM.GPU.H100.8` bare metal instances, provisioned with cluster networking and RDMA. |
| 36 | + |
| 37 | +The DeepSpeed job is submitted via SLURM using the `run_deepspeed.slurm` script. The environment includes a shared OCI File Storage System mounted on all nodes. |
| 38 | + |
| 39 | +### DeepSpeed Tuned Configuration |
| 40 | + |
| 41 | +The `tuned_ds_config.json` applies the following optimizations: |
| 42 | +- Switched from fp16 to bf16 (optimal for H100) |
| 43 | +- Enabled overlap_comm, contiguous_gradients, and increased bucket sizes |
| 44 | +- Used gradient_accumulation_steps=8 to balance memory use and throughput |
| 45 | +- Tweaked aio settings for better I/O performance during training |
| 46 | +- Removed optimizer/parameter offloading to fully utilize GPU RAM |
| 47 | + |
| 48 | +These optimizations are benchmarked to deliver up to **13% faster training throughput** on OCI H100 clusters. |
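
The optimizations listed above can be sketched as a DeepSpeed configuration. This is a minimal illustration and not the contents of the actual `tuned_ds_config.json`: only bf16, `overlap_comm`, `contiguous_gradients`, `gradient_accumulation_steps=8`, and the absence of offloading come from the list above. The ZeRO stage, micro-batch size, bucket sizes, and `aio` values are assumptions chosen for the example.

```python
import json

# Illustrative sketch only; values marked "assumed" are not from the repository.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,       # assumed
    "gradient_accumulation_steps": 8,          # stated in the list above
    "bf16": {"enabled": True},                 # bf16 instead of fp16
    "zero_optimization": {
        "stage": 2,                            # assumed ZeRO stage
        "overlap_comm": True,                  # overlap gradient reduction with backward pass
        "contiguous_gradients": True,          # copy gradients into a contiguous buffer
        "reduce_bucket_size": 1_000_000_000,   # assumed "increased" bucket size
        "allgather_bucket_size": 1_000_000_000,  # assumed "increased" bucket size
        # No "offload_optimizer" or "offload_param" entries: everything stays on the GPU.
    },
    "aio": {                                   # placeholder values, not the tuned ones
        "block_size": 1048576,
        "queue_depth": 8,
        "thread_count": 1,
        "single_submit": False,
        "overlap_events": True,
    },
}

# Print the configuration as JSON; redirect the output to a file to use it with DeepSpeed.
print(json.dumps(ds_config, indent=2))
```

Comparing a sketch like this with the shipped `tuned_ds_config.json` is a quick way to see which settings were actually changed for the benchmarked result.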
| 49 | + |
| 50 | +### Launch Training Job |
| 51 | + |
| 52 | +Submit your training job using SLURM: |
| 53 | + |
| 54 | +```bash |
| 55 | +sbatch $HOME/scripts/run_deepspeed.slurm |
| 56 | +``` |
| 57 | + |
| 58 | +The job script uses: |
| 59 | +- `train.py`: your LLM training script |
| 60 | +- `tuned_ds_config.json`: DeepSpeed configuration file |
| 61 | +- Local datasets and Hugging Face model/tokenizer |
| 62 | + |
| 63 | +### Example curl Test (after model fine-tuning) |
| 64 | + |
| 65 | +To test the trained model once it is served through an OpenAI-compatible API: |
| 66 | + |
| 67 | +```bash |
| 68 | +curl http://localhost:8000/v1/completions \ |
| 69 | + -H "Content-Type: application/json" \ |
| 70 | + -d '{ |
| 71 | + "model": "your-model-name", |
| 72 | + "prompt": "A GPU is a", |
| 73 | + "max_tokens": 128, |
| 74 | + "temperature": 0.7 |
| 75 | + }' |
| 76 | +``` |
| 77 | + |
| 78 | +## Notes |
| 79 | + |
| 80 | +To train larger models like Mixtral or Mistral 7B on H100, make sure to: |
| 81 | +- Scale the number of nodes appropriately |
| 82 | +- Use quantization or tensor parallelism when needed |
| 83 | +- Ensure the model and per-GPU batches fit into GPU memory, using DeepSpeed ZeRO to partition model states |
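
To judge whether a model fits, a rough per-GPU estimate of model-state memory can help. The sketch below assumes mixed-precision training with Adam, using the commonly cited figure of about 16 bytes per parameter (2 for the 16-bit weights, 2 for the gradients, 12 for the fp32 optimizer states), and shows how each ZeRO stage partitions those states across GPUs. The function name is hypothetical, and activations, buffers, and fragmentation are not included, so real usage will be higher.

```python
def zero_model_state_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU model-state memory in GB (10^9 bytes) under ZeRO.

    Assumes mixed-precision Adam: 2 bytes/param for weights, 2 for gradients,
    12 for optimizer states. Excludes activations and temporary buffers.
    """
    params = 2.0 * num_params
    grads = 2.0 * num_params
    optim = 12.0 * num_params
    if stage >= 1:   # ZeRO-1 partitions optimizer states
        optim /= num_gpus
    if stage >= 2:   # ZeRO-2 also partitions gradients
        grads /= num_gpus
    if stage >= 3:   # ZeRO-3 also partitions parameters
        params /= num_gpus
    return (params + grads + optim) / 1e9


# Example: a model with roughly 7 billion parameters on one 8-GPU node.
for stage in (0, 1, 2, 3):
    print(f"ZeRO stage {stage}: {zero_model_state_gb(7e9, 8, stage):.2f} GB per GPU")
```

With these assumptions, the unpartitioned model states of a 7-billion-parameter model come to about 112 GB, more than a single 80 GB H100 holds, which is why ZeRO partitioning or additional nodes are needed as models grow.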
121 | 84 |
|
122 | 85 | # Acknowledgments |
123 | 86 |
|
124 | | -- **Author** - Matthias Wolf (GPU Solution Specialist) |
| 87 | +- **Author** - Deepak Soni (GPU Black Belt) |
125 | 88 |
|
126 | 89 | # License |
127 | 90 |
|
128 | 91 | Copyright (c) 2025 Oracle and/or its affiliates. |
129 | 92 |
|
130 | 93 | Licensed under the Universal Permissive License (UPL), Version 1.0. |
131 | 94 |
|
132 | | -See [LICENSE](https://github.com/oracle-devrel/technology-engineering/blob/main/LICENSE) for more details. |
| 95 | +See [LICENSE](https://github.com/oracle-devrel/technology-engineering/blob/main/LICENSE) for more details. |