Skip to content

Commit 7e78d59

Browse files
committed
update the configuration
1 parent 83cde75 commit 7e78d59

File tree

5 files changed

+166
-155
lines changed

5 files changed

+166
-155
lines changed
Lines changed: 77 additions & 114 deletions
Original file line numberDiff line numberDiff line change
@@ -1,132 +1,95 @@
1-
# Training LLMs with NVIDIA NeMo using Oracle Container Engine for Kubernetes
1+
# Overview
22

3-
This repository demonstrates how to train LLM using
4-
[NVIDIA NeMo](https://www.nvidia.com/en-gb/ai-data-science/products/nemo/)
5-
on the Oracle Container Engine for Kubernetes (OKE) using
6-
[NVIDIA Megatron](https://developer.nvidia.com/megatron-core).
3+
This repository provides a step-by-step deployment of DeepSpeed training for Large Language Models (LLMs) on Oracle Cloud Infrastructure (OCI), using H100 GPU clusters with RDMA and SLURM.
74

8-
Reference results from NVIDIA to train Llama 3 can be found on the
9-
[NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dgxc-benchmarking/resources/llama3-dgxc-benchmarking).
10-
11-
Reviewed: 18.03.2025
5+
This setup includes a tuned DeepSpeed configuration (`tuned_ds_config.json`) that provides up to **13% performance improvement** over standard configurations.
126

7+
Reviewed: 06.06.2025
138
# When to use this asset?
149

15-
* If you want to get started with training LLM like Llama 3 on Kubernetes using OCI.
10+
Use this asset when you need to:
11+
- Train large-scale language models on OCI with H100 hardware
12+
- Utilize RDMA-enabled SLURM clusters for distributed multi-node DeepSpeed training
13+
- Achieve improved throughput via custom-tuned DeepSpeed JSON configs
1614

1715
# How to use this asset?
1816

19-
## Prerequisites
20-
21-
* You have access to an Orcale Cloud Tenancy.
22-
* You have access to shapes with NVIDIA GPUs such as H100.
23-
* You have a HuggingFace account and access to `meta-llama/Llama-3.1-8B-Instruct`.
24-
25-
This guide is loosely based on the
26-
[NVIDIA NeMo Framework Launcher guide for Kubernetes](https://docs.nvidia.com/nemo-framework/user-guide/24.07/playbooks/kubernetes.html).
27-
28-
## Infrastructure Setup
29-
30-
1. Create an OKE cluster according
31-
[to the instructions](https://github.com/oracle-quickstart/oci-hpc-oke/tree/main#instructions-for-deploying-an-oke-cluster-with-gpus-and-rdma-connectivity),
32-
importing one of the images and creating a GPU partition with BM.GPU.H100.8 nodes.
33-
34-
The configuration here assumes a minimum of 16 BM.GPU.H100.8 nodes.
35-
36-
- Ensure that the follwing setting is selected under the "OKE Cluster" section:
37-
38-
> Disable OKE GPU device plugin
39-
40-
as this tutorial will install the GPU operator later.
41-
42-
2. Create a new File System for NFS, and modify the [persistent volume configuration in `pv.yaml`](./files/pv.yaml) to match.
43-
Optimally, this will utilize High Performance Mount Targets (HMPT) as described in the following two whitepapers:
44-
* [Scale Out OCI File Storage Performance for AI/ML and
45-
Data-Intensive Workloads](https://docs.oracle.com/en-us/iaas/Content/Resources/Assets/whitepapers/scale-out-oci-file-storage-performance-for-data-intensive-workloads.pdf)
46-
* [File Storage Performance Guide](https://docs.oracle.com/en-us/iaas/Content/Resources/Assets/whitepapers/file-storage-performance-guide.pdf)
47-
48-
3. Install the NVIDIA GPU Operator according to
49-
[NVIDIA NeMo Framework Launcher guide for Kubernetes](https://docs.nvidia.com/nemo-framework/user-guide/24.07/playbooks/kubernetes.html), then install the [Volcano scheduler](https://github.com/volcano-sh/volcano) with:
50-
```sh
51-
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml
52-
```
53-
54-
4. Copy the [files in this repository](./files) to the Kubernetes operator node.
55-
You can download them from this repository via:
56-
```sh
57-
BRANCH=main
58-
curl -L https://github.com/oracle-devrel/technology-engineering/archive/refs/heads/${BRANCH}.tar.gz|tar xzf - --strip-components=6 technology-engineering-${BRANCH}/cloud-infrastructure/ai-infra-gpu/ai-infrastructure/nemo-megatron-training-oke/files
59-
```
60-
61-
Then modify the values in [`training/values.yaml`](./files/training/values.yaml) to match the storage server and export path.
62-
63-
5. Mount the file system on the Kubernetes operator node. In the following, the mount location is assumed to be `/mnt/data/`.
64-
65-
## Data Preparation and Training
66-
67-
1. Download the tokenizer model from HuggingFace:
68-
```sh
69-
mkdir -p /mnt/data/tokenizer
70-
huggingface-cli login
71-
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct tokenizer_config.json --local-dir /mnt/data/tokenizer
72-
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct tokenizer.json --local-dir /mnt/data/tokenizer
73-
```
74-
75-
2. Apply the preprocessing job that will download and tokenize parts of the Pile dataset:
76-
```sh
77-
helm install --set num_nodes=1 --set download_data=true "my-preprocessing" ./training
78-
```
79-
80-
The progress can then be monitored by
81-
```sh
82-
kubectl logs -f megatron-prep-my-preprocessing-mpimaster-0
83-
```
84-
85-
3. Following successful preprocessing, the training can be started with:
86-
```sh
87-
helm install --set num_nodes=1 "my-training-v0" ./training
88-
```
89-
90-
The progress can then be monitored by
91-
```sh
92-
kubectl logs -f megatron-train-my-training-v0-mpimaster-0
93-
```
94-
95-
4. Calculate training throughput. For this, the following data is required from the training output:
96-
```
97-
[NeMo I 2025-03-10 16:24:43 perf_metrics_utils:42] train_step_timing in s: [7.13, 7.12, 7.12, 7.13, 7.13, 7.13, 7.12, 7.13, 7.14, 7.13, 7.14, 7.26, 7.13, 7.13, 7.13, 7.13, 7.15, 7.14, 7.14, 7.13, 7.14, 7.14, 7.14, 7.14, 7.13, 7.14, 7.14, 7.14, 7.14, 7.14]
98-
```
99-
This log can be saved into a file with:
100-
```sh
101-
kubectl logs megatron-train-my-training-v0-mpimaster-0 > training.log
102-
```
103-
and the performance analyzed with
104-
```sh
105-
python3 utils/performance.py training.log
106-
```
107-
108-
## Potential Issues
109-
110-
* **PyTorch can't resolve hostnames via c10d**
111-
112-
If the rendezvous backend for PyTorch fails to connect to an OCI style
113-
hostname for Kubernetes clusters, one work around this resolution failure by
114-
augmenting `/etc/hosts` for every pod.
115-
116-
For convenience, this is facilitated by enhancing `mpi.yaml` via
117-
```sh
118-
./utils/host_list.sh >> ./training/files/mpi.yaml
119-
```
120-
and afterwards reinstalling the training job via Helm.
17+
## Prerequisites & Docs
18+
19+
### Prerequisites
20+
21+
* An OCI tenancy with H100 GPU quota (shape: BM.GPU.H100.8).
22+
* A [Huggingface](https://huggingface.co/) account with a valid Auth Token.
23+
* SSH access to the deployed head node of your SLURM cluster.
24+
25+
### Documentation & Resources
26+
27+
* [DeepSpeed Documentation](https://www.deepspeed.ai/docs/)
28+
* [TinyLlama Model (HF)](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)
29+
* [Mistral LLMs](https://mistral.ai/technology/#models)
30+
31+
## Model Training Workflow
32+
33+
### Instance Configuration
34+
35+
The deployment uses a cluster of `BM.GPU.H100.8` bare metal instances, provisioned with cluster networking and RDMA.
36+
37+
The DeepSpeed job is submitted via SLURM using the `run_deepspeed.slurm` script. The environment includes a shared OCI File Storage System mounted on all nodes.
38+
39+
### DeepSpeed Tuned Configuration
40+
41+
The `tuned_ds_config.json` applies the following optimizations:
42+
- Switched from fp16 to bf16 (optimal for H100)
43+
- Enabled overlap_comm, contiguous_gradients, and increased bucket sizes
44+
- Used gradient_accumulation_steps=8 to balance memory use and throughput
45+
- Tweaked aio settings for better I/O performance during training
46+
- Removed optimizer/parameter offloading to fully utilize GPU RA
47+
48+
These optimizations are benchmarked to deliver up to **13% faster training throughput** on OCI H100 clusters.
49+
50+
### Launch Training Job
51+
52+
Submit your training job using SLURM:
53+
54+
```bash
55+
sbatch $HOME$/scripts/run_deepspeed.slurm
56+
```
57+
58+
The job script uses:
59+
- `train.py`: your LLM training script
60+
- `tuned_ds_config.json`: DeepSpeed configuration file
61+
- Local datasets and Hugging Face model/tokenizer
62+
63+
### Example curl Test (after model fine-tuning)
64+
65+
To serve the trained model via OpenAI-compatible API:
66+
67+
```bash
68+
curl http://localhost:8000/v1/completions \
69+
-H "Content-Type: application/json" \
70+
-d '{
71+
"model": "your-model-name",
72+
"prompt": "A GPU is a",
73+
"max_tokens": 128,
74+
"temperature": 0.7
75+
}'
76+
```
77+
78+
## Notes
79+
80+
To train larger models like Mixtral or Mistral 7B on H100, make sure to:
81+
- Scale the number of nodes appropriately
82+
- Use quantization or tensor parallelism when needed
83+
- Ensure models and datasets fit into GPU memory with DeepSpeed ZeRO optimization
12184

12285
# Acknowledgments
12386

124-
- **Author** - Matthias Wolf (GPU Solution Specialist)
87+
- **Author** - Deepak Soni (GPU Black Belt)
12588

12689
# License
12790

12891
Copyright (c) 2025 Oracle and/or its affiliates.
12992

13093
Licensed under the Universal Permissive License (UPL), Version 1.0.
13194

132-
See [LICENSE](https://github.com/oracle-devrel/technology-engineering/blob/main/LICENSE) for more details.
95+
See [LICENSE](https://github.com/oracle-devrel/technology-engineering/blob/main/LICENSE) for more details.

cloud-infrastructure/ai-infra-gpu/ai-infrastructure/deepspeed-training-tuning/files/README.md

Lines changed: 31 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -4,30 +4,53 @@ This repository automates deployment of a multi-node SLURM cluster with RDMA-ena
44

55
## 🔧 Tuned Configuration
66

7+
We developed a custom-tuned deepspeed_config.json tailored for:
8+
- Multi-node training
9+
- RDMA-aware NCCL backend
10+
- H100’s bfloat16-optimized tensor cores
11+
- DeepSpeed ZeRO Stage 2 with communication overlap
12+
713
The `tuned_ds_config.json` includes:
8-
- Mixed precision (fp16) with loss scaling
9-
- ZeRO Stage 2 optimization with overlapping communication
10-
- Optimized AdamW with increased learning rate
11-
- Activation checkpointing
12-
- Gradient accumulation for batch size scaling
14+
- Switched from fp16 to bf16 (optimal for H100)
15+
- Enabled overlap_comm, contiguous_gradients, and increased bucket sizes
16+
- Used gradient_accumulation_steps=8 to balance memory use and throughput
17+
- Tweaked aio settings for better I/O performance during training
18+
- Removed optimizer/parameter offloading to fully utilize GPU RAM
19+
1320

1421
📈 This configuration delivers up to **13% more training throughput** versus default settings on OCI H100 infrastructure.
1522

23+
## With this updated configuration:
24+
- Training throughput improved by ~13%
25+
- GPU utilization increased more consistently across all 8 nodes
26+
- Communication latency reduced on RDMA fabric
27+
- No stability or memory issues observed with ZeRO Stage 2
28+
1629
## 📂 Contents
1730

1831
- `scripts/tuned_ds_config.json` – optimized DeepSpeed configuration
1932
- `scripts/run_deepspeed.slurm` – job script for SLURM
2033
- `README.md` – usage overview and tuning explanation
2134

22-
## 🚀 Usage
35+
## Usage
2336

2437
1. Deploy SLURM H100 cluster on OCI
2538
2. SSH to master node
2639
3. Submit the job:
2740

2841
```bash
29-
sbatch /mnt/deepspeed/scripts/run_deepspeed.slurm
42+
sbatch run_deepspeed.slurm
3043
```
3144

32-
Model output and logs will be written to `/mnt/deepspeed/output`.
45+
Model output and logs will be written to `$HOME/output`.
46+
47+
## Conclusion
48+
- NCCL tuning alone isn’t always sufficient — framework-level configuration (DeepSpeed) must align with hardware.
49+
- H100 GPUs benefit significantly from bfloat16 and increased comm overlap.
50+
- ZeRO Stage 2 provided a solid balance of memory efficiency and speed. ZeRO-3 is reserved for future scaling.
51+
- System-aware configuration (bucket sizes, threading, and memory layout) is essential for reaching peak performance.
3352

53+
## Next Steps
54+
- Benchmark with ZeRO Stage 3 for models approaching GPU memory limits.
55+
- Test pipeline parallelism on >16 node jobs.
56+
- Evaluate DeepSpeed 0.13+ features such as NVMe offloading and optimizer fusion on upcoming jobs.

cloud-infrastructure/ai-infra-gpu/ai-infrastructure/deepspeed-training-tuning/files/scripts/exec_torchrun.sh

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,7 @@ set -ex
44

55
source myenv/bin/activate
66

7+
## NCCL parameters configuration based on OCI H100 GPU Instance deployment
78
export NCCL_TIMEOUT=1800
89

910
export NCCL_IGNORE_CPU_AFFINITY=1
@@ -38,7 +39,7 @@ export DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node
3839

3940
torchrun $DISTRIBUTED_ARGS \
4041
train.py \
41-
--model_config tiny_llama_1.1B_config.json \
42+
--model_config tuned_ds_config.json \
4243
--tokenizer_name TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
4344
--dataset_mixer data_mixer.json \
4445
--dataset_name mix \
Lines changed: 4 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -1,18 +1,6 @@
11
#!/bin/bash
2-
#SBATCH --job-name=ds-train
3-
#SBATCH --nodes=2
4-
#SBATCH --ntasks-per-node=8
5-
#SBATCH --cpus-per-task=8
6-
#SBATCH --gres=gpu:8
7-
#SBATCH --time=06:00:00
8-
#SBATCH --output=ds_output.log
9-
#SBATCH --error=ds_error.log
102

11-
export NCCL_DEBUG=INFO
12-
export OMP_NUM_THREADS=8
13-
14-
deepspeed --num_gpus=8 /mnt/deepspeed/DeepSpeed/examples/llama_finetune.py \
15-
--deepspeed /mnt/deepspeed/scripts/tuned_ds_config.json \
16-
--model_name_or_path mistralai/Mistral-7B-Instruct-v0.1 \
17-
--train_file /mnt/deepspeed/data/train.json \
18-
--output_dir /mnt/deepspeed/output
3+
#SBATCH --nodes=4
4+
#SBATCH --job-name=deepspeed-performance-test
5+
#SBATCH --exclusive
6+
srun -l exec_torchrun.sh

0 commit comments

Comments
 (0)