Commit 83cde75

deepspeed training tuning
1 parent dc9c0fa commit 83cde75

8 files changed

+312
-0
lines changed
Binary file not shown.
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
Copyright (c) 2025 Oracle and/or its affiliates.

The Universal Permissive License (UPL), Version 1.0

Subject to the condition set forth below, permission is hereby granted to any
person obtaining a copy of this software, associated documentation and/or data
(collectively the "Software"), free of charge and under any and all copyright
rights in the Software, and any and all patent rights owned or freely
licensable by each licensor hereunder covering either (i) the unmodified
Software as contributed to or provided by such licensor, or (ii) the Larger
Works (as defined below), to deal in both

(a) the Software, and
(b) any piece of software and/or hardware listed in the lrgrwrks.txt file if
one is included with the Software (each a "Larger Work" to which the Software
is contributed by such licensors),

without restriction, including without limitation the rights to copy, create
derivative works of, display, perform, and distribute the Software and make,
use, sell, offer for sale, import, export, have made, and have sold the
Software and the Larger Work(s), and to sublicense the foregoing rights on
either these or other terms.

This license is subject to the following condition:
The above copyright notice and either this complete permission notice or at
a minimum a reference to the UPL must be included in all copies or
substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
# Training LLMs with NVIDIA NeMo using Oracle Container Engine for Kubernetes

This repository demonstrates how to train LLMs using
[NVIDIA NeMo](https://www.nvidia.com/en-gb/ai-data-science/products/nemo/)
on the Oracle Container Engine for Kubernetes (OKE) using
[NVIDIA Megatron](https://developer.nvidia.com/megatron-core).

NVIDIA's reference results for training Llama 3 can be found on the
[NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dgxc-benchmarking/resources/llama3-dgxc-benchmarking).

Reviewed: 18.03.2025

# When to use this asset?

* If you want to get started with training LLMs such as Llama 3 on Kubernetes using OCI.

# How to use this asset?

## Prerequisites

* You have access to an Oracle Cloud Tenancy.
* You have access to shapes with NVIDIA GPUs such as H100.
* You have a HuggingFace account and access to `meta-llama/Llama-3.1-8B-Instruct`.

This guide is loosely based on the
[NVIDIA NeMo Framework Launcher guide for Kubernetes](https://docs.nvidia.com/nemo-framework/user-guide/24.07/playbooks/kubernetes.html).

## Infrastructure Setup

1. Create an OKE cluster according
   [to the instructions](https://github.com/oracle-quickstart/oci-hpc-oke/tree/main#instructions-for-deploying-an-oke-cluster-with-gpus-and-rdma-connectivity),
   importing one of the images and creating a GPU partition with BM.GPU.H100.8 nodes.

   The configuration here assumes a minimum of 16 BM.GPU.H100.8 nodes.

   - Ensure that the following setting is selected under the "OKE Cluster" section:

     > Disable OKE GPU device plugin

     as this tutorial will install the GPU operator later.

2. Create a new File System for NFS, and modify the [persistent volume configuration in `pv.yaml`](./files/pv.yaml) to match.
   Optimally, this will utilize High Performance Mount Targets (HPMT) as described in the following two whitepapers:
   * [Scale Out OCI File Storage Performance for AI/ML and Data-Intensive Workloads](https://docs.oracle.com/en-us/iaas/Content/Resources/Assets/whitepapers/scale-out-oci-file-storage-performance-for-data-intensive-workloads.pdf)
   * [File Storage Performance Guide](https://docs.oracle.com/en-us/iaas/Content/Resources/Assets/whitepapers/file-storage-performance-guide.pdf)

3. Install the NVIDIA GPU Operator according to the
   [NVIDIA NeMo Framework Launcher guide for Kubernetes](https://docs.nvidia.com/nemo-framework/user-guide/24.07/playbooks/kubernetes.html), then install the [Volcano scheduler](https://github.com/volcano-sh/volcano) with:
   ```sh
   kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml
   ```

4. Copy the [files in this repository](./files) to the Kubernetes operator node.
   You can download them from this repository via:
   ```sh
   BRANCH=main
   curl -L https://github.com/oracle-devrel/technology-engineering/archive/refs/heads/${BRANCH}.tar.gz|tar xzf - --strip-components=6 technology-engineering-${BRANCH}/cloud-infrastructure/ai-infra-gpu/ai-infrastructure/nemo-megatron-training-oke/files
   ```

   Then modify the values in [`training/values.yaml`](./files/training/values.yaml) to match the storage server and export path.

5. Mount the file system on the Kubernetes operator node. In the following, the mount location is assumed to be `/mnt/data/`.

## Data Preparation and Training

1. Download the tokenizer model from HuggingFace:
   ```sh
   mkdir -p /mnt/data/tokenizer
   huggingface-cli login
   huggingface-cli download meta-llama/Llama-3.1-8B-Instruct tokenizer_config.json --local-dir /mnt/data/tokenizer
   huggingface-cli download meta-llama/Llama-3.1-8B-Instruct tokenizer.json --local-dir /mnt/data/tokenizer
   ```

2. Apply the preprocessing job that will download and tokenize parts of the Pile dataset:
   ```sh
   helm install --set num_nodes=1 --set download_data=true "my-preprocessing" ./training
   ```

   Progress can then be monitored with:
   ```sh
   kubectl logs -f megatron-prep-my-preprocessing-mpimaster-0
   ```

3. Once preprocessing has completed successfully, start the training with:
   ```sh
   helm install --set num_nodes=1 "my-training-v0" ./training
   ```

   Progress can then be monitored with:
   ```sh
   kubectl logs -f megatron-train-my-training-v0-mpimaster-0
   ```

4. Calculate the training throughput. This requires the step timings that NeMo prints in the training output:
   ```
   [NeMo I 2025-03-10 16:24:43 perf_metrics_utils:42] train_step_timing in s: [7.13, 7.12, 7.12, 7.13, 7.13, 7.13, 7.12, 7.13, 7.14, 7.13, 7.14, 7.26, 7.13, 7.13, 7.13, 7.13, 7.15, 7.14, 7.14, 7.13, 7.14, 7.14, 7.14, 7.14, 7.13, 7.14, 7.14, 7.14, 7.14, 7.14]
   ```
   The log can be saved to a file with:
   ```sh
   kubectl logs megatron-train-my-training-v0-mpimaster-0 > training.log
   ```
   and the performance analyzed with:
   ```sh
   python3 utils/performance.py training.log
   ```
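
The contents of `utils/performance.py` are not part of this README, so the following is only a minimal sketch of how a throughput figure could be derived from a `train_step_timing` line. The function names are hypothetical, the sample line is a shortened copy of the log excerpt above, and the `global_batch_size` and `sequence_length` values are placeholders that have to be replaced with the settings of the actual run.

```python
import ast
import re
import statistics

# Matches a NeMo "train_step_timing" log line and captures the bracketed list of step times.
TIMING_RE = re.compile(r"train_step_timing in s: (\[[^\]]*\])")


def step_timings(log_text):
    """Return the last train_step_timing list found in the log text as floats."""
    matches = TIMING_RE.findall(log_text)
    if not matches:
        raise ValueError("no train_step_timing line found")
    return [float(t) for t in ast.literal_eval(matches[-1])]


def tokens_per_second(timings, global_batch_size, sequence_length):
    """Tokens processed per second, based on the mean time per training step."""
    return global_batch_size * sequence_length / statistics.mean(timings)


if __name__ == "__main__":
    # Shortened sample of the log line shown in step 4.
    sample = (
        "[NeMo I 2025-03-10 16:24:43 perf_metrics_utils:42] "
        "train_step_timing in s: [7.13, 7.12, 7.14]"
    )
    timings = step_timings(sample)
    print(f"mean step time: {statistics.mean(timings):.2f} s")
    # Placeholder values, not taken from this repository's configuration.
    rate = tokens_per_second(timings, global_batch_size=128, sequence_length=8192)
    print(f"tokens/s: {rate:,.0f}")
```

Using the last matching line rather than the first is a deliberate choice in this sketch: early steps often include warm-up effects, so later timing lists tend to be more representative.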

## Potential Issues

* **PyTorch can't resolve hostnames via c10d**

  If the rendezvous backend for PyTorch fails to connect to an OCI style
  hostname for Kubernetes clusters, one can work around this resolution failure by
  augmenting `/etc/hosts` for every pod.

  For convenience, this is facilitated by enhancing `mpi.yaml` via
  ```sh
  ./utils/host_list.sh >> ./training/files/mpi.yaml
  ```
  and afterwards reinstalling the training job via Helm.

# Acknowledgments

- **Author** - Matthias Wolf (GPU Solution Specialist)

# License

Copyright (c) 2025 Oracle and/or its affiliates.

Licensed under the Universal Permissive License (UPL), Version 1.0.

See [LICENSE](https://github.com/oracle-devrel/technology-engineering/blob/main/LICENSE) for more details.
Binary file not shown.
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
# DeepSpeed LLM Training on OCI H100 SLURM Cluster

This repository automates deployment of a multi-node SLURM cluster with RDMA-enabled H100 GPUs on OCI for training large language models using DeepSpeed.

## 🔧 Tuned Configuration

The `tuned_ds_config.json` includes:
- Mixed precision (fp16) with dynamic loss scaling
- ZeRO Stage 2 optimization with overlapping communication
- AdamW optimizer with an increased learning rate (`2e-4`)
- Activation checkpointing
- Gradient accumulation for batch size scaling

📈 This configuration delivers up to **13% more training throughput** versus default settings on OCI H100 infrastructure.
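
DeepSpeed expects `train_batch_size` to equal `train_micro_batch_size_per_gpu` × `gradient_accumulation_steps` × the number of GPUs. As a rough sanity check, the sketch below works out which GPU count the batch settings in `tuned_ds_config.json` are consistent with. The helper names are hypothetical, and only the three batch-related values from the config are reproduced.

```python
def expected_train_batch_size(micro_batch_per_gpu, grad_accum_steps, world_size):
    """Global batch size implied by the per-GPU batch, accumulation steps and GPU count."""
    return micro_batch_per_gpu * grad_accum_steps * world_size


def consistent_world_size(config):
    """Return the GPU count for which the config's batch settings agree, or None."""
    per_gpu = config["train_micro_batch_size_per_gpu"] * config["gradient_accumulation_steps"]
    if config["train_batch_size"] % per_gpu:
        return None
    return config["train_batch_size"] // per_gpu


# Batch-related values copied from tuned_ds_config.json.
config = {
    "train_batch_size": 128,
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 2,
}

print(consistent_world_size(config))             # 8 GPUs, i.e. a single 8-GPU node
print(expected_train_batch_size(8, 2, 16))       # 256 would be needed for two 8-GPU nodes
```

With these values the configuration matches a single node with 8 GPUs. Running on more GPUs would require adjusting `train_batch_size` (or the other two values) so that the product still matches.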

## 📂 Contents

- `scripts/tuned_ds_config.json` – optimized DeepSpeed configuration
- `scripts/run_deepspeed.slurm` – job script for SLURM
- `README.md` – usage overview and tuning explanation

## 🚀 Usage

1. Deploy a SLURM H100 cluster on OCI
2. SSH to the master node
3. Submit the job:

```bash
sbatch /mnt/deepspeed/scripts/run_deepspeed.slurm
```

Model output and logs will be written to `/mnt/deepspeed/output`.
Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
#!/bin/bash

set -ex

source myenv/bin/activate

export NCCL_TIMEOUT=1800

export NCCL_IGNORE_CPU_AFFINITY=1
# The Open MPI collective component is named "hcoll".
export OMPI_MCA_coll_hcoll_enable=0
export NCCL_CROSS_NIC=2
export NCCL_SOCKET_NTHREADS=16
# Valid NCCL_DEBUG levels are VERSION, WARN, INFO and TRACE.
export NCCL_DEBUG=INFO
export NCCL_CUMEM_ENABLE=0
export NCCL_IB_SPLIT_DATA_ON_QPS=0
export NCCL_IB_QPS_PER_CONNECTION=16
export NCCL_IB_GID_INDEX=3
export NCCL_IB_HCA="mlx5_0,mlx5_1,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8,mlx5_9,mlx5_10,mlx5_12,mlx5_13,mlx5_14,mlx5_15,mlx5_16,mlx5_17"
export NCCL_IB_TC=41
export NCCL_IB_SL=0
export NCCL_IB_TIMEOUT=22
export HCOLL_ENABLE_MCAST_ALL=0
export UCX_TLS=tcp
export UCX_NET_DEVICES=eth0
export RX_QUEUE_LEN=8192
export NCCL_SOCKET_IFNAME=eth0

export OMP_NUM_THREADS=16 # ideally: number of CPU cores / number of GPUs per node

export GPUS_PER_NODE=8
# Use the first node of the allocation as the rendezvous host.
MASTER_NODE=$(scontrol show hostname | head -n 1)
export MASTER_ADDR=$(scontrol show node=$MASTER_NODE | awk -F= '/NodeAddr=/{print $2}' | awk '{print $1}')
# Assumes one Slurm task per node, so that SLURM_NTASKS equals the node count.
export NNODES=$SLURM_NTASKS
export NODE_RANK=$SLURM_NODEID
export MASTER_PORT=9001
export WORLD_SIZE_JOB=$SLURM_NTASKS
export DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT "

# DISTRIBUTED_ARGS is intentionally unquoted so that it splits into separate arguments.
torchrun $DISTRIBUTED_ARGS \
    train.py \
    --model_config tiny_llama_1.1B_config.json \
    --tokenizer_name TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --dataset_mixer data_mixer.json \
    --dataset_name mix \
    --dataset_type local \
    --dataset_packed \
    --batch_size 12 \
    --gradient_checkpointing \
    --max_train_steps 1000000 \
    --val_after_steps 10000 \
    --num_warmup_steps 10000 \
    --learning_rate 1e-4 \
    --num_gpus_node $GPUS_PER_NODE \
    --gradient_clipping 1 \
    --gradient_accumulation_steps 2 \
    --dataset_cache "./hf-cache"
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
#!/bin/bash
#SBATCH --job-name=ds-train
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:8
#SBATCH --time=06:00:00
#SBATCH --output=ds_output.log
#SBATCH --error=ds_error.log

export NCCL_DEBUG=INFO
export OMP_NUM_THREADS=8

# Note: without a hostfile, the deepspeed launcher starts workers only on the
# node running this script, so only 8 of the 16 allocated GPUs are used.
deepspeed --num_gpus=8 /mnt/deepspeed/DeepSpeed/examples/llama_finetune.py \
    --deepspeed /mnt/deepspeed/scripts/tuned_ds_config.json \
    --model_name_or_path mistralai/Mistral-7B-Instruct-v0.1 \
    --train_file /mnt/deepspeed/data/train.json \
    --output_dir /mnt/deepspeed/output
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
{
  "train_batch_size": 128,
  "train_micro_batch_size_per_gpu": 8,
  "gradient_accumulation_steps": 2,
  "steps_per_print": 20,
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 2e-4,
      "betas": [0.9, 0.999],
      "eps": 1e-8,
      "weight_decay": 1e-2
    }
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "contiguous_memory_optimization": true
  },
  "wall_clock_breakdown": true
}