deepspeed training tuning

deepakbsoni · deepakbsoni · commit 83cde75ff5bb · 2025-06-06T12:27:48.000+02:00
diff --git a/cloud-infrastructure/ai-infra-gpu/ai-infrastructure/deepspeed-training-tuning/.DS_Store b/cloud-infrastructure/ai-infra-gpu/ai-infrastructure/deepspeed-training-tuning/.DS_Store
diff --git a/cloud-infrastructure/ai-infra-gpu/ai-infrastructure/deepspeed-training-tuning/LICENSE b/cloud-infrastructure/ai-infra-gpu/ai-infrastructure/deepspeed-training-tuning/LICENSE
@@ -0,0 +1,35 @@
+Copyright (c) 2025 Oracle and/or its affiliates.
+
+The Universal Permissive License (UPL), Version 1.0
+
+Subject to the condition set forth below, permission is hereby granted to any
+person obtaining a copy of this software, associated documentation and/or data
+(collectively the "Software"), free of charge and under any and all copyright
+rights in the Software, and any and all patent rights owned or freely
+licensable by each licensor hereunder covering either (i) the unmodified
+Software as contributed to or provided by such licensor, or (ii) the Larger
+Works (as defined below), to deal in both
+
+(a) the Software, and
+(b) any piece of software and/or hardware listed in the lrgrwrks.txt file if
+one is included with the Software (each a "Larger Work" to which the Software
+is contributed by such licensors),
+
+without restriction, including without limitation the rights to copy, create
+derivative works of, display, perform, and distribute the Software and make,
+use, sell, offer for sale, import, export, have made, and have sold the
+Software and the Larger Work(s), and to sublicense the foregoing rights on
+either these or other terms.
+
+This license is subject to the following condition:
+The above copyright notice and either this complete permission notice or at
+a minimum a reference to the UPL must be included in all copies or
+substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/cloud-infrastructure/ai-infra-gpu/ai-infrastructure/deepspeed-training-tuning/README.md b/cloud-infrastructure/ai-infra-gpu/ai-infrastructure/deepspeed-training-tuning/README.md
@@ -0,0 +1,132 @@
+# Training LLMs with NVIDIA NeMo using Oracle Container Engine for Kubernetes
+
+This repository demonstrates how to train LLM using
+[NVIDIA NeMo](https://www.nvidia.com/en-gb/ai-data-science/products/nemo/)
+on the Oracle Container Engine for Kubernetes (OKE) using
+[NVIDIA Megatron](https://developer.nvidia.com/megatron-core).
+
+Reference results from NVIDIA to train Llama 3 can be found on the
+[NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dgxc-benchmarking/resources/llama3-dgxc-benchmarking).
+
+Reviewed: 18.03.2025
+
+# When to use this asset?
+
+* If you want to get started with training LLM like Llama 3 on Kubernetes using OCI.
+
+# How to use this asset?
+
+## Prerequisites
+
+* You have access to an Orcale Cloud Tenancy.
+* You have access to shapes with NVIDIA GPUs such as H100.
+* You have a HuggingFace account and access to `meta-llama/Llama-3.1-8B-Instruct`.
+
+This guide is loosely based on the
+[NVIDIA NeMo Framework Launcher guide for Kubernetes](https://docs.nvidia.com/nemo-framework/user-guide/24.07/playbooks/kubernetes.html).
+
+## Infrastructure Setup
+
+1. Create an OKE cluster according
+   [to the instructions](https://github.com/oracle-quickstart/oci-hpc-oke/tree/main#instructions-for-deploying-an-oke-cluster-with-gpus-and-rdma-connectivity),
+   importing one of the images and creating a GPU partition with BM.GPU.H100.8 nodes.
+
+   The configuration here assumes a minimum of 16 BM.GPU.H100.8 nodes.
+
+   - Ensure that the follwing setting is selected under the "OKE Cluster" section:
+
+     > Disable OKE GPU device plugin
+
+     as this tutorial will install the GPU operator later.
+
+2. Create a new File System for NFS, and modify the [persistent volume configuration in `pv.yaml`](./files/pv.yaml) to match.
+   Optimally, this will utilize High Performance Mount Targets (HMPT) as described in the following two whitepapers:
+   * [Scale Out OCI File Storage Performance for AI/ML and
+Data-Intensive Workloads](https://docs.oracle.com/en-us/iaas/Content/Resources/Assets/whitepapers/scale-out-oci-file-storage-performance-for-data-intensive-workloads.pdf)
+   * [File Storage Performance Guide](https://docs.oracle.com/en-us/iaas/Content/Resources/Assets/whitepapers/file-storage-performance-guide.pdf)
+
+3. Install the NVIDIA GPU Operator according to
+   [NVIDIA NeMo Framework Launcher guide for Kubernetes](https://docs.nvidia.com/nemo-framework/user-guide/24.07/playbooks/kubernetes.html), then install the [Volcano scheduler](https://github.com/volcano-sh/volcano) with:
+   ```sh
+   kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml
+   ```
+
+4. Copy the [files in this repository](./files) to the Kubernetes operator node.
+   You can download them from this repository via:
+   ```sh
+   BRANCH=main
+   curl -L https://github.com/oracle-devrel/technology-engineering/archive/refs/heads/${BRANCH}.tar.gz|tar xzf - --strip-components=6 technology-engineering-${BRANCH}/cloud-infrastructure/ai-infra-gpu/ai-infrastructure/nemo-megatron-training-oke/files
+   ```
+   
+   Then modify the values in [`training/values.yaml`](./files/training/values.yaml) to match the storage server and export path.
+
+5. Mount the file system on the Kubernetes operator node. In the following, the mount location is assumed to be `/mnt/data/`.
+
+## Data Preparation and Training
+
+1. Download the tokenizer model from HuggingFace:
+   ```sh
+   mkdir -p /mnt/data/tokenizer
+   huggingface-cli login
+   huggingface-cli download meta-llama/Llama-3.1-8B-Instruct tokenizer_config.json --local-dir /mnt/data/tokenizer
+   huggingface-cli download meta-llama/Llama-3.1-8B-Instruct tokenizer.json --local-dir /mnt/data/tokenizer
+   ```
+
+2. Apply the preprocessing job that will download and tokenize parts of the Pile dataset:
+   ```sh
+   helm install --set num_nodes=1 --set download_data=true "my-preprocessing" ./training
+   ```
+
+   The progress can then be monitored by
+   ```sh
+   kubectl logs -f megatron-prep-my-preprocessing-mpimaster-0
+   ```
+
+3. Following successful preprocessing, the training can be started with:
+   ```sh
+   helm install --set num_nodes=1 "my-training-v0" ./training
+   ```
+
+   The progress can then be monitored by
+   ```sh
+   kubectl logs -f megatron-train-my-training-v0-mpimaster-0
+   ```
+
+4. Calculate training throughput. For this, the following data is required from the training output:
+   ```
+   [NeMo I 2025-03-10 16:24:43 perf_metrics_utils:42] train_step_timing in s: [7.13, 7.12, 7.12, 7.13, 7.13, 7.13, 7.12, 7.13, 7.14, 7.13, 7.14, 7.26, 7.13, 7.13, 7.13, 7.13, 7.15, 7.14, 7.14, 7.13, 7.14, 7.14, 7.14, 7.14, 7.13, 7.14, 7.14, 7.14, 7.14, 7.14]
+   ```
+   This log can be saved into a file with:
+   ```sh
+   kubectl logs  megatron-train-my-training-v0-mpimaster-0 > training.log
+   ```
+   and the performance analyzed with
+   ```sh
+   python3 utils/performance.py training.log
+   ```
+
+## Potential Issues
+
+* **PyTorch can't resolve hostnames via c10d**
+
+  If the rendezvous backend for PyTorch fails to connect to an OCI style
+  hostname for Kubernetes clusters, one work around this resolution failure by
+  augmenting `/etc/hosts` for every pod.
+
+  For convenience, this is facilitated by enhancing `mpi.yaml` via
+  ```sh
+  ./utils/host_list.sh >> ./training/files/mpi.yaml
+  ```
+  and afterwards reinstalling the training job via Helm.
+
+# Acknowledgments
+
+- **Author** - Matthias Wolf (GPU Solution Specialist)
+
+# License
+ 
+Copyright (c) 2025 Oracle and/or its affiliates.
+ 
+Licensed under the Universal Permissive License (UPL), Version 1.0.
+ 
+See [LICENSE](https://github.com/oracle-devrel/technology-engineering/blob/main/LICENSE) for more details.
diff --git a/cloud-infrastructure/ai-infra-gpu/ai-infrastructure/deepspeed-training-tuning/files/.DS_Store b/cloud-infrastructure/ai-infra-gpu/ai-infrastructure/deepspeed-training-tuning/files/.DS_Store
diff --git a/cloud-infrastructure/ai-infra-gpu/ai-infrastructure/deepspeed-training-tuning/files/README.md b/cloud-infrastructure/ai-infra-gpu/ai-infrastructure/deepspeed-training-tuning/files/README.md
@@ -0,0 +1,33 @@
+# DeepSpeed LLM Training on OCI H100 SLURM Cluster
+
+This repository automates deployment of a multi-node SLURM cluster with RDMA-enabled H100 GPUs on OCI for training large language models using DeepSpeed.
+
+## 🔧 Tuned Configuration
+
+The `tuned_ds_config.json` includes:
+- Mixed precision (fp16) with loss scaling
+- ZeRO Stage 2 optimization with overlapping communication
+- Optimized AdamW with increased learning rate
+- Activation checkpointing
+- Gradient accumulation for batch size scaling
+
+📈 This configuration delivers up to **13% more training throughput** versus default settings on OCI H100 infrastructure.
+
+## 📂 Contents
+
+- `scripts/tuned_ds_config.json` – optimized DeepSpeed configuration
+- `scripts/run_deepspeed.slurm` – job script for SLURM
+- `README.md` – usage overview and tuning explanation
+
+## 🚀 Usage
+
+1. Deploy SLURM H100 cluster on OCI
+2. SSH to master node
+3. Submit the job:
+
+```bash
+sbatch /mnt/deepspeed/scripts/run_deepspeed.slurm
+```
+
+Model output and logs will be written to `/mnt/deepspeed/output`.
+
diff --git a/cloud-infrastructure/ai-infra-gpu/ai-infrastructure/deepspeed-training-tuning/files/scripts/exec_torchrun.sh b/cloud-infrastructure/ai-infra-gpu/ai-infrastructure/deepspeed-training-tuning/files/scripts/exec_torchrun.sh
@@ -0,0 +1,57 @@
+#!/bin/bash
+
+set -ex
+
+source myenv/bin/activate
+
+export NCCL_TIMEOUT=1800
+
+export NCCL_IGNORE_CPU_AFFINITY=1
+export OMPI_MCA_coll_hcol_enable=0
+export NCCL_CROSS_NIC=2
+export NCCL_SOCKET_NTHREADS=16
+export NCCL_DEBUG=DEBUG
+export NCCL_CUMEM_ENABLE=0
+export NCCL_IB_SPLIT_DATA_ON_QPS=0
+export NCCL_IB_QPS_PER_CONNECTION=16
+export NCCL_IB_GID_INDEX=3
+export NCCL_IB_HCA="mlx5_0,mlx5_1,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8,mlx5_9,mlx5_10,mlx5_12,mlx5_13,mlx5_14,mlx5_15,mlx5_16,mlx5_17"
+export NCCL_IB_TC=41
+export NCCL_IB_SL=0
+export NCCL_IB_TIMEOUT=22
+export HCOLL_ENABLE_MCAST_ALL=0
+export UCX_TLS=tcp
+export UCX_NET_DEVICES=eth0
+export RX_QUEUE_LEN=8192
+export NCCL_SOCKET_IFNAME=eth0 
+
+export OMP_NUM_THREADS=16  # should be optimally number of CPU cores / number of GPUs per node
+
+export GPUS_PER_NODE=8
+MASTER_NODE=$(scontrol show hostname | head -n 1)
+export MASTER_ADDR=$(scontrol show node=$MASTER_NODE | awk -F= '/NodeAddr=/{print $2}' | awk '{print $1}')
+export NNODES=$SLURM_NTASKS
+export NODE_RANK=$SLURM_NODEID
+export MASTER_PORT=9001
+export WORLD_SIZE_JOB=$SLURM_NTASKS
+export DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT "
+
+torchrun $DISTRIBUTED_ARGS \
+	train.py \
+	--model_config tiny_llama_1.1B_config.json \
+	--tokenizer_name TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
+	--dataset_mixer data_mixer.json \
+	--dataset_name mix \
+	--dataset_type local \
+	--dataset_packed \
+	--batch_size 12 \
+	--gradient_checkpointing \
+	--max_train_steps 1000000 \
+	--val_after_steps 10000 \
+	--num_warmup_steps 10000 \
+	--learning_rate 1e-4 \
+	--num_gpus_node $GPUS_PER_NODE \
+	--gradient_clipping 1 \
+	--gradient_accumulation_steps 2 \
+	--dataset_cache "./hf-cache"
+
diff --git a/cloud-infrastructure/ai-infra-gpu/ai-infrastructure/deepspeed-training-tuning/files/scripts/run_deepspeed.slurm b/cloud-infrastructure/ai-infra-gpu/ai-infrastructure/deepspeed-training-tuning/files/scripts/run_deepspeed.slurm
@@ -0,0 +1,18 @@
+#!/bin/bash
+#SBATCH --job-name=ds-train
+#SBATCH --nodes=2
+#SBATCH --ntasks-per-node=8
+#SBATCH --cpus-per-task=8
+#SBATCH --gres=gpu:8
+#SBATCH --time=06:00:00
+#SBATCH --output=ds_output.log
+#SBATCH --error=ds_error.log
+
+export NCCL_DEBUG=INFO
+export OMP_NUM_THREADS=8
+
+deepspeed --num_gpus=8 /mnt/deepspeed/DeepSpeed/examples/llama_finetune.py \
+  --deepspeed /mnt/deepspeed/scripts/tuned_ds_config.json \
+  --model_name_or_path mistralai/Mistral-7B-Instruct-v0.1 \
+  --train_file /mnt/deepspeed/data/train.json \
+  --output_dir /mnt/deepspeed/output
diff --git a/cloud-infrastructure/ai-infra-gpu/ai-infrastructure/deepspeed-training-tuning/files/scripts/tuned_ds_config.json b/cloud-infrastructure/ai-infra-gpu/ai-infrastructure/deepspeed-training-tuning/files/scripts/tuned_ds_config.json
@@ -0,0 +1,37 @@
+{
+  "train_batch_size": 128,
+  "train_micro_batch_size_per_gpu": 8,
+  "gradient_accumulation_steps": 2,
+  "steps_per_print": 20,
+  "optimizer": {
+    "type": "AdamW",
+    "params": {
+      "lr": 2e-4,
+      "betas": [0.9, 0.999],
+      "eps": 1e-8,
+      "weight_decay": 1e-2
+    }
+  },
+  "fp16": {
+    "enabled": true,
+    "loss_scale": 0,
+    "initial_scale_power": 16,
+    "hysteresis": 2,
+    "min_loss_scale": 1
+  },
+  "gradient_clipping": 1.0,
+  "zero_optimization": {
+    "stage": 2,
+    "allgather_partitions": true,
+    "allgather_bucket_size": 5e8,
+    "overlap_comm": true,
+    "reduce_scatter": true,
+    "reduce_bucket_size": 5e8,
+    "contiguous_gradients": true
+  },
+  "activation_checkpointing": {
+    "partition_activations": true,
+    "contiguous_memory_optimization": true
+  },
+  "wall_clock_breakdown": true
+}