Create reusable assets to train Llama 3 on OKE. #1627
Merged

Changes from all commits (8 commits)
- b08a6ad: Create reusable assets to train Llama 2 on OKE. (matz-e)
- 6d19e72: Fix review date and first shape mention. (matz-e)
- 026b217: Reorganize as a Helm chart. (matz-e)
- 506993c: Add copyright lines. (matz-e)
- 9e90718: Add PVC. (matz-e)
- 55259d6: Increase storage for containers. (matz-e)
- c70d5cc: Small updates to README. (matz-e)
- 9877228: Merge branch 'main' into nemo-megatron-training-on-k8s (AlexanderHodicke)
35 changes: 35 additions & 0 deletions
cloud-infrastructure/ai-infra-gpu/ai-infrastructure/nemo-megatron-training-oke/LICENSE

Copyright (c) 2025 Oracle and/or its affiliates.

The Universal Permissive License (UPL), Version 1.0

Subject to the condition set forth below, permission is hereby granted to any
person obtaining a copy of this software, associated documentation and/or data
(collectively the "Software"), free of charge and under any and all copyright
rights in the Software, and any and all patent rights owned or freely
licensable by each licensor hereunder covering either (i) the unmodified
Software as contributed to or provided by such licensor, or (ii) the Larger
Works (as defined below), to deal in both

(a) the Software, and
(b) any piece of software and/or hardware listed in the lrgrwrks.txt file if
one is included with the Software (each a "Larger Work" to which the Software
is contributed by such licensors),

without restriction, including without limitation the rights to copy, create
derivative works of, display, perform, and distribute the Software and make,
use, sell, offer for sale, import, export, have made, and have sold the
Software and the Larger Work(s), and to sublicense the foregoing rights on
either these or other terms.

This license is subject to the following condition:
The above copyright notice and either this complete permission notice or at
a minimum a reference to the UPL must be included in all copies or
substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
132 changes: 132 additions & 0 deletions
cloud-infrastructure/ai-infra-gpu/ai-infrastructure/nemo-megatron-training-oke/README.md
# Training LLMs with NVIDIA NeMo using Oracle Container Engine for Kubernetes

This repository demonstrates how to train LLMs using
[NVIDIA NeMo](https://www.nvidia.com/en-gb/ai-data-science/products/nemo/)
on Oracle Container Engine for Kubernetes (OKE) using
[NVIDIA Megatron](https://developer.nvidia.com/megatron-core).

Reference results from NVIDIA for training Llama 3 can be found on the
[NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dgxc-benchmarking/resources/llama3-dgxc-benchmarking).

Reviewed: 18.03.2025

# When to use this asset?

* If you want to get started with training LLMs like Llama 3 on Kubernetes using OCI.

# How to use this asset?

## Prerequisites

* You have access to an Oracle Cloud tenancy.
* You have access to shapes with NVIDIA GPUs such as H100.
* You have a HuggingFace account and access to `meta-llama/Llama-3.1-8B-Instruct`.

This guide is loosely based on the
[NVIDIA NeMo Framework Launcher guide for Kubernetes](https://docs.nvidia.com/nemo-framework/user-guide/24.07/playbooks/kubernetes.html).

## Infrastructure Setup

1. Create an OKE cluster according
   [to the instructions](https://github.com/oracle-quickstart/oci-hpc-oke/tree/main#instructions-for-deploying-an-oke-cluster-with-gpus-and-rdma-connectivity),
   importing one of the images and creating a GPU partition with BM.GPU.H100.8 nodes.

   The configuration here assumes a minimum of 16 BM.GPU.H100.8 nodes.

   - Ensure that the following setting is selected under the "OKE Cluster" section:

     > Disable OKE GPU device plugin

     as this tutorial will install the GPU operator later.

2. Create a new File System for NFS, and modify the [persistent volume configuration in `pv.yaml`](./files/pv.yaml) to match (a minimal sketch is shown below).
   Ideally, this will use High Performance Mount Targets (HPMT) as described in the following two whitepapers:
   * [Scale Out OCI File Storage Performance for AI/ML and Data-Intensive Workloads](https://docs.oracle.com/en-us/iaas/Content/Resources/Assets/whitepapers/scale-out-oci-file-storage-performance-for-data-intensive-workloads.pdf)
   * [File Storage Performance Guide](https://docs.oracle.com/en-us/iaas/Content/Resources/Assets/whitepapers/file-storage-performance-guide.pdf)

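   A minimal sketch of such an NFS-backed persistent volume follows; the server IP and export path are placeholders, and the `pv.yaml` shipped in [./files](./files/pv.yaml) remains the authoritative definition:

   ```yaml
   # Sketch only: replace the server IP and export path with the values of
   # your own mount target and file system export.
   apiVersion: v1
   kind: PersistentVolume
   metadata:
     name: nfs-data
   spec:
     capacity:
       storage: 10Ti
     accessModes:
       - ReadWriteMany
     persistentVolumeReclaimPolicy: Retain
     nfs:
       server: 10.0.0.10   # placeholder: FSS mount target IP
       path: /data         # placeholder: export path
   ```
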
3. Install the NVIDIA GPU Operator according to the
   [NVIDIA NeMo Framework Launcher guide for Kubernetes](https://docs.nvidia.com/nemo-framework/user-guide/24.07/playbooks/kubernetes.html)
   (a minimal installation sketch is shown below), then install the [Volcano scheduler](https://github.com/volcano-sh/volcano) with:
   ```sh
   kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml
   ```

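   As a sketch, the GPU operator installation typically boils down to the following Helm commands; refer to the NVIDIA guide above for the recommended version and options:

   ```sh
   # Typical GPU operator installation via Helm (versions and options per the
   # NVIDIA guide).
   helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
   helm repo update
   helm install --wait gpu-operator nvidia/gpu-operator \
     --namespace gpu-operator --create-namespace
   ```
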
4. Copy the [files in this repository](./files) to the Kubernetes operator node.
   You can download them from this repository via:
   ```sh
   BRANCH=main
   curl -L https://github.com/oracle-devrel/technology-engineering/archive/refs/heads/${BRANCH}.tar.gz | tar xzf - --strip-components=6 technology-engineering-${BRANCH}/cloud-infrastructure/ai-infra-gpu/ai-infrastructure/nemo-megatron-training-oke/files
   ```

   Then modify the values in [`training/values.yaml`](./files/training/values.yaml) to match the storage server and export path; an illustrative sketch of these settings is shown below.

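   The authoritative key names are those defined in [`training/values.yaml`](./files/training/values.yaml); purely as an illustration, the storage-related settings take a form along these lines (key names here are hypothetical):

   ```yaml
   # Illustration only: key names are hypothetical, check training/values.yaml
   # for the actual structure and values.
   storage:
     server: 10.0.0.10    # placeholder: FSS mount target IP
     exportPath: /data    # placeholder: export path of the file system
   ```
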
5. Mount the file system on the Kubernetes operator node. In the following, the mount location is assumed to be `/mnt/data/`.

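   A minimal sketch of such a mount, assuming a placeholder mount target IP and export path:

   ```sh
   # Placeholder IP and export path: use the values of your own mount target.
   sudo mkdir -p /mnt/data
   sudo mount -t nfs -o vers=3 10.0.0.10:/data /mnt/data
   ```
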
## Data Preparation and Training

1. Download the tokenizer model from HuggingFace:
   ```sh
   mkdir -p /mnt/data/tokenizer
   huggingface-cli login
   huggingface-cli download meta-llama/Llama-3.1-8B-Instruct tokenizer_config.json --local-dir /mnt/data/tokenizer
   huggingface-cli download meta-llama/Llama-3.1-8B-Instruct tokenizer.json --local-dir /mnt/data/tokenizer
   ```

2. Apply the preprocessing job that will download and tokenize parts of the Pile dataset:
   ```sh
   helm install --set num_nodes=1 --set download_data=true "my-preprocessing" ./training
   ```

   The progress can then be monitored with:
   ```sh
   kubectl logs -f megatron-prep-my-preprocessing-mpimaster-0
   ```

3. Following successful preprocessing, the training can be started with:
   ```sh
   helm install --set num_nodes=1 "my-training-v0" ./training
   ```

   The progress can then be monitored with:
   ```sh
   kubectl logs -f megatron-train-my-training-v0-mpimaster-0
   ```

4. Calculate the training throughput. For this, the following data is required from the training output:
   ```
   [NeMo I 2025-03-10 16:24:43 perf_metrics_utils:42] train_step_timing in s: [7.13, 7.12, 7.12, 7.13, 7.13, 7.13, 7.12, 7.13, 7.14, 7.13, 7.14, 7.26, 7.13, 7.13, 7.13, 7.13, 7.15, 7.14, 7.14, 7.13, 7.14, 7.14, 7.14, 7.14, 7.13, 7.14, 7.14, 7.14, 7.14, 7.14]
   ```
   This log can be saved into a file with:
   ```sh
   kubectl logs megatron-train-my-training-v0-mpimaster-0 > training.log
   ```
   and the performance analyzed with:
   ```sh
   python3 utils/performance.py training.log
   ```
   A sketch of what such an analysis computes is shown below.

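   Throughput is commonly derived from the mean step time as tokens per second = global batch size × sequence length / step time. The following sketch parses the log line above and applies that formula; the batch size and sequence length are placeholders, and `utils/performance.py` remains the reference implementation:

   ```python
   # Sketch only: GLOBAL_BATCH_SIZE and SEQ_LENGTH are placeholders; use the
   # values from your training configuration.
   import ast
   import re
   import sys

   GLOBAL_BATCH_SIZE = 128
   SEQ_LENGTH = 8192

   timings = []
   with open(sys.argv[1]) as log:
       for line in log:
           # Extract the list printed after "train_step_timing in s:".
           match = re.search(r"train_step_timing in s: (\[.*\])", line)
           if match:
               timings = ast.literal_eval(match.group(1))

   if timings:
       mean_step = sum(timings) / len(timings)
       print(f"mean step time: {mean_step:.2f} s")
       print(f"throughput: {GLOBAL_BATCH_SIZE * SEQ_LENGTH / mean_step:,.0f} tokens/s")
   ```
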
## Potential Issues

* **PyTorch can't resolve hostnames via c10d**

  If the PyTorch rendezvous backend (c10d) fails to resolve OCI-style Kubernetes
  hostnames, this resolution failure can be worked around by augmenting
  `/etc/hosts` for every pod.

  For convenience, the required entries can be appended to `mpi.yaml` via
  ```sh
  ./utils/host_list.sh >> ./training/files/mpi.yaml
  ```
  and afterwards reinstalling the training job via Helm. An illustrative sketch of such host entries is shown below.

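  The exact content appended by `host_list.sh` depends on your cluster; purely as an illustration, a Kubernetes `hostAliases` stanza that maps worker hostnames to IP addresses looks like this (addresses and names are placeholders):

  ```yaml
  # Illustration only: IPs and hostnames are placeholders; host_list.sh
  # generates the entries for your actual nodes.
  hostAliases:
    - ip: "10.0.0.11"
      hostnames:
        - "gpu-worker-1"
    - ip: "10.0.0.12"
      hostnames:
        - "gpu-worker-2"
  ```
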
# Acknowledgments

- **Author** - Matthias Wolf (GPU Solution Specialist)

# License

Copyright (c) 2025 Oracle and/or its affiliates.

Licensed under the Universal Permissive License (UPL), Version 1.0.

See [LICENSE](https://github.com/oracle-devrel/technology-engineering/blob/main/LICENSE) for more details.
23 changes: 23 additions & 0 deletions
cloud-infrastructure/ai-infra-gpu/ai-infrastructure/nemo-megatron-training-oke/files/training/.helmignore

```
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*.orig
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/
```
7 changes: 7 additions & 0 deletions
cloud-infrastructure/ai-infra-gpu/ai-infrastructure/nemo-megatron-training-oke/files/training/Chart.yaml

```yaml
# Copyright (c) 2025 Oracle and/or its affiliates.
apiVersion: v2
name: training
description: A Helm chart to train LLMs on Kubernetes using NVIDIA NeMo and Megatron
type: application
version: 0.1.0
appVersion: "1.16.0"
```
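
As a quick sanity check, the chart defined above can be linted and rendered locally before installing; this sketch assumes it is run from the directory containing `training/`:

```sh
# Verify the chart and preview the rendered manifests before installing.
helm lint ./training
helm template my-training ./training
```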