Create reusable assets to train Llama 3 on OKE. #1627

matz-e · 2025-03-13T10:15:29Z

I have created a tutorial that demonstrates how to train an LLM on OKE using bare-metal (BM) machines, with RDMA for optimal performance. A future improvement would be instructions for training on smaller shapes.

AlexanderHodicke · 2025-03-13T10:18:48Z

@oheimburger Please also have a look - code and config files

matz-e · 2025-03-17T10:48:40Z

I have a reorganization incoming that will make this easier to use. Will remove draft status when ready again.

This should make it a lot simpler to get started.

...ructure/ai-infra-gpu/ai-infrastructure/nemo-megatron-training-oke/files/utils/performance.py

...structure/ai-infra-gpu/ai-infrastructure/nemo-megatron-training-oke/files/utils/host_list.sh

...ructure/ai-infra-gpu/ai-infrastructure/nemo-megatron-training-oke/files/training/values.yaml

...nfra-gpu/ai-infrastructure/nemo-megatron-training-oke/files/training/templates/training.yaml

...e/ai-infra-gpu/ai-infrastructure/nemo-megatron-training-oke/files/training/templates/pv.yaml

...ai-infra-gpu/ai-infrastructure/nemo-megatron-training-oke/files/training/files/sort_hosts.sh

...u/ai-infrastructure/nemo-megatron-training-oke/files/training/files/config_llama3_8b_v2.yaml

...-gpu/ai-infrastructure/nemo-megatron-training-oke/files/training/files/config_llama3_8b.yaml

...gpu/ai-infrastructure/nemo-megatron-training-oke/files/training/files/config_llama3_70b.yaml

...tructure/ai-infra-gpu/ai-infrastructure/nemo-megatron-training-oke/files/training/Chart.yaml


Create reusable assets to train Llama 2 on OKE.

b08a6ad

I have created a tutorial that demonstrates how to train an LLM on OKE using bare-metal (BM) machines, with RDMA for optimal performance. A future improvement would be instructions for training on smaller shapes.

oracle-contributor-agreement bot added the OCA Verified label (All contributors have signed the Oracle Contributor Agreement.) Mar 13, 2025

matz-e requested a review from AlexanderHodicke March 13, 2025 10:17

AlexanderHodicke requested a review from oheimburger March 13, 2025 10:18

Fix review date and first shape mention.

6d19e72

matz-e marked this pull request as draft March 17, 2025 10:48

Reorganize as a Helm chart.

026b217

This should make it a lot simpler to get started.

oheimburger previously requested changes Mar 17, 2025


Add copyright lines.

506993c

matz-e requested a review from oheimburger March 17, 2025 14:15

matz-e marked this pull request as ready for review March 18, 2025 08:21

Add PVC.

9e90718

matz-e changed the title ~~Create reusable assets to train Llama 2 on OKE.~~ Create reusable assets to train Llama 3 on OKE. Mar 18, 2025

matz-e added 2 commits March 18, 2025 13:44

Increase storage for containers.

55259d6

Small updates to README.

c70d5cc

AlexanderHodicke approved these changes Mar 25, 2025


Merge branch 'main' into nemo-megatron-training-on-k8s

9877228

AlexanderHodicke merged commit b738392 into main Mar 25, 2025
1 check passed

matz-e deleted the nemo-megatron-training-on-k8s branch March 25, 2025 15:59
