
Conversation

matz-e (Member) commented Mar 13, 2025

I have created a tutorial that demonstrates how to train an LLM on OKE using BM machines, with RDMA for optimal performance. A future improvement would be instructions for training on smaller shapes.

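For context, the sketch below shows the general shape of Kubernetes manifest such a tutorial targets: a single-node training Job on an OCI bare metal GPU shape that uses host networking and `IPC_LOCK` for RDMA. All names, the image tag, and the launch command are illustrative assumptions, not the actual files added in this PR.

```yaml
# Hypothetical sketch only: a bare-metal GPU training Job on OKE.
# The Job name, container image tag, and script path are assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  name: llama3-pretrain-example
spec:
  template:
    spec:
      restartPolicy: Never
      hostNetwork: true                      # expose the host's RDMA-capable interfaces
      containers:
        - name: trainer
          image: nvcr.io/nvidia/nemo:24.07   # assumed NeMo framework container tag
          command:
            - python
            - examples/nlp/language_modeling/megatron_gpt_pretraining.py
          resources:
            limits:
              nvidia.com/gpu: 8              # all GPUs on one BM GPU shape
          securityContext:
            capabilities:
              add: ["IPC_LOCK"]              # needed for RDMA memory registration
          volumeMounts:
            - name: shm
              mountPath: /dev/shm            # large shared memory for data loaders
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
```

Multi-node training would additionally need a distributed launcher (for example, a PyTorchJob or MPI-based operator) and the RDMA device plugin configured on the cluster; the tutorial in this PR covers the full setup.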
oracle-contributor-agreement bot added the OCA Verified label (All contributors have signed the Oracle Contributor Agreement.) on Mar 13, 2025
matz-e requested a review from AlexanderHodicke on March 13, 2025 at 10:17
AlexanderHodicke (Member) commented

@oheimburger Please also have a look at the code and config files.

matz-e marked this pull request as a draft on March 17, 2025 at 10:48
matz-e (Member, Author) commented Mar 17, 2025

I have a reorganization incoming that will make this easier to use. I will remove the draft status when it is ready again.

This should make it a lot simpler to get started.
matz-e requested a review from oheimburger on March 17, 2025 at 14:15
matz-e marked this pull request as ready for review on March 18, 2025 at 08:21
matz-e changed the title from "Create reusable assets to train Llama 2 on OKE." to "Create reusable assets to train Llama 3 on OKE." on Mar 18, 2025
AlexanderHodicke merged commit b738392 into main on Mar 25, 2025 (1 check passed)
matz-e deleted the nemo-megatron-training-on-k8s branch on March 25, 2025 at 15:59
