adding llamav3 support on slurm and EKS by allela-roy · Pull Request #737 · awslabs/awsome-distributed-training

allela-roy · 2025-06-11T13:45:44Z

Issue #, if available:

Description of changes:
Added Llama 3 model configs and generated the corresponding Slurm sbatch and EKS YAML scripts.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

mhuguesaws · 2025-06-11T14:04:44Z

3.test_cases/pytorch/FSDP/models/llama3_1_70b.txt

+--num_layers=80
+--num_heads=64
+--model_type=llama_v3
+--tokenizer=hf-internal-testing/llama-tokenizer


Are we sure we can use this tokenizer for Llama3?

Updated the tokenizer for the Llama 3 models.
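For context on the tokenizer question, here is a minimal sketch of one compatibility check, assuming the concern is whether every token ID the tokenizer emits fits inside the model's embedding table. The function names and the numbers are illustrative and not taken from this PR.

```python
def ids_fit_embedding(token_ids, vocab_size):
    """Return True if every token ID indexes a valid row of the embedding table."""
    return all(0 <= token_id < vocab_size for token_id in token_ids)


def check_tokenizer_size(tokenizer_size, model_vocab_size):
    """Raise if the tokenizer can emit IDs the model has no embedding row for."""
    if tokenizer_size > model_vocab_size:
        raise ValueError(
            f"tokenizer has {tokenizer_size} tokens but the model "
            f"only has {model_vocab_size} embedding rows"
        )
    return True


# Illustrative values only (hypothetical, not read from the model files).
small_tokenizer_size = 32000
model_vocab_size = 32000

ok = check_tokenizer_size(small_tokenizer_size, model_vocab_size)
in_range = ids_fit_embedding([1, 15043, 31999], model_vocab_size)
out_of_range = ids_fit_embedding([1, 32000], model_vocab_size)
```

With Hugging Face `transformers`, the tokenizer size would typically come from `len(tokenizer)` after `AutoTokenizer.from_pretrained(...)`. How the training script derives the model's vocabulary size is not shown in this thread, so that part is an assumption.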

mhuguesaws · 2025-06-11T14:05:21Z

3.test_cases/pytorch/FSDP/models/llama3_1_70b.txt

+--model_type=llama_v3
+--tokenizer=hf-internal-testing/llama-tokenizer
+--checkpoint_freq=5000
+--validation_freq=500


Please change the number of steps and the validation and checkpoint frequencies. Otherwise this runs for too long.
Let's target about 5 minutes of run time for workshops.

Updated the number of steps and the validation and checkpoint frequencies for the Llama 3 models.

Changes are not reflected in the model files. Can you check?
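As a way to check whether such changes actually landed in a model file, here is a minimal sketch that parses the `--key=value` format shown in the diff and applies short-run overrides. The helper names, the override values, and the `max_steps` flag are assumptions for illustration and are not confirmed by this PR.

```python
def parse_model_args(text):
    """Parse '--key=value' lines into a dict; any other line is ignored."""
    args = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("--") and "=" in line:
            key, value = line[2:].split("=", 1)
            args[key] = value
    return args


def apply_overrides(text, overrides):
    """Return the file text with the given keys replaced; missing keys are appended."""
    remaining = dict(overrides)
    out = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("--") and "=" in line:
            key = line[2:].split("=", 1)[0]
            if key in remaining:
                out.append(f"--{key}={remaining.pop(key)}")
                continue
        out.append(raw)
    out.extend(f"--{key}={value}" for key, value in remaining.items())
    return "\n".join(out) + "\n"


# Lines copied from the diff excerpt above.
SAMPLE = """--model_type=llama_v3
--tokenizer=hf-internal-testing/llama-tokenizer
--checkpoint_freq=5000
--validation_freq=500
"""

# Hypothetical short-run values; 'max_steps' is an assumed flag name.
SHORT_RUN = {"checkpoint_freq": 50, "validation_freq": 10, "max_steps": 100}

updated = apply_overrides(SAMPLE, SHORT_RUN)
print(updated)
```

Comparing `parse_model_args` output before and after a commit gives a quick view of whether the step-related values changed in the model file itself, rather than only in the generated scripts.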

mhuguesaws

Thank you for adding Llama3 models into FSDP test cases.
I left some comments. Can you please add documentation on how to run the Llama3 models? Happy to discuss, since having so many models is starting to clutter the README.
Or maybe you plan to add that in the documentation PR?

KeitaW · 2025-06-11T22:35:38Z

Perhaps we want to have sub-directory per model? https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/pytorch/picotron

mhuguesaws

The model files need to change in order to change the number of steps. Please check whether you have committed those changes.

allela-roy · 2025-06-13T07:29:12Z

Updated model files to update number of training steps.

mhuguesaws

Overall good pull request to add Llama3 models.
Can you add CI testing in https://github.com/aws-samples/awsome-distributed-training/blob/main/.github/workflows/fsdp-regression-test-container.yml#L23 and https://github.com/aws-samples/awsome-distributed-training/blob/main/.github/workflows/fsdp-regression-test-venv.yml#L23? Then we are good to merge.

allela-roy · 2025-06-16T12:20:13Z

Thanks, updated.

mhuguesaws · 2025-06-17T08:50:17Z

Waiting for tests to run before merging.

mhuguesaws · 2025-06-17T08:55:41Z

3.test_cases/pytorch/FSDP/slurm/llama3_1_8b-training.sbatch

+)
+
+export TORCHRUN=torchrun
+export TRAIN_SCRIPT=./train.py


Please rebase your PR on the latest FSDP change and regenerate the Slurm files.
The path to train.py is not correct for virtual environment settings and was fixed in #733.
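To illustrate why a relative `TRAIN_SCRIPT` such as `./train.py` can break, here is a minimal sketch showing that the resolved location depends on the directory the job is launched from. This is not the fix from #733; the function name and the directory names are hypothetical.

```python
from pathlib import Path


def resolve_train_script(script, launch_dir):
    """Resolve a script path as a shell would: relative paths depend on launch_dir."""
    path = Path(script)
    if path.is_absolute():
        return path
    return Path(launch_dir) / path


# Hypothetical launch directories: the same relative value points at
# two different files depending on where the job starts.
from_slurm_dir = resolve_train_script("./train.py", "repo/FSDP/slurm")
from_other_dir = resolve_train_script("./train.py", "repo/FSDP/other")

print(from_slurm_dir)
print(from_other_dir)
```

If the script does not exist at the resolved location for a given setup, the launch fails, which is one plausible reading of the venv CI failure mentioned here.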

mhuguesaws

I left comments. Please address them. The CI is currently failing for the venv configuration.

mhuguesaws · 2025-06-17T09:39:58Z

Another problem is that the model does not run: https://github.com/aws-samples/awsome-distributed-training/actions/runs/15692204475/job/44240922662

allela-roy · 2025-06-17T09:41:46Z

Rebasing and resolving the above error.

mhuguesaws · 2025-06-18T06:50:14Z

Waiting on container tests to complete.

allela-roy mentioned this pull request Jun 11, 2025

Adding llamav3 FSDP sample-slurm #728

Closed

mhuguesaws reviewed Jun 11, 2025

mhuguesaws suggested changes Jun 11, 2025

mhuguesaws reviewed Jun 12, 2025

mhuguesaws reviewed Jun 15, 2025

mhuguesaws reviewed Jun 17, 2025

mhuguesaws suggested changes Jun 17, 2025

allela-roy added 4 commits June 17, 2025 09:44

adding llamav3 support on slurm and EKS

cd3e4c0

updating training steps, tokenizer for llama3 and removing legacy warning on tokenizer

dd1eae6

updating training steps on model files

fdd0f9d

rebasing PR to fix venv issues and validated tokenizer works with Llama3 models

288463b

allela-roy force-pushed the llamav3 branch from 89f4369 to 288463b Compare June 17, 2025 10:22

updating Llama3 context length

bbe6fc7

mhuguesaws self-requested a review June 18, 2025 09:33

mhuguesaws approved these changes Jun 18, 2025

mhuguesaws merged commit bc7c21a into main Jun 18, 2025
24 of 90 checks passed

mhuguesaws deleted the llamav3 branch June 18, 2025 09:34

KeitaW pushed a commit that referenced this pull request Feb 17, 2026

adding llamav3 support on slurm and EKS (#737)

f30b1bf
