adding llamav3 support on slurm and EKS #737

Merged
mhuguesaws merged 5 commits into main from llamav3 on Jun 18, 2025

Conversation

@allela-roy
Contributor

Issue #, if available:

Description of changes:
Added Llama 3 model configs and generated the corresponding Slurm sbatch and EKS YAML scripts.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

--num_layers=80
--num_heads=64
--model_type=llama_v3
--tokenizer=hf-internal-testing/llama-tokenizer
Contributor

Are we sure we can use this tokenizer for Llama3?

Contributor Author

Updated the tokenizer for Llama 3 models.

--model_type=llama_v3
--tokenizer=hf-internal-testing/llama-tokenizer
--checkpoint_freq=5000
--validation_freq=500
Contributor

Please change the number of steps and the validation and checkpoint steps. Otherwise this runs for too long.
Let's aim for 5 minutes of run time for workshops.
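
One way to arrive at a step count for a fixed time budget is to divide the budget by a measured per-step time. The sketch below does only that arithmetic. The 3 seconds per step is a made-up placeholder to be replaced with a measurement from your own cluster, the `--max_steps` flag name is an assumption (only `--checkpoint_freq` and `--validation_freq` appear in this diff), and the divisors for validation and checkpointing are arbitrary illustrative choices, not values from this PR.

```shell
#!/usr/bin/env bash
# Sketch: derive step-related flags from a wall-clock budget.
# All numbers below are assumptions for illustration only.

TARGET_RUNTIME_SECONDS=300   # the 5-minute workshop budget mentioned in the review
SECONDS_PER_STEP=3           # placeholder; measure this on your own hardware

# Integer division: how many training steps fit in the budget.
MAX_STEPS=$(( TARGET_RUNTIME_SECONDS / SECONDS_PER_STEP ))

# Arbitrary choice: validate four times and checkpoint twice during the run.
VALIDATION_FREQ=$(( MAX_STEPS / 4 ))
CHECKPOINT_FREQ=$(( MAX_STEPS / 2 ))

# --max_steps is an assumed flag name; the other two appear in the diff above.
FLAGS="--max_steps=${MAX_STEPS} --validation_freq=${VALIDATION_FREQ} --checkpoint_freq=${CHECKPOINT_FREQ}"
echo "$FLAGS"
```

Validation passes and checkpoint writes take time of their own, so the real run will be somewhat longer than the step arithmetic alone suggests.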

Contributor Author

Updated the number of steps and the validation and checkpoint steps for Llama 3 models.

Contributor

Changes are not reflected in the model files. Can you check?
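
A quick way to confirm whether a committed model file carries the new values is to read the flags straight out of it. The helper below is a hypothetical sketch, not part of this PR: `flag_value` is an invented name, and it assumes the file holds one `--flag=value` per line, as in the hunks quoted above. The demo runs against a throwaway file, since the real model file paths are not shown here.

```shell
#!/usr/bin/env bash
# flag_value FILE FLAG
# Hypothetical helper: print the value of the first "--FLAG=value" line in FILE.
# Prints nothing if the flag is absent.
flag_value() {
  local file="$1" flag="$2"
  sed -n "s/^[[:space:]]*--${flag}=\(.*\)\$/\1/p" "$file" | head -n 1
}

# Demo on a temporary file containing the flags quoted in the review above.
demo_file="$(mktemp)"
cat > "$demo_file" <<'EOF'
--model_type=llama_v3
--tokenizer=hf-internal-testing/llama-tokenizer
--checkpoint_freq=5000
--validation_freq=500
EOF

for flag in checkpoint_freq validation_freq; do
  echo "${flag}=$(flag_value "$demo_file" "$flag")"
done
rm -f "$demo_file"
```

Pointing the same loop at the files actually committed on the branch shows whether the old values are still in place.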

Contributor

@mhuguesaws left a comment

Thank you for adding Llama3 models to the FSDP test cases.
I left some comments. Can you please add documentation on how to run the Llama3 models? Happy to discuss, since the README is starting to get cluttered with so many models.
Or maybe you plan to add that in the documentation PR?

@KeitaW
Collaborator

KeitaW commented Jun 11, 2025

Contributor

@mhuguesaws left a comment

You need to change the model files to change the number of steps. Please check whether you have committed those changes.

@allela-roy
Contributor Author

Updated model files to update number of training steps.

Contributor

@mhuguesaws left a comment

@allela-roy
Contributor Author

Thanks, updated.

@mhuguesaws
Contributor

Waiting for tests to run before merging.

)

export TORCHRUN=torchrun
export TRAIN_SCRIPT=./train.py
Contributor

Please rebase your PR on the latest FSDP change and regenerate the Slurm files.
The path to train.py is not correct for virtual environment settings and was fixed in #733.
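
A relative `./train.py` depends on the working directory at launch, which is one way such a path can break. The sketch below shows a generic guard that builds the path from an explicit base directory and fails early if the file is missing. It is not the fix applied in #733, whose contents are not shown here. `resolve_train_script` is a hypothetical helper, and using `SLURM_SUBMIT_DIR` (set by Slurm to the directory `sbatch` was invoked from) as the base is an assumption about where `train.py` lives.

```shell
#!/usr/bin/env bash
# resolve_train_script BASE_DIR [RELATIVE_PATH]
# Hypothetical helper: print BASE_DIR/RELATIVE_PATH if it is a regular file.
# Otherwise report the missing path on stderr and return 1.
resolve_train_script() {
  local base_dir="$1" rel="${2:-train.py}"
  # Drop one trailing slash from the base and a leading "./" from the path.
  local candidate="${base_dir%/}/${rel#./}"
  if [ ! -f "$candidate" ]; then
    echo "train script not found: ${candidate}" >&2
    return 1
  fi
  printf '%s\n' "$candidate"
}

# Possible use in an sbatch script (left commented so the sketch has no side effects).
# The assignment is kept separate from `export` because `export VAR=$(cmd)`
# would hide a failure of cmd behind export's own exit status.
#   TRAIN_SCRIPT="$(resolve_train_script "${SLURM_SUBMIT_DIR:-$PWD}")" || exit 1
#   export TRAIN_SCRIPT
```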

Contributor

@mhuguesaws left a comment

I left comments; please address them. The CI is currently failing for the venv configuration.

@mhuguesaws
Contributor

@allela-roy
Contributor Author

Rebasing and resolving the above error.

@mhuguesaws
Contributor

Waiting on container tests to complete.

@mhuguesaws self-requested a review June 18, 2025 09:33
@mhuguesaws merged commit bc7c21a into main Jun 18, 2025
24 of 90 checks passed
@mhuguesaws deleted the llamav3 branch June 18, 2025 09:34
KeitaW pushed a commit that referenced this pull request Feb 17, 2026
3 participants