adding llamav3 support on slurm and EKS #737
--num_layers=80
--num_heads=64
--model_type=llama_v3
--tokenizer=hf-internal-testing/llama-tokenizer
Are we sure we can use this tokenizer for Llama3?
Updated the tokenizer for the Llama 3 models.
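For context on the tokenizer question above, here is a minimal sketch of a sanity check. It assumes, to the best of my knowledge, that the original Llama tokenizer has a 32,000-token vocabulary while Llama 3 uses 128,256 tokens. The function name and constants are hypothetical and not part of this PR.

```python
# Assumed vocabulary sizes (not taken from this PR): the original Llama
# tokenizer has 32,000 tokens, Llama 3 tokenizers have 128,256.
LLAMA_V1_V2_VOCAB_SIZE = 32_000
LLAMA_V3_VOCAB_SIZE = 128_256


def check_tokenizer_matches_model(tokenizer_vocab_size: int, model_vocab_size: int) -> None:
    """Raise ValueError when the tokenizer and model vocabulary sizes differ.

    Hypothetical helper: a size mismatch is a cheap signal that the
    tokenizer was built for a different model family.
    """
    if tokenizer_vocab_size != model_vocab_size:
        raise ValueError(
            f"tokenizer vocabulary ({tokenizer_vocab_size}) does not match "
            f"model vocabulary ({model_vocab_size})"
        )


if __name__ == "__main__":
    # A matching pair passes silently.
    check_tokenizer_matches_model(LLAMA_V3_VOCAB_SIZE, LLAMA_V3_VOCAB_SIZE)
    # A Llama 2 style tokenizer paired with a Llama 3 config is rejected.
    try:
        check_tokenizer_matches_model(LLAMA_V1_V2_VOCAB_SIZE, LLAMA_V3_VOCAB_SIZE)
    except ValueError as err:
        print(f"mismatch detected: {err}")
```

In practice the tokenizer size could come from `len(AutoTokenizer.from_pretrained(...))` in `transformers`, and the model size from whatever vocabulary setting the training script uses.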
--model_type=llama_v3
--tokenizer=hf-internal-testing/llama-tokenizer
--checkpoint_freq=5000
--validation_freq=500
Please reduce the number of training steps and the validation and checkpoint frequencies. Otherwise this runs for too long.
Let's target about 5 minutes of run time for workshops.
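One rough way to pick values for a 5-minute budget is sketched below. It assumes you have measured the average time per training step on your cluster. The helper and the `max_steps` key are hypothetical; only `checkpoint_freq` and `validation_freq` appear in the diff. The split into four validation passes and two checkpoints is an arbitrary illustration.

```python
def steps_for_budget(budget_seconds: float, seconds_per_step: float) -> dict:
    """Suggest step counts that fit a wall-clock budget.

    Hypothetical helper: ignores the time spent in validation and
    checkpointing, so treat the result as an upper bound.
    """
    if budget_seconds <= 0 or seconds_per_step <= 0:
        raise ValueError("budget_seconds and seconds_per_step must be positive")
    max_steps = max(1, int(budget_seconds // seconds_per_step))
    return {
        "max_steps": max_steps,
        # Validate roughly four times during the run.
        "validation_freq": max(1, max_steps // 4),
        # Checkpoint roughly twice during the run.
        "checkpoint_freq": max(1, max_steps // 2),
    }


if __name__ == "__main__":
    # Example: 5 minutes at an assumed 2 seconds per step.
    print(steps_for_budget(budget_seconds=300, seconds_per_step=2.0))
```

Because validation and checkpoint writes also take time, the actual step count should sit somewhat below the value this returns.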
Updated the number of training steps and the validation and checkpoint frequencies for the Llama 3 models.
Changes are not reflected in the model files. Can you check?
Thank you for adding Llama3 models into FSDP test cases.
Left some comments. Can you please add documentation on how to run the Llama3 models? Happy to discuss, since the README is starting to get cluttered with so many models.
Or maybe you plan to add that in the documentation PR?
Perhaps we want to have a sub-directory per model? https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/pytorch/picotron
mhuguesaws
left a comment
You need to change the model files to change the number of steps. Please check whether you have committed those changes.
Updated the model files to change the number of training steps.
mhuguesaws
left a comment
Overall good pull request to add Llama3 models.
Can you add CI testing in https://github.com/aws-samples/awsome-distributed-training/blob/main/.github/workflows/fsdp-regression-test-container.yml#L23 and https://github.com/aws-samples/awsome-distributed-training/blob/main/.github/workflows/fsdp-regression-test-venv.yml#L23? Then we are good to merge.
Thanks, updated.
Waiting for tests to run before merging.
)

export TORCHRUN=torchrun
export TRAIN_SCRIPT=./train.py
Please rebase your PR on the latest FSDP change and regenerate the Slurm files.
The path to train.py is not correct for virtual environment settings and was fixed in #733.
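A wrong script path like this can be caught before `torchrun` is launched with a small pre-flight check. The sketch below is a hypothetical helper, not the fix from #733. It only assumes that the launch environment exports `TRAIN_SCRIPT`, as the generated Slurm files do.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_train_script(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the absolute path stored in TRAIN_SCRIPT, or raise if it is unusable.

    Hypothetical helper: `env` defaults to the process environment and can
    be replaced with a plain dict for testing.
    """
    env = os.environ if env is None else env
    raw = env.get("TRAIN_SCRIPT")
    if not raw:
        raise KeyError("TRAIN_SCRIPT is not set")
    # Relative paths such as ./train.py resolve against the current working directory.
    path = Path(raw).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"TRAIN_SCRIPT points to a missing file: {path}")
    return path


if __name__ == "__main__":
    try:
        print(f"training script: {resolve_train_script()}")
    except (KeyError, FileNotFoundError) as err:
        print(f"pre-flight check failed: {err}")
```

Running a check like this from the job's working directory shows whether a relative path such as `./train.py` resolves to an existing file in both the container and the venv setups.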
mhuguesaws
left a comment
Left comments; please address them. The CI is currently failing for the venv configuration.
Another problem is that the model does not run: https://github.com/aws-samples/awsome-distributed-training/actions/runs/15692204475/job/44240922662
Rebasing and resolving the above error.
Waiting on container tests to complete.
Issue #, if available:
Description of changes:
Added Llama 3 model configs and generated the corresponding Slurm sbatch and EKS YAML scripts.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.