Commit 22441f8

updating FSDP EKS documentation (#756)
* updating FSDP EKS documentation
* Update 3.test_cases/pytorch/FSDP/kubernetes/README.md

Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>
1 parent e65ab40 commit 22441f8

File tree

  • 3.test_cases/pytorch/FSDP/kubernetes

1 file changed: +70 -41 lines
3.test_cases/pytorch/FSDP/kubernetes/README.md

Lines changed: 70 additions & 41 deletions
@@ -2,25 +2,39 @@

These scripts provide an easy way to get started with multinode [FSDP](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html) training on EKS. It is designed to be as simple as possible, requires no data preparation, and uses a container image. If you would like to run FSDP with SLURM, please refer to [README.md](../slurm/README.md).

This document will walk you through running Llama 3.1 8B model training with FSDP. You will also find in this folder manifests to run Llama 2 (7B, 13B, 70B), Llama 3.1 (8B, 70B), Llama 3.2 (1B, 3B), Mixtral 8x7B, and Mistral Mathstral 7B.

## 0. Prerequisites

### 0.1. EKS Cluster
Before running this training, you'll need to create an Amazon EKS or a SageMaker HyperPod EKS cluster. Instructions can be found in [1.architectures](../../1.architectures), the [aws-do-eks](https://bit.ly/do-eks) project, or the [eks-blueprints](https://github.com/aws-ia/terraform-aws-eks-blueprints) project.

### 0.2. Connect to your EKS Cluster

Run the [aws eks update-kubeconfig](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/eks/update-kubeconfig.html) command to update your local kube config file (located at `~/.kube/config`) with the credentials and configuration needed to connect to your EKS cluster using the `kubectl` command.

```bash
aws eks update-kubeconfig --name <EKS_CLUSTER_NAME>
```

You can verify that you are connected to the EKS cluster by running this command:

```bash
kubectl config current-context
```

```
arn:aws:eks:us-west-1:xxxxxxxxxxxx:cluster/xxx-eks-cluster
```
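
Optionally, you can also check that the worker nodes are visible and Ready. This is a minimal sanity check; `node.kubernetes.io/instance-type` is a standard Kubernetes node label and should be present on EKS nodes:

```bash
# List worker nodes with their instance type; all nodes should report STATUS=Ready
kubectl get nodes -L node.kubernetes.io/instance-type
```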

### 0.3. Clone the awsome-distributed-training repository

Clone this repo.

```
git clone https://github.com/aws-samples/awsome-distributed-training/
cd awsome-distributed-training/3.test_cases/pytorch/FSDP/kubernetes
```

### 0.4. Envsubst
If the [envsubst](https://github.com/a8m/envsubst) utility is not available in your environment, please install it following the instructions appropriate for your operating system.
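
For example, on a Linux x86_64 machine you could install a prebuilt binary roughly as follows. This is only a sketch; the release tag shown is an assumption, so check the project's releases page for the current version:

```bash
# Download a prebuilt envsubst binary and place it on the PATH
# (release tag v1.4.2 is an assumption; adjust to the latest release)
curl -L "https://github.com/a8m/envsubst/releases/download/v1.4.2/envsubst-$(uname -s)-$(uname -m)" -o envsubst
chmod +x envsubst
sudo mv envsubst /usr/local/bin/
```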

### 0.5. Kubeflow training operator
Deploy the Kubeflow training operator.
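
Once the operator is deployed, a quick way to confirm it is up is to check its pod and the PyTorchJob CRD. This is an optional check and assumes the standalone install, which places the operator in the `kubeflow` namespace:

```bash
# The training operator pod should be Running
kubectl get pods -n kubeflow

# The PyTorchJob custom resource definition should be registered
kubectl get crd pytorchjobs.kubeflow.org
```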

@@ -36,11 +50,11 @@ export AWS_REGION=$(aws ec2 describe-availability-zones --output text --query 'A

```bash
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
pushd ../
docker build -f Dockerfile -t ${REGISTRY}fsdp:pytorch2.7.1 .
popd
```

The PyTorch FSDP container uses the [nccl-tests](https://github.com/aws-samples/awsome-distributed-training/blob/main/micro-benchmarks/nccl-tests/nccl-tests.Dockerfile) container as its base.
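
Before pushing the image, you can optionally verify that PyTorch is present at the expected version. This is a minimal sketch; it assumes `python` is on the image's PATH and that no entrypoint overrides the command:

```bash
# Print the PyTorch version baked into the image (expecting 2.7.x)
docker run --rm ${REGISTRY}fsdp:pytorch2.7.1 python -c "import torch; print(torch.__version__)"
```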

## 2. Push container image to Amazon ECR

@@ -58,26 +72,26 @@ echo "Logging in to $REGISTRY ..."
aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY

# Push image to registry
docker image push ${REGISTRY}fsdp:pytorch2.7.1
```
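
Optionally, you can confirm that the tag landed in the `fsdp` repository (the repository name here is derived from the image URI used above):

```bash
# List the tags stored in the fsdp ECR repository; pytorch2.7.1 should appear
aws ecr describe-images --repository-name fsdp --query 'imageDetails[].imageTags' --output text
```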

## 3. Data

For this example, we'll be using the [allenai/c4](https://huggingface.co/datasets/allenai/c4) dataset. Instead of downloading the entire dataset, the `create_streaming_dataloaders` function will stream it from [HuggingFace](https://huggingface.co/datasets), so there's no data prep required for running this training.

**For this dataset, we will need a Hugging Face access token**. First, create a [Hugging Face account](https://huggingface.co/welcome). Then [generate your access token with read permissions](https://huggingface.co/docs/hub/en/security-tokens). We will use this token and set it in our environment variables in the next step.
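
If you want to check the token before launching the job, one option is to call the Hub's whoami endpoint with it; the endpoint path is an assumption based on the current Hugging Face Hub API:

```bash
# Should return your account details as JSON if the token is valid
curl -s -H "Authorization: Bearer <YOUR HF ACCESS TOKEN>" https://huggingface.co/api/whoami-v2
```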

If you'd like to use your own dataset instead, you can do so by [formatting it as a HuggingFace dataset](https://huggingface.co/docs/datasets/create_dataset) and passing its location to the `--dataset_path` argument.

## 4. Launch Llama 3.1 8B training job

Generate the Kubernetes manifest and apply it to the cluster.

Create environment variables:

``` bash
cat << EOF > env_vars
export IMAGE_URI=${REGISTRY}fsdp:pytorch2.7.1
export INSTANCE_TYPE=<INSTANCE TYPE>
export NUM_NODES=<NUMBER OF NODES>
export GPU_PER_NODE=<NUMBER OF GPUS PER NODE>
@@ -86,21 +100,36 @@ export FI_PROVIDER=efa
export HF_TOKEN=<YOUR HF ACCESS TOKEN>
EOF
```

For reference, we run the Llama 3.1 8B model on 4 x p5.48xlarge instances with the following environment variable configuration:
``` bash
cat << EOF > env_vars
export IMAGE_URI=${REGISTRY}fsdp:pytorch2.7.1
export INSTANCE_TYPE=p5.48xlarge
export NUM_NODES=4
export GPU_PER_NODE=8
export EFA_PER_NODE=32
export FI_PROVIDER=efa
export HF_TOKEN=<YOUR HF ACCESS TOKEN>
EOF
```

Fill in `env_vars` and then source the variables:

``` bash
source env_vars
```

Apply the manifest:
``` bash
envsubst < llama3_1_8b-fsdp.yaml | kubectl apply -f -
```
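
If you want to inspect the rendered manifest before (or instead of) submitting it directly, you can write it to a file and run a client-side dry run. This is an optional sketch using standard `kubectl` behavior:

```bash
# Render the manifest with the sourced environment variables and validate it without creating anything
envsubst < llama3_1_8b-fsdp.yaml > /tmp/llama3_1_8b-fsdp-rendered.yaml
kubectl apply --dry-run=client -f /tmp/llama3_1_8b-fsdp-rendered.yaml
```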

EFA-level variables are available for adjustment in fsdp.yaml-template:
- Keep the `FI_*` values commented out for non-EFA instances (G5, G4d, P3) or P5.
- Uncomment the `FI_*` values for P4d instances.
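
To see which `FI_*` settings the manifest currently defines, and whether they are commented out, a simple grep works; the file name below matches the manifest used in this example:

```bash
# Show EFA-related environment settings in the manifest
grep -n "FI_" llama3_1_8b-fsdp.yaml
```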

You can also adjust the training parameters in `TRAINING_ARGS` (for example, to train Llama 3.1 70B). Additional parameters can be found in `model/arguments.py`. Note that we use the same directory for both `--checkpoint_dir` and `--resume_from_checkpoint`. If there are multiple checkpoints, `--resume_from_checkpoint` will automatically select the most recent one. This way, if our training is interrupted for any reason, it will automatically pick up the most recent checkpoint.

## 5. Monitor training job

@@ -112,62 +141,62 @@ kubectl get pods
```

```log
NAME                        STATE     AGE
llama3-1-8b-fsdp            Running   5m38s

NAME                        READY   STATUS    RESTARTS   AGE
llama3-1-8b-fsdp-worker-0   1/1     Running   0          5m39s
llama3-1-8b-fsdp-worker-1   1/1     Running   0          5m39s
llama3-1-8b-fsdp-worker-2   1/1     Running   0          5m39s
llama3-1-8b-fsdp-worker-3   1/1     Running   0          5m39s
```
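
You can also query the PyTorchJob resource itself for overall status and events. This assumes the Kubeflow training operator's `pytorchjobs` resource kind and the job name shown above:

```bash
# Overall job status
kubectl get pytorchjob llama3-1-8b-fsdp

# Detailed conditions and recent events
kubectl describe pytorchjob llama3-1-8b-fsdp
```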

Each of the pods produces job logs. One of the pods is elected master during job initialization. Only this pod will show the progress of the training job in its log. To find out which pod is currently the master, run the command below.

```bash
kubectl logs llama3-1-8b-fsdp-worker-0 | grep master_addr=
```

```log
I0620 14:27:39.789000 1 torch/distributed/elastic/agent/server/api.py:525] master_addr=llama3-1-8b-fsdp-worker-0
```

This shows that the pod `llama3-1-8b-fsdp-worker-0` is currently the master. To look at the current job logs, use the command below:

```bash
kubectl logs -f llama3-1-8b-fsdp-worker-0
```

```log
...
2025-06-20 14:17:10 I [train.py:103] Batch 90 Loss: 7.24291, Speed: 9.41 samples/sec, lr: 0.000010
2025-06-20 14:17:14 I [train.py:103] Batch 91 Loss: 7.27470, Speed: 8.94 samples/sec, lr: 0.000010
2025-06-20 14:17:17 I [train.py:103] Batch 92 Loss: 7.06632, Speed: 9.42 samples/sec, lr: 0.000010
2025-06-20 14:17:21 I [train.py:103] Batch 93 Loss: 7.17624, Speed: 8.96 samples/sec, lr: 0.000010
2025-06-20 14:17:24 I [train.py:103] Batch 94 Loss: 7.24291, Speed: 9.06 samples/sec, lr: 0.000010
2025-06-20 14:17:28 I [train.py:103] Batch 95 Loss: 7.13051, Speed: 9.05 samples/sec, lr: 0.000010
2025-06-20 14:17:32 I [train.py:103] Batch 96 Loss: 7.16901, Speed: 8.30 samples/sec, lr: 0.000010
2025-06-20 14:17:36 I [train.py:103] Batch 97 Loss: 7.50217, Speed: 8.51 samples/sec, lr: 0.000010
```
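
To tail logs from all workers at once rather than just the master, you can select the pods by the job-name label the training operator applies. The label key is an assumption based on recent training-operator releases:

```bash
# Stream logs from every worker pod, prefixing each line with the pod name
kubectl logs -f -l training.kubeflow.org/job-name=llama3-1-8b-fsdp --prefix --max-log-requests=10
```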

## 6. Stop training job

To stop the current training job, use the following command.

```bash
kubectl delete -f ./llama3_1_8b-fsdp.yaml
```

If you wish to launch a new job, you must first stop the previous one, even if it is in `Completed` state.
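
To confirm that the previous job and its pods are gone before launching a new one, a quick check (again assuming the PyTorchJob resource kind):

```bash
# Both commands should return no llama3-1-8b-fsdp resources once cleanup is complete
kubectl get pytorchjobs
kubectl get pods
```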

## References

Llama 2 and Llama 3.x model parameters are based on the values in the [Llama 2 paper](https://arxiv.org/abs/2307.09288) and the [Llama 3 paper](https://arxiv.org/abs/2407.21783).

| Parameter            | Llama 2 7B | Llama 2 13B | Llama 2 70B | Llama 3.1 8B | Llama 3.1 70B | Llama 3.2 1B | Llama 3.2 3B |
|----------------------|------------|-------------|-------------|--------------|---------------|--------------|--------------|
| intermediate_size    | 11008      | 13824       | 28672       | 14336        | 28672         | 8192         | 11008        |
| num_key_value_heads  | 32         | 40          | 8           | 8            | 8             | 8            | 8            |
| hidden_width         | 4096       | 5120        | 8192        | 4096         | 8192          | 2048         | 3072         |
| num_layers           | 32         | 40          | 80          | 32           | 80            | 16           | 28           |
| num_heads            | 32         | 40          | 64          | 32           | 64            | 32           | 24           |
| max_context_length   | 4096       | 4096        | 4096        | 8192         | 8192          | 8192         | 8192         |
