Commit 22441f8

updating FSDP EKS documentation (#756)
* updating FSDP EKS documentation
* Update 3.test_cases/pytorch/FSDP/kubernetes/README.md

Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>
1 parent e65ab40 commit 22441f8

File tree

  • 3.test_cases/pytorch/FSDP/kubernetes

1 file changed: +70 -41 lines
3.test_cases/pytorch/FSDP/kubernetes/README.md

Lines changed: 70 additions & 41 deletions
@@ -2,25 +2,39 @@

These scripts provide an easy way to get started with multinode [FSDP](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html) training on EKS. It is designed to be as simple as possible, requires no data preparation, and uses a container image. If you would like to run FSDP with SLURM, please refer to [README.md](../slurm/README.md).

This document will walk you through running Llama 3.1 8B model training with FSDP. You will also find in this folder manifests to run Llama 2 (7B, 13B, 70B), Llama 3.1 (8B, 70B), Llama 3.2 (1B, 3B), Mixtral 8x7B, and Mistral Mathstral 7B.

## 0. Prerequisites

### 0.1. EKS Cluster
Before running this training, you'll need to create an Amazon EKS or a SageMaker HyperPod EKS cluster. Instructions can be found in [1.architectures](../../1.architectures), the [aws-do-eks](https://bit.ly/do-eks) project, or the [eks-blueprints](https://github.com/aws-ia/terraform-aws-eks-blueprints) project.

### 0.2. Connect to your EKS Cluster

Run the [aws eks update-kubeconfig](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/eks/update-kubeconfig.html) command to update your local kube config file (located at `~/.kube/config`) with the credentials and configuration needed to connect to your EKS cluster using the `kubectl` command.

```bash
aws eks update-kubeconfig --name <EKS_CLUSTER_NAME>
```

You can verify that you are connected to the EKS cluster by running this command:

```bash
kubectl config current-context
```

```
arn:aws:eks:us-west-1:xxxxxxxxxxxx:cluster/xxx-eks-cluster
```
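
Optionally, you can also check that the worker nodes are visible and Ready. This is a minimal sanity check; `node.kubernetes.io/instance-type` is a standard Kubernetes node label and should be present on EKS nodes:

```bash
# List worker nodes with their instance type; all nodes should report STATUS=Ready
kubectl get nodes -L node.kubernetes.io/instance-type
```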

### 0.3. Clone the awsome-distributed-training repository

Clone this repo.

```
git clone https://github.com/aws-samples/awsome-distributed-training/
cd awsome-distributed-training/3.test_cases/pytorch/FSDP/kubernetes
```

### 0.4. Envsubst
If the [envsubst](https://github.com/a8m/envsubst) utility is not available in your environment, please install it following the instructions appropriate for your operating system.
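
For example, on a Linux x86_64 machine you could install a prebuilt binary roughly as follows. This is only a sketch; the release tag shown is an assumption, so check the project's releases page for the current version:

```bash
# Download a prebuilt envsubst binary and place it on the PATH
# (release tag v1.4.2 is an assumption; adjust to the latest release)
curl -L "https://github.com/a8m/envsubst/releases/download/v1.4.2/envsubst-$(uname -s)-$(uname -m)" -o envsubst
chmod +x envsubst
sudo mv envsubst /usr/local/bin/
```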

### 0.5. Kubeflow training operator
Deploy the Kubeflow training operator.
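
Once the operator is deployed, a quick way to confirm it is up is to check its pod and the PyTorchJob CRD. This is an optional check and assumes the standalone install, which places the operator in the `kubeflow` namespace:

```bash
# The training operator pod should be Running
kubectl get pods -n kubeflow

# The PyTorchJob custom resource definition should be registered
kubectl get crd pytorchjobs.kubeflow.org
```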

@@ -36,11 +50,11 @@ export AWS_REGION=$(aws ec2 describe-availability-zones --output text --query 'A

```bash
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
pushd ../
docker build -f Dockerfile -t ${REGISTRY}fsdp:pytorch2.7.1 .
popd
```

The PyTorch FSDP container uses the [nccl-tests](https://github.com/aws-samples/awsome-distributed-training/blob/main/micro-benchmarks/nccl-tests/nccl-tests.Dockerfile) container as its base.
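
Before pushing the image, you can optionally verify that PyTorch is present at the expected version. This is a minimal sketch; it assumes `python` is on the image's PATH and that no entrypoint overrides the command:

```bash
# Print the PyTorch version baked into the image (expecting 2.7.x)
docker run --rm ${REGISTRY}fsdp:pytorch2.7.1 python -c "import torch; print(torch.__version__)"
```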

## 2. Push container image to Amazon ECR

@@ -58,26 +72,26 @@ echo "Logging in to $REGISTRY ..."
aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY

# Push image to registry
docker image push ${REGISTRY}fsdp:pytorch2.7.1
```
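
Optionally, you can confirm that the tag landed in the `fsdp` repository (the repository name here is derived from the image URI used above):

```bash
# List the tags stored in the fsdp ECR repository; pytorch2.7.1 should appear
aws ecr describe-images --repository-name fsdp --query 'imageDetails[].imageTags' --output text
```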

## 3. Data

For this example, we'll be using the [allenai/c4](https://huggingface.co/datasets/allenai/c4) dataset. Instead of downloading the entire dataset, the `create_streaming_dataloaders` function will stream it from [HuggingFace](https://huggingface.co/datasets), so there's no data prep required for running this training.

**For this dataset, we will need a Hugging Face access token**. First, create a [Hugging Face account](https://huggingface.co/welcome). Then [generate your access token with read permissions](https://huggingface.co/docs/hub/en/security-tokens). We will use this token and set it in our environment variables in the next step.
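
If you want to check the token before launching the job, one option is to call the Hub's whoami endpoint with it; the endpoint path is an assumption based on the current Hugging Face Hub API:

```bash
# Should return your account details as JSON if the token is valid
curl -s -H "Authorization: Bearer <YOUR HF ACCESS TOKEN>" https://huggingface.co/api/whoami-v2
```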

If you'd like to use your own dataset instead, you can do so by [formatting it as a HuggingFace dataset](https://huggingface.co/docs/datasets/create_dataset) and passing its location to the `--dataset_path` argument.

## 4. Launch Llama 3.1 8B training job

Generate the Kubernetes manifest and apply it to the cluster.

Create environment variables:

``` bash
cat << EOF > env_vars
export IMAGE_URI=${REGISTRY}fsdp:pytorch2.7.1
export INSTANCE_TYPE=<INSTANCE TYPE>
export NUM_NODES=<NUMBER OF NODES>
export GPU_PER_NODE=<NUMBER OF GPUS PER NODE>
@@ -86,21 +100,36 @@ export FI_PROVIDER=efa
export HF_TOKEN=<YOUR HF ACCESS TOKEN>
EOF
```

For reference, we run the Llama 3.1 8B model on 4 x p5.48xlarge instances with the following environment variable configuration:
``` bash
cat << EOF > env_vars
export IMAGE_URI=${REGISTRY}fsdp:pytorch2.7.1
export INSTANCE_TYPE=p5.48xlarge
export NUM_NODES=4
export GPU_PER_NODE=8
export EFA_PER_NODE=32
export FI_PROVIDER=efa
export HF_TOKEN=<YOUR HF ACCESS TOKEN>
EOF
```

Fill in `env_vars` and then source the variables:

``` bash
source env_vars
```

Apply the manifest:
``` bash
envsubst < llama3_1_8b-fsdp.yaml | kubectl apply -f -
```
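
If you want to inspect the rendered manifest before (or instead of) submitting it directly, you can write it to a file and run a client-side dry run. This is an optional sketch using standard `kubectl` behavior:

```bash
# Render the manifest with the sourced environment variables and validate it without creating anything
envsubst < llama3_1_8b-fsdp.yaml > /tmp/llama3_1_8b-fsdp-rendered.yaml
kubectl apply --dry-run=client -f /tmp/llama3_1_8b-fsdp-rendered.yaml
```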

EFA-level variables are available for adjustment in fsdp.yaml-template:
- Keep the `FI_*` values commented out for non-EFA instances (G5, G4d, P3) or P5.
- Uncomment the `FI_*` values for P4d instances.
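
To see which `FI_*` settings the manifest currently defines, and whether they are commented out, a simple grep works; the file name below matches the manifest used in this example:

```bash
# Show EFA-related environment settings in the manifest
grep -n "FI_" llama3_1_8b-fsdp.yaml
```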

You can also adjust the training parameters in `TRAINING_ARGS` (for example, to train Llama 3.1 70B). Additional parameters can be found in `model/arguments.py`. Note that we use the same directory for both `--checkpoint_dir` and `--resume_from_checkpoint`. If there are multiple checkpoints, `--resume_from_checkpoint` will automatically select the most recent one. This way, if our training is interrupted for any reason, it will automatically pick up the most recent checkpoint.

## 5. Monitor training job

@@ -112,62 +141,62 @@ kubectl get pods
```

```log
NAME                        STATE     AGE
llama3-1-8b-fsdp            Running   5m38s

NAME                        READY   STATUS    RESTARTS   AGE
llama3-1-8b-fsdp-worker-0   1/1     Running   0          5m39s
llama3-1-8b-fsdp-worker-1   1/1     Running   0          5m39s
llama3-1-8b-fsdp-worker-2   1/1     Running   0          5m39s
llama3-1-8b-fsdp-worker-3   1/1     Running   0          5m39s
```
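
You can also query the PyTorchJob resource itself for overall status and events. This assumes the Kubeflow training operator's `pytorchjobs` resource kind and the job name shown above:

```bash
# Overall job status
kubectl get pytorchjob llama3-1-8b-fsdp

# Detailed conditions and recent events
kubectl describe pytorchjob llama3-1-8b-fsdp
```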

Each of the pods produces job logs. One of the pods is elected master during job initialization. Only this pod will show the progress of the training job in its log. To find out which pod is currently the master, run the command below.

```bash
kubectl logs llama3-1-8b-fsdp-worker-0 | grep master_addr=
```

```log
I0620 14:27:39.789000 1 torch/distributed/elastic/agent/server/api.py:525] master_addr=llama3-1-8b-fsdp-worker-0
```

This shows that the pod `llama3-1-8b-fsdp-worker-0` is currently the master. To look at the current job logs, use the command below:

```bash
kubectl logs -f llama3-1-8b-fsdp-worker-0
```

```log
...
2025-06-20 14:17:10 I [train.py:103] Batch 90 Loss: 7.24291, Speed: 9.41 samples/sec, lr: 0.000010
2025-06-20 14:17:14 I [train.py:103] Batch 91 Loss: 7.27470, Speed: 8.94 samples/sec, lr: 0.000010
2025-06-20 14:17:17 I [train.py:103] Batch 92 Loss: 7.06632, Speed: 9.42 samples/sec, lr: 0.000010
2025-06-20 14:17:21 I [train.py:103] Batch 93 Loss: 7.17624, Speed: 8.96 samples/sec, lr: 0.000010
2025-06-20 14:17:24 I [train.py:103] Batch 94 Loss: 7.24291, Speed: 9.06 samples/sec, lr: 0.000010
2025-06-20 14:17:28 I [train.py:103] Batch 95 Loss: 7.13051, Speed: 9.05 samples/sec, lr: 0.000010
2025-06-20 14:17:32 I [train.py:103] Batch 96 Loss: 7.16901, Speed: 8.30 samples/sec, lr: 0.000010
2025-06-20 14:17:36 I [train.py:103] Batch 97 Loss: 7.50217, Speed: 8.51 samples/sec, lr: 0.000010
```
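
To tail logs from all workers at once rather than just the master, you can select the pods by the job-name label the training operator applies. The label key is an assumption based on recent training-operator releases:

```bash
# Stream logs from every worker pod, prefixing each line with the pod name
kubectl logs -f -l training.kubeflow.org/job-name=llama3-1-8b-fsdp --prefix --max-log-requests=10
```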

## 6. Stop training job

To stop the current training job, use the following command.

```bash
kubectl delete -f ./llama3_1_8b-fsdp.yaml
```

If you wish to launch a new job, you must first stop the previous one, even if it is in `Completed` state.
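
To confirm that the previous job and its pods are gone before launching a new one, a quick check (again assuming the PyTorchJob resource kind):

```bash
# Both commands should return no llama3-1-8b-fsdp resources once cleanup is complete
kubectl get pytorchjobs
kubectl get pods
```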

## References

Llama 2 and Llama 3.x model parameters are based on the values in the [Llama 2 paper](https://arxiv.org/abs/2307.09288) and the [Llama 3 paper](https://arxiv.org/abs/2407.21783).

| Parameter            | Llama 2 7B | Llama 2 13B | Llama 2 70B | Llama 3.1 8B | Llama 3.1 70B | Llama 3.2 1B | Llama 3.2 3B |
|----------------------|------------|-------------|-------------|--------------|---------------|--------------|--------------|
| intermediate_size    | 11008      | 13824       | 28672       | 14336        | 28672         | 8192         | 11008        |
| num_key_value_heads  | 32         | 40          | 8           | 8            | 8             | 8            | 8            |
| hidden_width         | 4096       | 5120        | 8192        | 4096         | 8192          | 2048         | 3072         |
| num_layers           | 32         | 40          | 80          | 32           | 80            | 16           | 28           |
| num_heads            | 32         | 40          | 64          | 32           | 64            | 32           | 24           |
| max_context_length   | 4096       | 4096        | 4096        | 8192         | 8192          | 8192         | 8192         |
