These scripts provide an easy way to get started with multinode [FSDP](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html) training on EKS. They are designed to be as simple as possible, require no data preparation, and use a container image. If you would like to run FSDP with SLURM, please refer to [README.md](../slurm/README.md).
This document will walk you through how to run Llama 3.1 8B model training with FSDP. You will also find in this folder manifests to run Llama 2 (7B, 13B, 70B), Llama 3.1 (8B, 70B), Llama 3.2 (1B, 3B), Mistral 8x7B, and Mistral Mathstral.
## 0. Prerequisites
### 0.1. EKS Cluster
Before running this training, you'll need to create an Amazon EKS or a SageMaker HyperPod EKS cluster. Instructions can be found in [1.architectures](../../1.architectures), the [aws-do-eks](https://bit.ly/do-eks) project, or the [eks-blueprints](https://github.com/aws-ia/terraform-aws-eks-blueprints) project.
### 0.2. Connect to your EKS Cluster
Run the [aws eks update-kubeconfig](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/eks/update-kubeconfig.html) command to update your local kube config file (located at ~/.kube/config) with the credentials and configuration needed to connect to your EKS cluster using the kubectl command.
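For example, a minimal sketch (the cluster name and region below are placeholders you need to replace with your own values):

```bash
# Update ~/.kube/config with credentials for your cluster
aws eks update-kubeconfig --name <EKS_CLUSTER_NAME> --region <AWS_REGION>

# Verify that kubectl can reach the cluster and list its nodes
kubectl get nodes
```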
### 0.3. awsome-distributed-training source code

Clone this repository and navigate to the test case directory:

```bash
git clone https://github.com/aws-samples/awsome-distributed-training/
cd awsome-distributed-training/3.test_cases/pytorch/FSDP/kubernetes
```
### 0.4. Envsubst
If the [envsubst](https://github.com/a8m/envsubst) utility is not available in your environment, please install it by following the instructions appropriate for your operating system.
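For example, a rough sketch of two common install routes (the Go module path follows the a8m/envsubst repository layout; the package name below is an assumption for apt-based systems):

```bash
# Option 1: install the Go implementation linked above (assumes Go is installed)
go install github.com/a8m/envsubst/cmd/envsubst@latest

# Option 2: on Debian/Ubuntu, the gettext-base package also ships an envsubst binary
sudo apt-get update && sudo apt-get install -y gettext-base
```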
### 0.5. Kubeflow training operator
Deploy the Kubeflow training operator:
```bash
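# A sketch of the standard Kubeflow training operator install; the release tag below is a
# placeholder, pin it to the training-operator version you want (see the project's releases page).
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=<RELEASE_TAG>"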
```

## 1. Build container image

The PyTorch FSDP container uses the [nccl-tests](https://github.com/aws-samples/awsome-distributed-training/blob/main/micro-benchmarks/nccl-tests/nccl-tests.Dockerfile) container as base.
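As a minimal sketch of the build step (the Dockerfile location below is an assumption; the tag matches the `IMAGE_URI` used later in this guide):

```bash
# Build the FSDP training image locally.
# Adjust the -f path to wherever the FSDP Dockerfile lives in the repository.
docker build -f ../Dockerfile -t fsdp:pytorch2.7.1 .
```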
## 2. Push container image to Amazon ECR
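As a minimal sketch (assuming `AWS_REGION` is set and `REGISTRY` holds your ECR registry prefix ending in `/`, e.g. `<ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/`):

```bash
# Create the repository if it does not exist yet, then log in, tag, and push the image.
aws ecr create-repository --repository-name fsdp --region "${AWS_REGION}" || true
aws ecr get-login-password --region "${AWS_REGION}" | \
  docker login --username AWS --password-stdin "${REGISTRY%/}"
docker tag fsdp:pytorch2.7.1 "${REGISTRY}fsdp:pytorch2.7.1"
docker push "${REGISTRY}fsdp:pytorch2.7.1"
```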
## 3. Data

For this example, we'll be using the [allenai/c4](https://huggingface.co/datasets/allenai/c4) dataset. Instead of downloading the entire dataset, the `create_streaming_dataloaders` function will stream the dataset from [HuggingFace](https://huggingface.co/datasets), so there's no data prep required for running this training.
**For this dataset, we will need a Hugging Face access token**. First, create a [Hugging Face account](https://huggingface.co/welcome). Then [generate your access token with read permissions](https://huggingface.co/docs/hub/en/security-tokens). We will use this token and set it in our environment variables in the next step.
If you'd like to instead use your own dataset, you can do so by [formatting it as a HuggingFace dataset](https://huggingface.co/docs/datasets/create_dataset), and passing its location to the `--dataset_path` argument.
## 4. Launch Llama 3.1 8B training job
Generate the Kubernetes manifest and apply it to the cluster.
Create environment variables:
```bash
cat <<EOF > env_vars
export IMAGE_URI=${REGISTRY}fsdp:pytorch2.7.1
export INSTANCE_TYPE=<INSTANCE TYPE>
export NUM_NODES=<NUMBER OF NODES>
export GPU_PER_NODE=<NUMBER OF GPUS PER NODE>
export FI_PROVIDER=efa
export HF_TOKEN=<YOUR HF ACCESS TOKEN>
EOF
```
For reference, we are running the Llama 3.1 8B model on 4 x p5.48xlarge instances and below is the configuration of our environment variables:
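As a sketch of what that configuration might look like, using only the variables from the `env_vars` block above (the instance type and node count follow from the sentence above, and a p5.48xlarge has 8 GPUs; everything else keeps its placeholder):

```bash
# Contents of env_vars for a 4-node p5.48xlarge run (HF_TOKEN still needs your real token)
export IMAGE_URI=${REGISTRY}fsdp:pytorch2.7.1
export INSTANCE_TYPE=p5.48xlarge
export NUM_NODES=4
export GPU_PER_NODE=8
export FI_PROVIDER=efa
export HF_TOKEN=<YOUR HF ACCESS TOKEN>
```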
EFA-level variables are available for adjustment in `fsdp.yaml-template`. Keep the `FI_*` values commented out for non-EFA instances (G5, G4d, P3) or P5, and uncomment them for P4d instances.
You can also adjust the training parameters in `TRAINING_ARGS` (for example, to train Llama 3.1 70B). Additional parameters can be found in `model/arguments.py`. Note that we use the same directory for both `--checkpoint_dir` and `--resume_from_checkpoint`. If there are multiple checkpoints, `--resume_from_checkpoint` will automatically select the most recent one. This way, if our training is interrupted for any reason, it will automatically pick up the most recent checkpoint.
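Putting the pieces together, a minimal sketch of the generate-and-apply step described at the top of this section (the template file name is an assumption based on the manifest name used later in this guide; adjust it to the template you are using):

```bash
# Load the environment variables, render the manifest template, and submit the job.
source ./env_vars
envsubst < ./llama3_1_8b-fsdp.yaml-template > ./llama3_1_8b-fsdp.yaml
kubectl apply -f ./llama3_1_8b-fsdp.yaml
```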
## 5. Monitor training job
To check the status of the training job and its worker pods, run:

```bash
kubectl get pytorchjobs
kubectl get pods
```
```log
NAME               STATE     AGE
llama3-1-8b-fsdp   Running   5m38s

NAME                        READY   STATUS    RESTARTS   AGE
llama3-1-8b-fsdp-worker-0   1/1     Running   0          5m39s
llama3-1-8b-fsdp-worker-1   1/1     Running   0          5m39s
llama3-1-8b-fsdp-worker-2   1/1     Running   0          5m39s
llama3-1-8b-fsdp-worker-3   1/1     Running   0          5m39s
```
Each of the pods produces job logs. One of the pods is elected master during job initialization. Only this pod will show the progress of the training job in its log. To find out which pod is currently the master, inspect the worker logs.
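For example, a minimal sketch using the pod names shown above (the worker whose log shows training progress is the elected master):

```bash
# Tail one worker's log; repeat for the other workers until you find the one
# that is printing training progress.
kubectl logs llama3-1-8b-fsdp-worker-0 --tail=20
```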
To stop the current training job, use the following command.
```bash
kubectl delete -f ./llama3_1_8b-fsdp.yaml
```
If you wish to launch a new job, you must first stop the previous one, even if it is in `Completed` state.
## References
Llama 2 and Llama 3.x model parameters are based on the values in the [Llama 2 paper](https://arxiv.org/abs/2307.09288) and the [Llama 3 paper](https://arxiv.org/abs/2407.21783).