
Commit adaa9c5

committed
update by comments
1 parent d05071f commit adaa9c5

File tree

1 file changed: +46 −39 lines changed

doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md

Lines changed: 46 additions & 39 deletions
@@ -2,29 +2,29 @@
We introduced how to create a PaddlePaddle Job with a single node on Kubernetes in the
previous document.
In this article, we will introduce how to create a PaddlePaddle job with multiple nodes
on a Kubernetes cluster.

## Overall Architecture

Before creating a training job, the users need to slice the training data and deploy
the Python scripts along with it into the distributed file system
(we can use different types of Kubernetes Volumes to mount different distributed
file systems). Before training starts, the program copies the training data into the
container and also saves the models at the same path during training. The overall architecture
is as follows:

![PaddlePaddle on Kubernetes Architecture](src/k8s-paddle-arch.png)

The above figure describes a distributed training architecture which contains 3 nodes. Each
Pod mounts a folder of the distributed file system as a Kubernetes Volume to save training
data and models. Kubernetes creates 3 Pods for this training phase and schedules them on
3 nodes, and each Pod has a PaddlePaddle container. After the containers are created,
PaddlePaddle starts up the communication between PServer and Trainer and reads the training
data for this training job.

As described above, we can start up a PaddlePaddle distributed training job on a
Kubernetes-ready cluster with the following steps:

1. [Build PaddlePaddle Docker Image](#Build a Docker Image)
1. [Split training data and upload to the distributed file system](#Upload Training Data)
@@ -35,16 +35,13 @@ We will introduce these steps as follows:

### Build a Docker Image

The training Docker image needs to package the `Paddle PServer` and `Paddle Trainer` runtimes, as well as two more steps before we can kick off the training:

- Copying the training data into the container.
- Generating the initialization arguments for the `Paddle PServer` and `Paddle Trainer` processes.

Since the official PaddlePaddle Docker image already includes the runtimes we need, we take it as the base image and pack the additional scripts for the steps mentioned above to build our training image. For more detail, please refer to the following link:
- https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/usage/cluster/src/k8s_train/Dockerfile

```bash
@@ -58,17 +55,17 @@ And then upload the new Docker Image to a Docker hub:
docker push [YOUR_REPO]/paddle:mypaddle
```

**[NOTE]**: in the above command, `[YOUR_REPO]` represents your Docker repository; you need to use your own repository name instead. We will use `[YOUR_REPO]/paddle:mypaddle` to refer to the Docker image built in this step.
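
For example, a minimal sketch assuming a Docker Hub account named `alice` (the account name is illustrative):

```bash
# Build the training image from the Dockerfile linked above,
# tagging it under your own repository ("alice" is a
# hypothetical Docker Hub account).
docker build -t alice/paddle:mypaddle .

# Upload the image so that the Kubernetes nodes can pull it.
docker push alice/paddle:mypaddle
```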

### Prepare Training Data

We can download and split the training data by creating a Kubernetes Job, or customize the image
by editing [k8s_train](./src/k8s_train/).

Before creating the Job, we need to bind a [persistentVolumeClaim](https://kubernetes.io/docs/user-guide/persistent-volumes) of the appropriate
type for the distributed file system; the generated dataset will be saved on this volume.
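
For reference, a minimal sketch of creating such a claim; the claim name `efs`, the access mode, and the requested size are illustrative assumptions, and a matching PersistentVolume or provisioner must already exist in the cluster:

```bash
# Create a PersistentVolumeClaim that the data-preparation Job
# (and later the training Pods) can mount. The name "efs" and
# the 10Gi request are placeholders for this sketch.
kubectl create -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
EOF
```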

```yaml
apiVersion: batch/v1
@@ -100,7 +97,13 @@ spec:
restartPolicy: Never
```

Create the Job with the following command:

```bash
kubectl create -f xxx.yaml
```

If created successfully, you can see some information like this:

```bash
[root@paddle-kubernetes-node0 nfsdir]$ tree -d
@@ -117,13 +120,13 @@ If success, you can see some information like this:
```

The `paddle-cluster-job` above is the job name for this training job; we need 3
PaddlePaddle training nodes and save the split training data under the `paddle-cluster-job` path.
The folders `0`, `1` and `2` represent the `training_id` of each node, the `quick_start` folder is used to store the training data, and the `output` folder is used to store the models and logs.
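
As a quick sanity check, a sketch assuming your shell is in the `nfsdir` shown in the tree above:

```bash
# List the per-node folders (0, 1, 2) and the data/output
# folders on the shared volume.
ls paddle-cluster-job/
```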

### Create a Job

Kubernetes allows users to create objects with YAML files, and we can use a command-line tool
to create them.

The Job YAML file describes which Docker Image will be used in this training job, how many nodes will be created, the startup arguments of the `Paddle PServer/Trainer` processes, and the type of Volumes. You can find the details of the YAML fields in
@@ -177,8 +180,8 @@ spec:
In the above YAML file:
- `metadata.name`, the job name.
- `parallelism`, the Kubernetes Job will create `parallelism` Pods at the same time.
- `completions`, the Job reaches the success status only when the number of successful Pods (the exit code is 0)
is equal to `completions`.
- `volumeMounts`, the name field `jobpath` is a key, the `mountPath` field represents
the path in the container, and we can define the `jobpath` in the `volumes` field, use `hostPath`
@@ -209,13 +212,15 @@ kubectl create -f job.yaml
```

Upon successful creation, Kubernetes will create 3 Pods as PaddlePaddle training nodes,
pull the Docker image and begin to train.
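
To verify that the three Pods are actually scheduled and running, a minimal sketch (assuming the job in `job.yaml` is named `paddle-cluster-job`, matching the listing above):

```bash
# Show the Pods created for this Job and their current phase.
kubectl get pods

# Inspect the Job's events and success/failure counters.
kubectl describe job paddle-cluster-job
```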

### Checkout the Output

During training, we can check the logs and the output models, which are stored in
the `output` folder.

**NOTE**: `node_0`, `node_1` and `node_2` represent the
`trainer_id` of the PaddlePaddle training job rather than the node id of Kubernetes.
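
To follow a trainer's log while the job runs, a sketch with plain kubectl; the Pod name is a placeholder to be replaced with a real one from `kubectl get pods`:

```bash
# Stream the log of one training Pod (replace the placeholder
# name with an actual Pod name).
kubectl logs -f paddle-cluster-job-xxxxx
```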

```bash
@@ -292,7 +297,7 @@ PADDLE_SERVER_NUM = os.getenv("CONF_PADDLE_GRADIENT_NUM")

### Communication between Pods

At the beginning of `start_paddle.py`, it initializes and parses the arguments.

```python
parser = argparse.ArgumentParser(prog="start_paddle.py",
@@ -314,11 +319,12 @@ And then query the status of all the other Pods of this Job by the function `get
idMap = getIdMap(podlist)
```

**NOTE**: `getPodList()` fetches all the Pods in the current namespace; if some unrelated
Pods are already running, it may cause errors. We will use [StatefulSets](https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets) instead of
Kubernetes Pod or Replicaset in the future.

The function `getIdMap(podlist)` fetches the IP addresses of the Pods in `podlist` and then sorts them
to generate `trainer_id`.

```python
def getIdMap(podlist):
@@ -340,9 +346,10 @@ so that we can start up them by `startPaddle(idMap, train_args_dict)`.

### Create Job

The main goal of `startPaddle` is generating the arguments for the `Paddle PServer` and
`Paddle Trainer` processes. Take `Paddle Trainer` as an example: we parse the
environment variables to get `PADDLE_NIC`, `PADDLE_PORT`, `PADDLE_PORTS_NUM`, etc.,
and finally find the `trainerId` from `idMap` according to the Pod's IP address.

```python
program = 'paddle train'
