
Commit 1d88ebe

Author: Yancey
Merge pull request #9789 from Yancey1989/k8s_dist_doc_en: "k8s dist train for en"
2 parents 0ad892a + 875d48d

1 file changed: 371 additions, 2 deletions
# Distributed Training on Kubernetes

We introduced how to create a PaddlePaddle Job with a single node on Kubernetes in the previous document. In this article, we will introduce how to create a PaddlePaddle job with multiple nodes on a Kubernetes cluster.

## Overall Architecture

Before creating a training job, the user needs to split the training data and deploy the Python scripts along with it into the distributed file system (we can mount different distributed file systems with different types of Kubernetes Volumes). Before training starts, the program copies the training data into the container and, during training, saves the models to the same path. The overall architecture is as follows:

![PaddlePaddle on Kubernetes Architecture](src/k8s-paddle-arch.png)

The above figure describes a distributed training architecture with 3 nodes; each Pod mounts a folder of the distributed file system as a Kubernetes Volume to store training data and models. Kubernetes creates 3 Pods for this training job and schedules them onto the 3 nodes; each Pod has a PaddlePaddle container. After the containers are created, PaddlePaddle starts up the communication between PServer and Trainer and reads the training data for this job.

As described above, we can start a PaddlePaddle distributed training job on a Kubernetes-ready cluster with the following steps:

1. [Build a PaddlePaddle Docker image](#build-a-docker-image)
1. [Split the training data and upload it to the distributed file system](#prepare-training-data)
1. [Edit a YAML file and create a Kubernetes Job](#create-a-job)
1. [Check the output](#check-the-output)

We will introduce these steps one by one below.

### Build a Docker Image

The training Docker image needs to package the `paddle pserver` and `paddle trainer` runtimes, as well as two more steps before we can kick off the training:

- Copying the training data into the container.
- Generating the initialization arguments for the `Paddle PServer` and `Paddle Trainer` processes.

Since the official PaddlePaddle Docker image already has the runtimes we need, we take it as the base image and add some additional scripts for the steps mentioned above to build our training image. For more detail, please refer to the following link:
- https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/usage/cluster/src/k8s_train/Dockerfile

```bash
$ cd doc/howto/usage/k8s/src/k8s_train
$ docker build -t [YOUR_REPO]/paddle:mypaddle .
```

Then push the new Docker image to a Docker registry:

```bash
docker push [YOUR_REPO]/paddle:mypaddle
```

**NOTE:** In the above commands, `[YOUR_REPO]` represents your Docker repository; replace it with your own repository name. In the rest of this document, `[YOUR_REPO]/paddle:mypaddle` refers to the Docker image built in this step.

### Prepare Training Data

We can download and split the training data by creating a Kubernetes Job, or customize the image by editing [k8s_train](./src/k8s_train/).

Before creating a Job, we need to bind a [persistentVolumeClaim](https://kubernetes.io/docs/user-guide/persistent-volumes) of the appropriate type for the underlying file system; the generated dataset will be saved on this volume.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: paddle-data
spec:
  template:
    metadata:
      name: pi
    spec:
      hostNetwork: true
      containers:
      - name: paddle-data
        image: paddlepaddle/paddle-tutorial:k8s_data
        imagePullPolicy: Always
        volumeMounts:
        - mountPath: "/mnt"
          name: nfs
        env:
        - name: OUT_DIR
          value: /home/work/mfs/paddle-cluster-job
        - name: SPLIT_COUNT
          value: "3"
      volumes:
      - name: nfs
        persistentVolumeClaim:
          claimName: mfs
      restartPolicy: Never
```

Create the Job with the following command:

```bash
> kubectl create -f xxx.yaml
```

If it is created successfully, you can see some output like this:

```bash
[root@paddle-kubernetes-node0 nfsdir]$ tree -d
.
`-- paddle-cluster-job
    |-- 0
    |   `-- data
    |-- 1
    |   `-- data
    |-- 2
    |   `-- data
    |-- output
    |-- quick_start
```

The `paddle-cluster-job` above is the name of this training job; we need 3 PaddlePaddle training nodes, so the split training data is saved under the `paddle-cluster-job` path. The folders `0`, `1` and `2` represent the `trainer_id` of each node, the `quick_start` folder is used to store training data, and the `output` folder is used to store the models and logs.

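The splitting step itself is not shown in this document. As an illustration only, here is a minimal sketch of what a `SPLIT_COUNT`-way split could look like; the function name and the round-robin strategy are our assumptions, not the actual `paddle-tutorial:k8s_data` implementation:

```python
import os

def split_dataset(lines, out_dir, split_count):
    """Distribute records round-robin into <out_dir>/<trainer_id>/data/,
    mirroring the 0/1/2 folders produced with SPLIT_COUNT=3."""
    shards = [[] for _ in range(split_count)]
    for i, line in enumerate(lines):
        shards[i % split_count].append(line)
    for trainer_id, shard in enumerate(shards):
        shard_dir = os.path.join(out_dir, str(trainer_id), "data")
        os.makedirs(shard_dir)
        with open(os.path.join(shard_dir, "part-00000"), "w") as f:
            f.writelines(shard)
```

Each trainer then reads only its own `<trainer_id>/data` folder, so the shards together must cover the full dataset.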
### Create a Job

Kubernetes allows users to create objects with YAML files, and we can use the command-line tool to create them.

The Job YAML file describes which Docker image will be used in this training job, how many nodes will be created, the startup arguments of the `Paddle PServer/Trainer` processes, and the type of Volumes. You can find the details of the YAML fields in the [Kubernetes Job API](http://kubernetes.io/docs/api-reference/batch/v1/definitions/#_v1_job). The following is an example for this training job:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: paddle-cluster-job
spec:
  parallelism: 3
  completions: 3
  template:
    metadata:
      name: paddle-cluster-job
    spec:
      volumes:
      - name: jobpath
        hostPath:
          path: /home/work/mfs
      containers:
      - name: trainer
        image: [YOUR_REPO]/paddle:mypaddle
        command: ["bin/bash", "-c", "/root/start.sh"]
        env:
        - name: JOB_NAME
          value: paddle-cluster-job
        - name: JOB_PATH
          value: /home/jobpath
        - name: JOB_NAMESPACE
          value: default
        - name: TRAIN_CONFIG_DIR
          value: recommendation
        - name: CONF_PADDLE_NIC
          value: eth0
        - name: CONF_PADDLE_PORT
          value: "7164"
        - name: CONF_PADDLE_PORTS_NUM
          value: "2"
        - name: CONF_PADDLE_PORTS_NUM_SPARSE
          value: "2"
        - name: CONF_PADDLE_GRADIENT_NUM
          value: "3"
        volumeMounts:
        - name: jobpath
          mountPath: /home/jobpath
      restartPolicy: Never
```

In the above YAML file:
- `metadata.name`: the Job name.
- `parallelism`: the Kubernetes Job will create `parallelism` Pods at the same time.
- `completions`: the Job reaches the success status only when the number of successful Pods (exit code 0) equals `completions`.
- `volumeMounts`: the `name` field `jobpath` is a key, and the `mountPath` field is the path inside the container; we define `jobpath` in the `volumes` field, using `hostPath` to configure the host path we want to mount.
- `env`: the environment variables in the container; we pass some startup arguments this way. In detail:
    - `JOB_PATH`: the mount path in the container.
    - `JOB_NAME`: the Job name.
    - `TRAIN_CONFIG_DIR`: the job path in the container; combined with `JOB_NAME`, we can find the training data path.
    - `CONF_PADDLE_NIC`: the `--nics` argument of the `Paddle PServer` process, the network device name.
    - `CONF_PADDLE_PORT`: the `--port` argument of the `Paddle PServer` process.
    - `CONF_PADDLE_PORTS_NUM`: the `--ports_num` argument of `Paddle PServer`, the number of ports for dense parameter updates.
    - `CONF_PADDLE_PORTS_NUM_SPARSE`: the `--ports_num_for_sparse` argument of `Paddle PServer`, the number of ports for sparse parameter updates.
    - `CONF_PADDLE_GRADIENT_NUM`: the number of training nodes, the `--num_gradient_servers` argument of `Paddle PServer` and `Paddle Trainer`.

You can find more detailed information [here](http://www.paddlepaddle.org/docs/develop/documentation/zh/howto/usage/cmd_parameter/detail_introduction_cn.html).

Once the YAML file is finished, we can use the Kubernetes command-line tool to create the Job:

```bash
kubectl create -f job.yaml
```

Upon successful creation, Kubernetes will create 3 Pods as PaddlePaddle training nodes, pull the Docker image, and begin training.

### Check the Output

During training, we can check the logs and the output models, which are stored in the `output` folder.

**NOTE:** `node_0`, `node_1` and `node_2` represent the `trainer_id` of the PaddlePaddle training job rather than the node id of Kubernetes.

```bash
[root@paddle-kubernetes-node0 output]# tree -d
.
├── node_0
│   ├── server.log
│   └── train.log
├── node_1
│   ├── server.log
│   └── train.log
├── node_2
......
├── pass-00002
│   ├── done
│   ├── ___embedding_0__.w0
│   ├── ___embedding_1__.w0
......
```

We can check the status of each training Pod by viewing its logs:

```bash
[root@paddle-kubernetes-node0 node_0]# cat train.log
I1116 09:10:17.123121    50 Util.cpp:155] commandline:
 /usr/local/bin/../opt/paddle/bin/paddle_trainer
    --nics=eth0 --port=7164
    --ports_num=2 --comment=paddle_process_by_paddle
    --pservers=192.168.129.66,192.168.223.143,192.168.129.71
    --ports_num_for_sparse=2 --config=./trainer_config.py
    --trainer_count=4 --num_passes=10 --use_gpu=0
    --log_period=50 --dot_period=10 --saving_period=1
    --local=0 --trainer_id=0
    --save_dir=/home/jobpath/paddle-cluster-job/output
I1116 09:10:17.123440    50 Util.cpp:130] Calling runInitFunctions
I1116 09:10:17.123764    50 Util.cpp:143] Call runInitFunctions done.
[WARNING 2016-11-16 09:10:17,227 default_decorators.py:40] please use keyword arguments in paddle config.
[INFO 2016-11-16 09:10:17,239 networks.py:1282] The input order is [movie_id, title, genres, user_id, gender, age, occupation, rating]
[INFO 2016-11-16 09:10:17,239 networks.py:1289] The output order is [__square_error_cost_0__]
I1116 09:10:17.392917    50 Trainer.cpp:170] trainer mode: Normal
I1116 09:10:17.613910    50 PyDataProvider2.cpp:257] loading dataprovider dataprovider::process
I1116 09:10:17.680917    50 PyDataProvider2.cpp:257] loading dataprovider dataprovider::process
I1116 09:10:17.681543    50 GradientMachine.cpp:134] Initing parameters..
I1116 09:10:18.012390    50 GradientMachine.cpp:141] Init parameters done.
I1116 09:10:18.018641    50 ParameterClient2.cpp:122] pserver 0 192.168.129.66:7164
I1116 09:10:18.018950    50 ParameterClient2.cpp:122] pserver 1 192.168.129.66:7165
I1116 09:10:18.019069    50 ParameterClient2.cpp:122] pserver 2 192.168.223.143:7164
I1116 09:10:18.019492    50 ParameterClient2.cpp:122] pserver 3 192.168.223.143:7165
I1116 09:10:18.019716    50 ParameterClient2.cpp:122] pserver 4 192.168.129.71:7164
I1116 09:10:18.019836    50 ParameterClient2.cpp:122] pserver 5 192.168.129.71:7165
```
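The log above lists six pserver endpoints although the job only has three nodes: with `--ports_num=2`, each pserver listens on `ports_num` consecutive ports starting at `--port`. As a quick sanity check (the helper name here is ours, not part of PaddlePaddle):

```python
def pserver_endpoints(ips, base_port, ports_num):
    # Each pserver host exposes ports_num consecutive ports,
    # giving len(ips) * ports_num endpoints in total.
    return ["%s:%d" % (ip, base_port + offset)
            for ip in ips
            for offset in range(ports_num)]

endpoints = pserver_endpoints(
    ["192.168.129.66", "192.168.223.143", "192.168.129.71"], 7164, 2)
# Yields the six host:port pairs shown as pserver 0..5 in the log above.
```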

## Some Additional Details

### Using Environment Variables

Usually we use environment variables to configure a PaddlePaddle Job running on Kubernetes; `start_paddle.py` provides a startup script that converts the environment variables into the startup arguments of the PaddlePaddle processes:

```python
API = "/api/v1/namespaces/"
JOBSELECTOR = "labelSelector=job-name="
JOB_PATH = os.getenv("JOB_PATH") + "/" + os.getenv("JOB_NAME")
JOB_PATH_OUTPUT = JOB_PATH + "/output"
JOBNAME = os.getenv("JOB_NAME")
NAMESPACE = os.getenv("JOB_NAMESPACE")
PADDLE_NIC = os.getenv("CONF_PADDLE_NIC")
PADDLE_PORT = os.getenv("CONF_PADDLE_PORT")
PADDLE_PORTS_NUM = os.getenv("CONF_PADDLE_PORTS_NUM")
PADDLE_PORTS_NUM_SPARSE = os.getenv("CONF_PADDLE_PORTS_NUM_SPARSE")
PADDLE_SERVER_NUM = os.getenv("CONF_PADDLE_GRADIENT_NUM")
```

### Communication between Pods

At the beginning of `start_paddle.py`, it initializes and parses the arguments:

```python
parser = argparse.ArgumentParser(prog="start_paddle.py",
                                 description='simple tool for k8s')
args, train_args_list = parser.parse_known_args()
train_args = refine_unknown_args(train_args_list)
train_args_dict = dict(zip(train_args[:-1:2], train_args[1::2]))
podlist = getPodList()
```
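The `dict(zip(train_args[:-1:2], train_args[1::2]))` idiom pairs even-indexed tokens (flag names) with odd-indexed tokens (values). For example, assuming `refine_unknown_args` has already stripped the leading `--` from each flag:

```python
# A flat, already-normalized flag/value token list ...
train_args = ["num_passes", "10", "use_gpu", "0", "trainer_count", "4"]
# ... becomes a flag -> value dict by zipping the two interleaved slices.
train_args_dict = dict(zip(train_args[:-1:2], train_args[1::2]))
```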

It then queries the status of all the other Pods of this Job through the function `getPodList()`, and fetches the `trainer_id` through the function `getIdMap(podlist)` once all the Pods are in the `RUNNING` state:

```python
podlist = getPodList()
# need to wait until all pods are running
while not isPodAllRunning(podlist):
    time.sleep(10)
    podlist = getPodList()
idMap = getIdMap(podlist)
```

**NOTE**: `getPodList()` fetches all the Pods in the current namespace, so if some unrelated Pods are already running, it may cause an error. We will use [StatefulSets](https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets) instead of Kubernetes Pods or ReplicaSets in the future.


The function `getIdMap(podlist)` fetches the IP addresses of the Pods in `podlist` and then sorts them to generate each `trainer_id`:

```python
def getIdMap(podlist):
    '''
    generate trainer_id by ip
    '''
    ips = []
    for pod in podlist["items"]:
        ips.append(pod["status"]["podIP"])
    ips.sort()
    idMap = {}
    for i in range(len(ips)):
        idMap[ips[i]] = i
    return idMap
```
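Note that the sort is lexicographic on the IP strings, so every Pod computes the same deterministic assignment as long as it sees the same Pod list. The following self-contained snippet mirrors that logic on a mocked `podlist`:

```python
# A mocked Kubernetes API response with the three Pod IPs from the log above.
podlist = {"items": [
    {"status": {"podIP": "192.168.223.143"}},
    {"status": {"podIP": "192.168.129.66"}},
    {"status": {"podIP": "192.168.129.71"}},
]}
# Same assignment rule as getIdMap: sort the IPs, number them in order.
ips = sorted(pod["status"]["podIP"] for pod in podlist["items"])
idMap = {ip: i for i, ip in enumerate(ips)}
```

Beware that a string sort differs from a numeric sort (e.g. `"10.0.0.10"` sorts before `"10.0.0.2"`); it is still deterministic, which is all the id assignment needs.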

After getting the `idMap`, we can generate the arguments of `Paddle PServer` and `Paddle Trainer`, and then start them up with `startPaddle(idMap, train_args_dict)`.

### Create Job

The main goal of `startPaddle` is to generate the arguments of the `Paddle PServer` and `Paddle Trainer` processes. Take `Paddle Trainer` as an example: it parses the environment variables to get `PADDLE_NIC`, `PADDLE_PORT`, `PADDLE_PORTS_NUM`, etc., and finally finds the `trainerId` in `idMap` according to its own IP address:

```python
program = 'paddle train'
args = " --nics=" + PADDLE_NIC
args += " --port=" + str(PADDLE_PORT)
args += " --ports_num=" + str(PADDLE_PORTS_NUM)
args += " --comment=" + "paddle_process_by_paddle"
ip_string = ""
for ip in idMap.keys():
    ip_string += (ip + ",")
ip_string = ip_string.rstrip(",")
args += " --pservers=" + ip_string
args_ext = ""
for key, value in train_args_dict.items():
    args_ext += (' --' + key + '=' + value)
localIP = socket.gethostbyname(socket.gethostname())
trainerId = idMap[localIP]
args += " " + args_ext + " --trainer_id=" + \
        str(trainerId) + " --save_dir=" + JOB_PATH_OUTPUT
```
