# Kubernetes Distributed Training

We introduced how to create a PaddlePaddle Job with a single node on Kubernetes in the
previous document.
In this article, we will introduce how to create a PaddlePaddle Job with multiple nodes
on a Kubernetes cluster.

## Overall Architecture

Before creating a training job, the user needs to deploy the Python scripts and the
training data, which has already been split, to the pre-configured path in the distributed file
system (we can use different types of Kubernetes Volumes to mount different distributed
file systems). Before training starts, the program copies the training data into the
container and also saves the models to the same path during training. The overall architecture
is as follows:

![PaddlePaddle on Kubernetes Architecture](src/k8s-paddle-arch.png)

The above figure describes a distributed training architecture with 3 nodes. Each
Pod mounts a folder of the distributed file system through a Kubernetes Volume to save training data and models.
Kubernetes creates 3 Pods for this training phase and schedules them on
3 nodes; each Pod has a PaddlePaddle container. After the containers have been created,
PaddlePaddle starts up the communication between the PServer and Trainer processes and reads the training
data for this training job.

As described above, we can start a PaddlePaddle distributed training job on a running
Kubernetes cluster with the following steps:

1. [Build PaddlePaddle Docker Image](#Build a Docker Image)
1. [Split training data and upload to the distributed file system](#Upload Training Data)
1. [Edit a YAML file and create a Kubernetes Job](#Create a Job)
1. [Check the output](#Check The Output)

We will introduce these steps as follows:

### Build a Docker Image

The PaddlePaddle Docker Image needs to support the runtime environment of the `Paddle PServer` and
`Paddle Trainer` processes, and this Docker Image has two important features:

- Copy the training data into the container.
- Generate the startup arguments of the `Paddle PServer` and `Paddle Trainer` processes.

The official Docker Image `paddlepaddle/paddle:latest` already includes the
PaddlePaddle executable files, but not the features above, so we use the official Docker Image as
a base image and add some additional scripts to build a new image.
You can reference the [Dockerfile](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/usage/cluster/src/k8s_train/Dockerfile).

```bash
$ cd doc/howto/usage/k8s/src/k8s_train
$ docker build -t [YOUR_REPO]/paddle:mypaddle .
```

Then upload the new Docker Image to your Docker registry:

```bash
docker push [YOUR_REPO]/paddle:mypaddle
```

**[NOTE]** In the above commands, `[YOUR_REPO]` represents your Docker repository;
you need to replace it with your own repository. We will use `[YOUR_REPO]/paddle:mypaddle` to
represent the Docker Image built in this step.

### Prepare Training Data

We can download and split the training data by creating a Kubernetes Job, or customize the image
by editing [k8s_train](./src/k8s_train/README.md).

Before creating the Job, we need to bind a [persistentVolumeClaim](https://kubernetes.io/docs/user-guide/persistent-volumes) that matches the type of
the distributed file system; the generated dataset will be saved on this volume.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: paddle-data
spec:
  template:
    metadata:
      name: pi
    spec:
      hostNetwork: true
      containers:
      - name: paddle-data
        image: paddlepaddle/paddle-tutorial:k8s_data
        imagePullPolicy: Always
        volumeMounts:
        - mountPath: "/mnt"
          name: nfs
        env:
        - name: OUT_DIR
          value: /home/work/mfs/paddle-cluster-job
        - name: SPLIT_COUNT
          value: "3"
      volumes:
      - name: nfs
        persistentVolumeClaim:
          claimName: mfs
      restartPolicy: Never
```

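Assuming the YAML above is saved as `paddle-data-job.yaml` (the filename here is just an example), the data-preparation Job can be created and monitored with the Kubernetes command-line tool:

```bash
# create the data-preparation Job
kubectl create -f paddle-data-job.yaml
# watch the Job until the SUCCESSFUL column reaches 1
kubectl get jobs paddle-data
# inspect the Pod created by the Job if anything goes wrong
kubectl get pods -l job-name=paddle-data
```
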
If it succeeds, you can see some output like this:

```bash
[root@paddle-kubernetes-node0 nfsdir]$ tree -d
.
`-- paddle-cluster-job
    |-- 0
    |   `-- data
    |-- 1
    |   `-- data
    |-- 2
    |   `-- data
    |-- output
    |-- quick_start
```

The `paddle-cluster-job` above is the job name for this training job; we need 3
PaddlePaddle training nodes and save the split training data under the `paddle-cluster-job` path.
The folders `0`, `1` and `2` represent the `trainer_id` of each node, the `quick_start` folder is used to store training data, and the `output` folder is used to store the models and logs.

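As a hedged illustration only (the variable names below are assumptions, not the tutorial's actual start script), each training Pod could locate its own shard of this layout by combining the mounted job path with its `trainer_id`:

```bash
# hypothetical sketch: pick the data shard that belongs to this trainer
TRAINER_ID=0   # derived from the Pod list in start_paddle.py
DATA_DIR=${JOB_PATH}/${JOB_NAME}/${TRAINER_ID}/data
cp -r ${DATA_DIR}/* /root/data/
```
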
### Create a Job

Kubernetes allows users to create objects with YAML files, and we can use a command-line tool
to create them.

The Job YAML file describes which Docker Image is used in this training job, how many nodes are created, the startup arguments of the `Paddle PServer/Trainer` processes, and the type of Volumes. You can find the details of the YAML fields in the
[Kubernetes Job API](http://kubernetes.io/docs/api-reference/batch/v1/definitions/#_v1_job).
The following is an example for this training job:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: paddle-cluster-job
spec:
  parallelism: 3
  completions: 3
  template:
    metadata:
      name: paddle-cluster-job
    spec:
      volumes:
      - name: jobpath
        hostPath:
          path: /home/work/mfs
      containers:
      - name: trainer
        image: [YOUR_REPO]/paddle:mypaddle
        command: ["bin/bash", "-c", "/root/start.sh"]
        env:
        - name: JOB_NAME
          value: paddle-cluster-job
        - name: JOB_PATH
          value: /home/jobpath
        - name: JOB_NAMESPACE
          value: default
        - name: TRAIN_CONFIG_DIR
          value: recommendation
        - name: CONF_PADDLE_NIC
          value: eth0
        - name: CONF_PADDLE_PORT
          value: "7164"
        - name: CONF_PADDLE_PORTS_NUM
          value: "2"
        - name: CONF_PADDLE_PORTS_NUM_SPARSE
          value: "2"
        - name: CONF_PADDLE_GRADIENT_NUM
          value: "3"
        volumeMounts:
        - name: jobpath
          mountPath: /home/jobpath
      restartPolicy: Never
```

In the above YAML file:
- `metadata.name`, the job name.
- `parallelism`, the Kubernetes Job would create `parallelism` Pods at the same time.
- `completions`, the Job reaches the success status only when the number of successful Pods (the exit code is 0)
is equal to `completions`.
- `volumeMounts`, the name field `jobpath` is a key, the `mountPath` field represents
the path in the container, and we can define `jobpath` in the `volumes` field, using `hostPath`
to configure the host path we want to mount.
- `env`, the environment variables in the container; we pass some startup arguments by
this approach, some details are as follows:
  - JOB_PATH: the mount path in the container.
  - JOB_NAME: the job name.
  - TRAIN_CONFIG_DIR: the job path in the container; we can find the training data path by
combining it with JOB_NAME.
  - CONF_PADDLE_NIC: the argument `--nics` of the `Paddle PServer` process, the network
device name.
  - CONF_PADDLE_PORT: the argument `--port` of the `Paddle PServer` process.
  - CONF_PADDLE_PORTS_NUM: the argument `--ports_num` of `Paddle PServer`, the port number
for dense parameter update.
  - CONF_PADDLE_PORTS_NUM_SPARSE: the argument `--ports_num_for_sparse` of `Paddle PServer`,
the port number for sparse parameter update.
  - CONF_PADDLE_GRADIENT_NUM: the number of training nodes, the argument
`--num_gradient_servers` of `Paddle PServer` and `Paddle Trainer`.

You can find more detailed information at [here](http://www.paddlepaddle.org/docs/develop/documentation/zh/howto/usage/cmd_parameter/detail_introduction_cn.html).

Once the YAML file is ready, we can use the Kubernetes command-line tool to create the Job:

```bash
kubectl create -f job.yaml
```

Upon successful creation, Kubernetes would create 3 Pods as PaddlePaddle training nodes,
pull the Docker image and begin to train.

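To verify that the Pods were scheduled and started correctly, you can use kubectl (a hedged example; actual Pod names will differ in your cluster):

```bash
# list the Pods created by this Job
kubectl get pods -l job-name=paddle-cluster-job
# follow the console output of one of the trainer Pods
kubectl logs -f <pod-name>
```
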
### Check the Output

During training, we can check the logs and the output models; in this example we store
the output in the `output` folder. **NOTE**: `node_0`, `node_1` and `node_2` represent the
`trainer_id` of the PaddlePaddle training job rather than the node id of Kubernetes.

```bash
[root@paddle-kubernetes-node0 output]# tree -d
.
├── node_0
│   ├── server.log
│   └── train.log
├── node_1
│   ├── server.log
│   └── train.log
├── node_2
......
├── pass-00002
│   ├── done
│   ├── ___embedding_0__.w0
│   ├── ___embedding_1__.w0
......
```

We can check the status of each training Pod by viewing the logs:

```bash
[root@paddle-kubernetes-node0 node_0]# cat train.log
I1116 09:10:17.123121 50 Util.cpp:155] commandline:
 /usr/local/bin/../opt/paddle/bin/paddle_trainer
    --nics=eth0 --port=7164
    --ports_num=2 --comment=paddle_process_by_paddle
    --pservers=192.168.129.66,192.168.223.143,192.168.129.71
    --ports_num_for_sparse=2 --config=./trainer_config.py
    --trainer_count=4 --num_passes=10 --use_gpu=0
    --log_period=50 --dot_period=10 --saving_period=1
    --local=0 --trainer_id=0
    --save_dir=/home/jobpath/paddle-cluster-job/output
I1116 09:10:17.123440 50 Util.cpp:130] Calling runInitFunctions
I1116 09:10:17.123764 50 Util.cpp:143] Call runInitFunctions done.
[WARNING 2016-11-16 09:10:17,227 default_decorators.py:40] please use keyword arguments in paddle config.
[INFO 2016-11-16 09:10:17,239 networks.py:1282] The input order is [movie_id, title, genres, user_id, gender, age, occupation, rating]
[INFO 2016-11-16 09:10:17,239 networks.py:1289] The output order is [__square_error_cost_0__]
I1116 09:10:17.392917 50 Trainer.cpp:170] trainer mode: Normal
I1116 09:10:17.613910 50 PyDataProvider2.cpp:257] loading dataprovider dataprovider::process
I1116 09:10:17.680917 50 PyDataProvider2.cpp:257] loading dataprovider dataprovider::process
I1116 09:10:17.681543 50 GradientMachine.cpp:134] Initing parameters..
I1116 09:10:18.012390 50 GradientMachine.cpp:141] Init parameters done.
I1116 09:10:18.018641 50 ParameterClient2.cpp:122] pserver 0 192.168.129.66:7164
I1116 09:10:18.018950 50 ParameterClient2.cpp:122] pserver 1 192.168.129.66:7165
I1116 09:10:18.019069 50 ParameterClient2.cpp:122] pserver 2 192.168.223.143:7164
I1116 09:10:18.019492 50 ParameterClient2.cpp:122] pserver 3 192.168.223.143:7165
I1116 09:10:18.019716 50 ParameterClient2.cpp:122] pserver 4 192.168.129.71:7164
I1116 09:10:18.019836 50 ParameterClient2.cpp:122] pserver 5 192.168.129.71:7165
```

## Some Additional Details

### Using Environment Variables

Usually we use environment variables to configure a PaddlePaddle Job running on
Kubernetes; `start_paddle.py` provides a startup script that converts the environment variables
into the startup arguments of the PaddlePaddle processes:

```python
API = "/api/v1/namespaces/"
JOBSELECTOR = "labelSelector=job-name="
JOB_PATH = os.getenv("JOB_PATH") + "/" + os.getenv("JOB_NAME")
JOB_PATH_OUTPUT = JOB_PATH + "/output"
JOBNAME = os.getenv("JOB_NAME")
NAMESPACE = os.getenv("JOB_NAMESPACE")
PADDLE_NIC = os.getenv("CONF_PADDLE_NIC")
PADDLE_PORT = os.getenv("CONF_PADDLE_PORT")
PADDLE_PORTS_NUM = os.getenv("CONF_PADDLE_PORTS_NUM")
PADDLE_PORTS_NUM_SPARSE = os.getenv("CONF_PADDLE_PORTS_NUM_SPARSE")
PADDLE_SERVER_NUM = os.getenv("CONF_PADDLE_GRADIENT_NUM")
```

### Communication between Pods

At the beginning of `start_paddle.py`, it initializes and parses the arguments.

```python
parser = argparse.ArgumentParser(prog="start_paddle.py",
                                 description='simple tool for k8s')
args, train_args_list = parser.parse_known_args()
train_args = refine_unknown_args(train_args_list)
train_args_dict = dict(zip(train_args[:-1:2], train_args[1::2]))
podlist = getPodList()
```

It then queries the status of all the Pods of this Job with the function `getPodList()`, and fetches the `trainer_id` with the function `getIdMap(podlist)` once all the Pods' status is `RUNNING`.

```python
podlist = getPodList()
# need to wait until all pods are running
while not isPodAllRunning(podlist):
    time.sleep(10)
    podlist = getPodList()
idMap = getIdMap(podlist)
```

**NOTE**: `getPodList()` fetches all the Pods in the current namespace, so if unrelated Pods are running, this may cause errors. We will use [StatefulSets](https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets) instead of
Kubernetes Pods or ReplicaSets in the future.

As for the implementation of `getIdMap(podlist)`, this function fetches the IP address of each Pod in
`podlist` and then sorts them to generate the `trainer_id`s.

```python
def getIdMap(podlist):
    '''
    generate trainer_id by ip
    '''
    ips = []
    for pod in podlist["items"]:
        ips.append(pod["status"]["podIP"])
    ips.sort()
    idMap = {}
    for i in range(len(ips)):
        idMap[ips[i]] = i
    return idMap
```

After getting the `idMap`, we can generate the arguments of `Paddle PServer` and `Paddle Trainer`
and start them with `startPaddle(idMap, train_args_dict)`.

### Create Job

The main goal of `startPaddle` is to generate the arguments of the `Paddle PServer` and `Paddle Trainer` processes. Taking `Paddle Trainer` as an example, we parse the environment variables to get
`PADDLE_NIC`, `PADDLE_PORT`, `PADDLE_PORTS_NUM`, etc., and finally find the `trainerId` from
`idMap` according to the Pod's own IP address.

```python
program = 'paddle train'
args = " --nics=" + PADDLE_NIC
args += " --port=" + str(PADDLE_PORT)
args += " --ports_num=" + str(PADDLE_PORTS_NUM)
args += " --comment=" + "paddle_process_by_paddle"
ip_string = ""
for ip in idMap.keys():
    ip_string += (ip + ",")
ip_string = ip_string.rstrip(",")
args += " --pservers=" + ip_string
args_ext = ""
for key, value in train_args_dict.items():
    args_ext += (' --' + key + '=' + value)
localIP = socket.gethostbyname(socket.gethostname())
trainerId = idMap[localIP]
args += " " + args_ext + " --trainer_id=" + \
    str(trainerId) + " --save_dir=" + JOB_PATH_OUTPUT
```
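
For completeness, here is a hedged sketch of the corresponding `Paddle PServer` command that could be assembled from the same environment variables (the exact invocation inside `start_paddle.py` may differ; the flags simply mirror the environment variables described earlier):

```bash
# illustrative only: start a parameter server with flags mapped from the env variables
paddle pserver \
    --nics=${CONF_PADDLE_NIC} \
    --port=${CONF_PADDLE_PORT} \
    --ports_num=${CONF_PADDLE_PORTS_NUM} \
    --ports_num_for_sparse=${CONF_PADDLE_PORTS_NUM_SPARSE} \
    --num_gradient_servers=${CONF_PADDLE_GRADIENT_NUM} \
    --comment=paddle_process_by_paddle
```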
