# Kubernetes Distributed Training

We introduced how to create a PaddlePaddle job on a single node on Kubernetes in the
previous document. In this article, we will introduce how to create a PaddlePaddle job
with multiple nodes on a Kubernetes cluster.

## Overall Architecture

Before creating a training job, the users need to deploy the Python scripts and the
training data, which has already been split, to a pre-specified path in the distributed file
system (we can use different types of Kubernetes Volumes to mount different distributed
file systems). Before training starts, the program copies the training data into the
container and also saves the models to the same path during training. The overall architecture
is as follows:

The above figure describes a distributed training architecture which contains 3 nodes. Each
Pod mounts a folder of the distributed file system through a Kubernetes Volume to save training
data and models. Kubernetes creates 3 Pods for this training phase and schedules them on
3 nodes; each Pod has a PaddlePaddle container. After the containers have been created,
PaddlePaddle starts up the communication between PServer and Trainer and reads the training
data for this training job.

As described above, we can start up a PaddlePaddle distributed training job on a ready
Kubernetes cluster with the following steps:

1. [Build PaddlePaddle Docker Image](#Build a Docker Image)
1. [Split training data and upload to the distributed file system](#Prepare Training Data)
1. [Edit a YAML file and create a Kubernetes Job](#Create a Job)
1. [Check the output](#Check the Output)

We will introduce these steps in detail below.

### Build a Docker Image

The PaddlePaddle Docker image needs to support the runtime environment of the `Paddle PServer` and
`Paddle Trainer` processes, so this Docker image has two important features:

- Copy the training data into the container.
- Generate the startup arguments of the `Paddle PServer` and `Paddle Trainer` processes.

The official Docker image `paddlepaddle/paddle:latest` already includes the PaddlePaddle
executable files but not the features above, so we can use the official Docker image as
a base image and add some additional scripts to build a new image.
You can refer to the [Dockerfile](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/usage/cluster/src/k8s_train/Dockerfile).

```bash
$ cd doc/howto/usage/k8s/src/k8s_train
$ docker build -t [YOUR_REPO]/paddle:mypaddle .
```

Then push the new Docker image to a Docker registry:

```bash
docker push [YOUR_REPO]/paddle:mypaddle
```

**NOTE:** In the above command, `[YOUR_REPO]` represents your Docker repository;
you need to replace it with your own repository. We will use `[YOUR_REPO]/paddle:mypaddle` to
represent the Docker image built in this step.

### Prepare Training Data

We can download and split the training data by creating a Kubernetes Job, or customize the image
by editing [k8s_train](./src/k8s_train/README.md).

Before creating a Job, we need to bind a [PersistentVolumeClaim](https://kubernetes.io/docs/user-guide/persistent-volumes) backed by the
appropriate distributed file system; the generated dataset will be saved on this volume.

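The exact PersistentVolume and PersistentVolumeClaim definitions depend on your storage backend,
so the following is only a hedged sketch of a claim named `mfs` (the name referenced by the Job
below); the access mode, size and storage class are assumptions that you should adapt to your cluster:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mfs
spec:
  accessModes:
    - ReadWriteMany          # all training Pods need to mount the same volume
  resources:
    requests:
      storage: 10Gi          # assumed size, adjust to your dataset
  # storageClassName: <your-distributed-fs-storage-class>  # depends on the cluster setup
```

The data-preparation Job itself looks like this:
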
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: paddle-data
spec:
  template:
    metadata:
      name: pi
    spec:
      hostNetwork: true
      containers:
      - name: paddle-data
        image: paddlepaddle/paddle-tutorial:k8s_data
        imagePullPolicy: Always
        volumeMounts:
        - mountPath: "/mnt"
          name: nfs
        env:
        - name: OUT_DIR
          value: /home/work/mfs/paddle-cluster-job
        - name: SPLIT_COUNT
          value: "3"
      volumes:
      - name: nfs
        persistentVolumeClaim:
          claimName: mfs
      restartPolicy: Never
```
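
Assuming the manifest above is saved as `paddle-data-job.yaml` (the file name is only an
assumption for illustration), the data-preparation Job can be created and watched with the
standard Kubernetes CLI:

```bash
# create the data-preparation Job and check whether it has completed
kubectl create -f paddle-data-job.yaml
kubectl get jobs paddle-data
```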

If it succeeds, you should see output like this:

```bash
[root@paddle-kubernetes-node0 nfsdir]$ tree -d
.
`-- paddle-cluster-job
    |-- 0
    |   `-- data
    |-- 1
    |   `-- data
    |-- 2
    |   `-- data
    |-- output
    |-- quick_start
```

The `paddle-cluster-job` above is the job name for this training job; we need 3
PaddlePaddle training nodes and save the split training data under the `paddle-cluster-job` path.
The folders `0`, `1` and `2` represent the `trainer_id` of each node, the `quick_start` folder is used to store training data, and the `output` folder is used to store the models and logs.

### Create a Job

Kubernetes allows users to create objects with YAML files, and we can use the command-line tool
to create them.

The Job YAML file describes which Docker image is used for this training job, how many nodes will be created, the startup arguments of the `Paddle PServer/Trainer` processes and the type of Volumes. You can find the details of the YAML fields in the
[Kubernetes Job API](http://kubernetes.io/docs/api-reference/batch/v1/definitions/#_v1_job).
The following is an example for this training job:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: paddle-cluster-job
spec:
  parallelism: 3
  completions: 3
  template:
    metadata:
      name: paddle-cluster-job
    spec:
      volumes:
      - name: jobpath
        hostPath:
          path: /home/work/mfs
      containers:
      - name: trainer
        image: [YOUR_REPO]/paddle:mypaddle
        command: ["/bin/bash", "-c", "/root/start.sh"]
        env:
        - name: JOB_NAME
          value: paddle-cluster-job
        - name: JOB_PATH
          value: /home/jobpath
        - name: JOB_NAMESPACE
          value: default
        - name: TRAIN_CONFIG_DIR
          value: recommendation
        - name: CONF_PADDLE_NIC
          value: eth0
        - name: CONF_PADDLE_PORT
          value: "7164"
        - name: CONF_PADDLE_PORTS_NUM
          value: "2"
        - name: CONF_PADDLE_PORTS_NUM_SPARSE
          value: "2"
        - name: CONF_PADDLE_GRADIENT_NUM
          value: "3"
        volumeMounts:
        - name: jobpath
          mountPath: /home/jobpath
      restartPolicy: Never
```

In the above YAML file:
- `metadata.name`, the job name.
- `parallelism`, the Kubernetes Job will create `parallelism` Pods at the same time.
- `completions`, the Job reaches the success status only when the number of successful Pods (exit code is 0)
  is equal to `completions`.
- `volumeMounts`, the `name` field `jobpath` is a key, and the `mountPath` field represents
  the path in the container; we define `jobpath` in the `volumes` field and use `hostPath`
  to configure the host path we want to mount.
- `env`, the environment variables in the container; we pass some startup arguments by
  this approach, some details are as follows:
  - JOB_PATH: the mount path in the container.
  - JOB_NAME: the job name.
  - TRAIN_CONFIG_DIR: the job path in the container; we can find the training data path by
    combining it with JOB_NAME.
  - CONF_PADDLE_NIC: the argument `--nics` of the `Paddle PServer` process, the network
    device name.
  - CONF_PADDLE_PORT: the argument `--port` of the `Paddle PServer` process.
  - CONF_PADDLE_PORTS_NUM: the argument `--ports_num` of `Paddle PServer`, the number of ports
    used for dense parameter updates.
  - CONF_PADDLE_PORTS_NUM_SPARSE: the argument `--ports_num_for_sparse` of `Paddle PServer`,
    the number of ports used for sparse parameter updates.
  - CONF_PADDLE_GRADIENT_NUM: the number of training nodes, the argument
    `--num_gradient_servers` of `Paddle PServer` and `Paddle Trainer`.

You can find more details [here](http://www.paddlepaddle.org/docs/develop/documentation/zh/howto/usage/cmd_parameter/detail_introduction_cn.html).

Once the YAML file is ready, we can use the Kubernetes command-line tool to create the Job:

```bash
kubectl create -f job.yaml
```

Upon successful creation, Kubernetes will create 3 Pods as PaddlePaddle training nodes,
pull the Docker image and begin training.
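
To verify that the Pods have been scheduled, you can list them by the `job-name` label that the
Kubernetes Job controller attaches to its Pods (the command below is illustrative; the exact Pod
names will differ on your cluster):

```bash
kubectl get pods -l job-name=paddle-cluster-job
```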

### Check the Output

During the training process, we can check the logs and the output models; for example, we store
the output in the `output` folder. **NOTE:** `node_0`, `node_1` and `node_2` represent the
`trainer_id` of the PaddlePaddle training job rather than the node id of Kubernetes.

```bash
[root@paddle-kubernetes-node0 output]# tree -d
.
├── node_0
│   ├── server.log
│   └── train.log
├── node_1
│   ├── server.log
│   └── train.log
├── node_2
......
├── pass-00002
│   ├── done
│   ├── ___embedding_0__.w0
│   ├── ___embedding_1__.w0
......
```

We can check the status of each training Pod by viewing its logs:

```bash
[root@paddle-kubernetes-node0 node_0]# cat train.log
I1116 09:10:17.123121    50 Util.cpp:155] commandline:
 /usr/local/bin/../opt/paddle/bin/paddle_trainer
    --nics=eth0 --port=7164
    --ports_num=2 --comment=paddle_process_by_paddle
    --pservers=192.168.129.66,192.168.223.143,192.168.129.71
    --ports_num_for_sparse=2 --config=./trainer_config.py
    --trainer_count=4 --num_passes=10 --use_gpu=0
    --log_period=50 --dot_period=10 --saving_period=1
    --local=0 --trainer_id=0
    --save_dir=/home/jobpath/paddle-cluster-job/output
I1116 09:10:17.123440    50 Util.cpp:130] Calling runInitFunctions
I1116 09:10:17.123764    50 Util.cpp:143] Call runInitFunctions done.
[WARNING 2016-11-16 09:10:17,227 default_decorators.py:40] please use keyword arguments in paddle config.
[INFO 2016-11-16 09:10:17,239 networks.py:1282] The input order is [movie_id, title, genres, user_id, gender, age, occupation, rating]
[INFO 2016-11-16 09:10:17,239 networks.py:1289] The output order is [__square_error_cost_0__]
I1116 09:10:17.392917    50 Trainer.cpp:170] trainer mode: Normal
I1116 09:10:17.613910    50 PyDataProvider2.cpp:257] loading dataprovider dataprovider::process
I1116 09:10:17.680917    50 PyDataProvider2.cpp:257] loading dataprovider dataprovider::process
I1116 09:10:17.681543    50 GradientMachine.cpp:134] Initing parameters..
I1116 09:10:18.012390    50 GradientMachine.cpp:141] Init parameters done.
I1116 09:10:18.018641    50 ParameterClient2.cpp:122] pserver 0 192.168.129.66:7164
I1116 09:10:18.018950    50 ParameterClient2.cpp:122] pserver 1 192.168.129.66:7165
I1116 09:10:18.019069    50 ParameterClient2.cpp:122] pserver 2 192.168.223.143:7164
I1116 09:10:18.019492    50 ParameterClient2.cpp:122] pserver 3 192.168.223.143:7165
I1116 09:10:18.019716    50 ParameterClient2.cpp:122] pserver 4 192.168.129.71:7164
I1116 09:10:18.019836    50 ParameterClient2.cpp:122] pserver 5 192.168.129.71:7165
```

## Some Additional Details

### Using Environment Variables

Usually we use environment variables to configure the PaddlePaddle job running on
Kubernetes. `start_paddle.py` provides a startup script to convert the environment variables
into the startup arguments of the PaddlePaddle processes:

```python
import os

API = "/api/v1/namespaces/"
JOBSELECTOR = "labelSelector=job-name="
JOB_PATH = os.getenv("JOB_PATH") + "/" + os.getenv("JOB_NAME")
JOB_PATH_OUTPUT = JOB_PATH + "/output"
JOBNAME = os.getenv("JOB_NAME")
NAMESPACE = os.getenv("JOB_NAMESPACE")
PADDLE_NIC = os.getenv("CONF_PADDLE_NIC")
PADDLE_PORT = os.getenv("CONF_PADDLE_PORT")
PADDLE_PORTS_NUM = os.getenv("CONF_PADDLE_PORTS_NUM")
PADDLE_PORTS_NUM_SPARSE = os.getenv("CONF_PADDLE_PORTS_NUM_SPARSE")
PADDLE_SERVER_NUM = os.getenv("CONF_PADDLE_GRADIENT_NUM")
```

### Communication between Pods

At the beginning of `start_paddle.py`, it initializes and parses the command-line arguments.

```python
parser = argparse.ArgumentParser(prog="start_paddle.py",
                                 description='simple tool for k8s')
args, train_args_list = parser.parse_known_args()
train_args = refine_unknown_args(train_args_list)
train_args_dict = dict(zip(train_args[:-1:2], train_args[1::2]))
podlist = getPodList()
```

Then it queries the status of all the other Pods of this Job through the function `getPodList()`, and fetches the `trainer_id` with the function `getIdMap(podlist)` once all the Pods are in the `RUNNING` state.

```python
    podlist = getPodList()
    # need to wait until all pods are running
    while not isPodAllRunning(podlist):
        time.sleep(10)
        podlist = getPodList()
    idMap = getIdMap(podlist)
```

**NOTE**: `getPodList()` fetches all the Pods in the current namespace, so if some unrelated Pods are running there, it may cause errors. We will use [StatefulSets](https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets) instead of
bare Kubernetes Pods or ReplicaSets in the future.
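
For reference, the following is a minimal sketch of how such a Pod query could be implemented
against the Kubernetes REST API. It is an illustration under assumptions (in-cluster
service-account token, the `requests` library, and the standard `KUBERNETES_SERVICE_*`
environment variables), not the exact code of `start_paddle.py`:

```python
import os
import requests


def getPodList():
    # The in-cluster API server address is exposed through standard env variables.
    host = os.getenv("KUBERNETES_SERVICE_HOST")
    port = os.getenv("KUBERNETES_SERVICE_PORT", "443")
    namespace = os.getenv("JOB_NAMESPACE", "default")
    jobname = os.getenv("JOB_NAME")
    # Every Pod gets a service-account token it can use to talk to the API server.
    with open("/var/run/secrets/kubernetes.io/serviceaccount/token") as f:
        token = f.read().strip()
    url = "https://%s:%s/api/v1/namespaces/%s/pods?labelSelector=job-name=%s" % (
        host, port, namespace, jobname)
    resp = requests.get(url,
                        headers={"Authorization": "Bearer " + token},
                        verify="/var/run/secrets/kubernetes.io/serviceaccount/ca.crt")
    return resp.json()
```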

As for the implementation of `getIdMap(podlist)`, this function fetches the IP address of each Pod in
`podlist` and then sorts the addresses to generate the `trainer_id`s.

```python
def getIdMap(podlist):
    '''
    generate trainer_id by ip
    '''
    ips = []
    for pod in podlist["items"]:
        ips.append(pod["status"]["podIP"])
    ips.sort()
    idMap = {}
    for i in range(len(ips)):
        idMap[ips[i]] = i
    return idMap
```
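
For example, with the three Pod IPs that appear in the training log above, the sorted assignment
would be:

```python
>>> getIdMap({"items": [{"status": {"podIP": "192.168.129.71"}},
...                     {"status": {"podIP": "192.168.129.66"}},
...                     {"status": {"podIP": "192.168.223.143"}}]})
{'192.168.129.66': 0, '192.168.129.71': 1, '192.168.223.143': 2}
```

Note that the IPs are compared as strings, so the mapping is deterministic but not necessarily in
numeric order; determinism is all that is needed to assign distinct `trainer_id`s.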

After getting the `idMap`, we can generate the arguments of `Paddle PServer` and `Paddle Trainer`,
and then start them up with `startPaddle(idMap, train_args_dict)`.

### Create Job

The main goal of `startPaddle` is to generate the arguments of the `Paddle PServer` and `Paddle Trainer` processes. Taking `Paddle Trainer` as an example, we parse the environment variables to get
`PADDLE_NIC`, `PADDLE_PORT`, `PADDLE_PORTS_NUM`, etc., and finally look up the `trainerId` in
`idMap` according to the Pod's own IP address.

```python
    program = 'paddle train'
    args = " --nics=" + PADDLE_NIC
    args += " --port=" + str(PADDLE_PORT)
    args += " --ports_num=" + str(PADDLE_PORTS_NUM)
    args += " --comment=" + "paddle_process_by_paddle"
    ip_string = ""
    for ip in idMap.keys():
        ip_string += (ip + ",")
    ip_string = ip_string.rstrip(",")
    args += " --pservers=" + ip_string
    args_ext = ""
    for key, value in train_args_dict.items():
        args_ext += (' --' + key + '=' + value)
    localIP = socket.gethostbyname(socket.gethostname())
    trainerId = idMap[localIP]
    args += " " + args_ext + " --trainer_id=" + \
        str(trainerId) + " --save_dir=" + JOB_PATH_OUTPUT
```
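
The `Paddle PServer` arguments are assembled in the same way. The snippet below is only a hedged
sketch of what that part could look like, using the pserver flags listed in the YAML explanation
above; the real `startPaddle()` may differ in detail:

```python
    # Hypothetical sketch: build the pserver command line from the same
    # environment-derived constants (the PADDLE_* names are defined earlier in the script).
    pserver_program = 'paddle pserver'
    pserver_args = " --nics=" + PADDLE_NIC
    pserver_args += " --port=" + str(PADDLE_PORT)
    pserver_args += " --ports_num=" + str(PADDLE_PORTS_NUM)
    pserver_args += " --ports_num_for_sparse=" + str(PADDLE_PORTS_NUM_SPARSE)
    pserver_args += " --num_gradient_servers=" + str(PADDLE_SERVER_NUM)
    pserver_args += " --comment=" + "paddle_process_by_paddle"
```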