# ElasticDL on On-prem Cluster

## Environment Preparation

In order to find and access the on-premise cluster, ElasticDL needs a
[kubeconfig file](https://kubernetes.io/docs/tasks/access-application-cluster/configure-access-multiple-clusters),
which is located at `~/.kube/config` by default.

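If the kubeconfig is kept somewhere other than the default location, the standard `KUBECONFIG` environment variable can point Kubernetes clients at it before a job is submitted. A minimal sketch (the file path below is a made-up example, not part of this tutorial):

```shell
# Point Kubernetes clients at a non-default kubeconfig.
# The path is a hypothetical example.
export KUBECONFIG="$HOME/.kube/onprem-config"
echo "$KUBECONFIG"
```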
We also need to install the ElasticDL client.

```bash
pip install elasticdl-client
```

## Submit Job to Cluster

The job submission steps are similar to public cloud mode. Please refer to
the [Submit Job](elasticdl_cloud.md#submit-job-to-the-kubernetes-cluster)
section in the [ElasticDL on Public Cloud tutorial](elasticdl_cloud.md)
for details. The difference is that we are not restricted to the Google
Cloud registry, so we can push the image to any remote Docker registry that
the on-premise cluster can access.

```bash
export DOCKER_HUB_REPO=reg.docker.com/user

cd ${CODE_PATH}/elasticdl/model_zoo

elasticdl zoo init

elasticdl zoo build --image=${DOCKER_HUB_REPO}/elasticdl:mnist .

elasticdl zoo push ${DOCKER_HUB_REPO}/elasticdl:mnist
```

We launch a training job with 2 PS pods and 4 worker pods.

```bash
elasticdl train \
  --image_name=${DOCKER_HUB_REPO}/elasticdl:mnist \
  --model_zoo=model_zoo \
  --model_def=mnist_functional_api.mnist_functional_api.custom_model \
  --training_data=/data/mnist/train \
  --validation_data=/data/mnist/test \
  --num_epochs=5 \
  --master_resource_request="cpu=2,memory=2048Mi" \
  --master_resource_limit="cpu=2,memory=2048Mi" \
  --master_pod_priority=high \
  --worker_resource_request="cpu=2,memory=2048Mi" \
  --worker_resource_limit="cpu=2,memory=2048Mi" \
  --worker_pod_priority=low \
  --ps_resource_request="cpu=2,memory=2048Mi" \
  --ps_resource_limit="cpu=2,memory=2048Mi" \
  --ps_pod_priority=high \
  --minibatch_size=64 \
  --num_minibatches_per_task=64 \
  --num_ps_pods=2 \
  --num_workers=4 \
  --evaluation_steps=200 \
  --grads_to_wait=1 \
  --job_name=test-mnist \
  --log_level=INFO \
  --image_pull_policy=Always \
  --volume="mount_path=/data,claim_name=fileserver-claim" \
  --distribution_strategy=ParameterServerStrategy
```
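
The `--volume` flag above mounts `/data` from a PersistentVolumeClaim named `fileserver-claim`, which must already exist in the cluster. A claim backing it could look like the following sketch; the access mode and requested size here are assumptions for illustration, not values from this tutorial:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fileserver-claim
spec:
  accessModes:
    - ReadWriteMany        # shared by master, PS, and worker pods
  resources:
    requests:
      storage: 10Gi        # assumed size; adjust for your dataset
```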

## Add Cluster-Specific Information

If the on-premise cluster is a tailored version of Kubernetes that requires
additional labels, or if we need to add tolerations or node affinity to the
job's pods, we can pass the additional argument `--cluster_spec spec.py` to
the command line above. We define a class instance `cluster` in the
`spec.py` file; its class must provide two methods, `with_pod` and
`with_service`, which add the extra specifications to pods and services
respectively.

Below is an example of `spec.py`.

```python
from kubernetes import client


class MyCluster:
    def __init__(self):
        self._pool = "elasticdl"
        self._app_name = "elasticdl"

    # Add pod specifications
    def with_pod(self, pod):
        # Add a label
        pod.metadata.labels["my_app"] = self._app_name

        # Add tolerations
        tolerations = [
            client.V1Toleration(
                effect="NoSchedule",
                key="mycluster.com/app-pool",
                operator="Equal",
                value=self._pool,
            ),
        ]
        pod.spec.tolerations = tolerations
        return pod

    # Add service specifications
    def with_service(self, service):
        # Use a headless ClusterIP service
        service.spec.type = "ClusterIP"
        service.spec.cluster_ip = "None"
        return service


cluster = MyCluster()
```
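
To illustrate how these hooks behave, the sketch below feeds stand-in pod and service objects through `with_pod` and `with_service`. The `SimpleNamespace` stand-ins replace the real `kubernetes` client types, and the driving code only mimics ElasticDL's internal call sequence; it is not the actual implementation.

```python
from types import SimpleNamespace


class MyCluster:
    """Same hooks as the `cluster` instance defined in spec.py."""

    def __init__(self):
        self._app_name = "elasticdl"

    def with_pod(self, pod):
        # Attach the extra label, as in spec.py
        pod.metadata.labels["my_app"] = self._app_name
        return pod

    def with_service(self, service):
        # Make the service headless
        service.spec.type = "ClusterIP"
        service.spec.cluster_ip = "None"
        return service


# Stand-ins for client.V1Pod and client.V1Service (illustration only)
pod = SimpleNamespace(metadata=SimpleNamespace(labels={}), spec=SimpleNamespace())
service = SimpleNamespace(spec=SimpleNamespace(type=None, cluster_ip=None))

cluster = MyCluster()
pod = cluster.with_pod(pod)
service = cluster.with_service(service)

print(pod.metadata.labels)      # {'my_app': 'elasticdl'}
print(service.spec.cluster_ip)  # prints None (the string "None")
```

ElasticDL constructs the real `V1Pod` and `V1Service` objects itself and passes each through these hooks just before creation, so `spec.py` only mutates and returns the objects it receives.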