Commit 52f8528

Tutorial update: how to add cluster-specific information (#2133)

* Add how to add cluster-specific information doc
* fix format
* fix format
* revise
* revise

1 parent: ef3bc0d

1 file changed: 91 additions & 50 deletions

# ElasticDL on On-prem Cluster

## Environment Preparation

In order to find and access the on-premise cluster, ElasticDL needs a
[kubeconfig file](https://kubernetes.io/docs/tasks/access-application-cluster/configure-access-multiple-clusters),
which is located at `~/.kube/config` by default.
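
Before going further, it helps to confirm that the kubeconfig actually points
at the target cluster. This is a minimal sanity check, assuming `kubectl` is
installed and reads the same `~/.kube/config`:

```bash
# Show which cluster and user the current kubeconfig context refers to.
kubectl config current-context

# Verify that the API server is reachable and that we are authorized.
kubectl get nodes
```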

We also need to install the ElasticDL client.

```bash
pip install elasticdl-client
```
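
To confirm the client is available, a quick check (assuming the install went
into the Python environment that is currently active):

```bash
# The package metadata should be listed after a successful installation.
pip show elasticdl-client

# The client provides the `elasticdl` command-line entry point used below.
elasticdl --help
```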

## Submit Job to Cluster

The job submission steps are similar to the public cloud mode. Please refer
to the [Submit Job](elasticdl_cloud.md#submit-job-to-the-kubernetes-cluster)
section in the [ElasticDL on Public Cloud tutorial](elasticdl_cloud.md) for
details. The difference is that we are not restricted to the Google Cloud
image registry, so we can push the image to any remote Docker registry that
the on-premise cluster can access.

```bash
export DOCKER_HUB_REPO=reg.docker.com/user

cd ${CODE_PATH}/elasticdl/model_zoo

elasticdl zoo init

elasticdl zoo build --image=${DOCKER_HUB_REPO}/elasticdl:mnist .

elasticdl zoo push ${DOCKER_HUB_REPO}/elasticdl:mnist
```
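
Optionally, we can double-check that the pushed image is visible in the
registry before launching the job. A small sketch, assuming Docker is
installed and logged in to `reg.docker.com`:

```bash
# Pull the image back to verify the push succeeded and the tag is correct.
docker pull ${DOCKER_HUB_REPO}/elasticdl:mnist
```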

We launch a training job with 2 PS pods and 4 worker pods.

```bash
elasticdl train \
  --image_name=${DOCKER_HUB_REPO}/elasticdl:mnist \
  --model_zoo=model_zoo \
  --model_def=mnist_functional_api.mnist_functional_api.custom_model \
  --training_data=/data/mnist/train \
  --validation_data=/data/mnist/test \
  --num_epochs=5 \
  --master_resource_request="cpu=2,memory=2048Mi" \
  --master_resource_limit="cpu=2,memory=2048Mi" \
  --master_pod_priority=high \
  --worker_resource_request="cpu=2,memory=2048Mi" \
  --worker_resource_limit="cpu=2,memory=2048Mi" \
  --worker_pod_priority=low \
  --ps_resource_request="cpu=2,memory=2048Mi" \
  --ps_resource_limit="cpu=2,memory=2048Mi" \
  --ps_pod_priority=high \
  --minibatch_size=64 \
  --num_minibatches_per_task=64 \
  --num_ps_pods=2 \
  --num_workers=4 \
  --evaluation_steps=200 \
  --grads_to_wait=1 \
  --job_name=test-mnist \
  --log_level=INFO \
  --image_pull_policy=Always \
  --volume="mount_path=/data,claim_name=fileserver-claim" \
  --distribution_strategy=ParameterServerStrategy
```
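
After the command returns, we can watch the job's pods with `kubectl`. A
minimal sketch, assuming the job runs in the current namespace; the exact pod
names are generated by ElasticDL from the job name:

```bash
# List the master, PS, and worker pods that belong to the job.
kubectl get pods | grep test-mnist

# Follow the master's log; substitute the master pod name printed above.
kubectl logs -f <master-pod-name>
```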

## Add Cluster-Specific Information

If the on-premise cluster is a tailored version of Kubernetes that requires
additional labels, or if we need to add tolerations or node affinity to the
job's pods, we can pass an additional argument `--cluster_spec spec.py` to
the command line above. We define a class instance named `cluster` in the
`spec.py` file. The class must provide two functions, `with_pod` and
`with_service`, which add the extra specifications to pods and services.

Below is an example of `spec.py`.

```python
from kubernetes import client

class MyCluster:
    def __init__(self):
        self._pool = "elasticdl"
        self._app_name = "elasticdl"

    # Add pod specifications
    def with_pod(self, pod):
        # Add a label
        pod.metadata.labels["my_app"] = self._app_name

        # Add tolerations
        tolerations = [
            client.V1Toleration(
                effect="NoSchedule",
                key="mycluster.com/app-pool",
                operator="Equal",
                value=self._pool,
            ),
        ]
        pod.spec.tolerations = tolerations
        return pod

    # Add service specifications
    def with_service(self, service):
        # Use a headless ClusterIP service
        service.spec.type = "ClusterIP"
        service.spec.cluster_ip = "None"
        return service

cluster = MyCluster()
```
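
To apply these customizations, we simply add the flag to the `elasticdl train`
command shown earlier; the other flags stay exactly as before (most are elided
here for brevity):

```bash
elasticdl train \
  --image_name=${DOCKER_HUB_REPO}/elasticdl:mnist \
  --model_zoo=model_zoo \
  --model_def=mnist_functional_api.mnist_functional_api.custom_model \
  --distribution_strategy=ParameterServerStrategy \
  --cluster_spec spec.py
  # ...remaining flags identical to the full command above
```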
