
Commit 13a36b2

Kubecon NA call for proposal from ElasticDL (#2135)
* Add the document placeholder for kubecon cfp.
* Add some content.
* Do some rephrase
* Add more content for the scheduling policies.
* Add more content.
* Do some rephrase
* Do some rephrase.
* Add more content in the benefit section.
* Add more content in the benefit part.
* Do some rephrase
* Do some rephrase.
* Do some rephrase
* Refactor the content according to the comments.
* Do some rephrase
* Fix grammar errors.
1 parent 69a618b commit 13a36b2


docs/blogs/kubecon_cfp.md

Lines changed: 59 additions & 0 deletions
# ElasticDL: Kubernetes-Native Distributed Deep Learning with Elastic Scheduling

## Description

In addition to online services, users have been running distributed AI jobs on
Kubernetes. Related projects add Kubernetes operators that start distributed deep
learning programs built on TensorFlow or PyTorch. In these solutions, Kubernetes
plays the role of launching pods and restarting preempted ones. However, such
retries often fail for the same reason that caused the preemption -- the lack of
resources. If a job cannot maintain a constant number of worker pods, it fails;
insisting on maintaining that constant number turns the job into gang scheduling.
Either case leads to low utilization of the cluster.

ElasticDL resolves this dilemma by drawing on the statistical properties of
distributed learning theory to make jobs tolerant of a varying number of workers.
On this basis, ElasticDL realizes elastic scheduling by introducing a master pod
per job, in place of a Kubernetes operator per cluster. It makes full use of
residual resources and improves cluster utilization significantly.

## Benefits to the Ecosystem

ElasticDL boosts cluster utilization to up to 90% on on-premise clusters at Ant
Financial and on Google Cloud, as it makes full use of residual resources to run
deep learning jobs with elastic scheduling. Moreover, it enables deep learning
jobs to run at a lower priority than online services on the same cluster.
It senses and uses the resources left over by online services.

The master pod is central to elastic scheduling; it plays three roles.

1. The master dynamically partitions the training data so as to decouple the
   number of partitions from the number of workers (see the sketch after this
   list).
2. The master also works with Kubernetes to watch cluster utilization and wait
   for a good chance to restart failed workers. Until that chance arrives, the
   master keeps the job running on the surviving workers.
3. The master starts and monitors parameter servers when training large models
   with the asynchronous SGD algorithm, and it coordinates workers to implement
   a Kubernetes-native, fault-tolerant AllReduce operation for the synchronous
   SGD counterpart.
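
To make the first role concrete, here is a minimal sketch of how a master could
hand out data shards as tasks, so that the number of partitions stays decoupled
from the number of workers and a preempted worker's unfinished shard simply goes
back into the queue. The class and method names are hypothetical illustrations,
not ElasticDL's actual API.

```python
import queue
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Task:
    """One shard of the training data, identified by record offsets."""
    start: int
    end: int


class DynamicDataPartitioner:
    """Hypothetical master-side dispatcher: it cuts the dataset into
    fixed-size shards and hands them out to whichever workers are alive,
    so the number of shards never depends on the number of workers."""

    def __init__(self, num_records: int, records_per_task: int):
        self._todo: "queue.Queue[Task]" = queue.Queue()
        for start in range(0, num_records, records_per_task):
            end = min(start + records_per_task, num_records)
            self._todo.put(Task(start, end))
        self._doing: Dict[str, Task] = {}  # worker id -> shard in progress

    def get_task(self, worker_id: str) -> Optional[Task]:
        """Called by a worker (e.g. over RPC) to fetch its next shard."""
        if self._todo.empty():
            return None
        task = self._todo.get()
        self._doing[worker_id] = task
        return task

    def report_done(self, worker_id: str) -> None:
        """The worker finished its shard; drop the bookkeeping entry."""
        self._doing.pop(worker_id, None)

    def recover(self, worker_id: str) -> None:
        """The worker was preempted; return its unfinished shard to the
        queue so a surviving worker can pick it up later."""
        task = self._doing.pop(worker_id, None)
        if task is not None:
            self._todo.put(task)
```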
Deep learning researchers like ElasticDL because it reduces the pending time of
each job: by making full use of residual resources, it runs as many concurrent
experiments as possible. Deep learning jobs depend on many choices, including
the optimizer, the activation function, the cost function, and the
hyperparameters. Users are eager to see the status of the first few iterations
of a learning job so as to ensure the configuration is mathematically sound, and
ElasticDL meets this need. Using residual resources, ElasticDL also shortens the
total time to run a batch of training jobs.

ElasticDL provides an easy-to-use interface. Users define models with the
TensorFlow 2.x API, much like filling in the map and reduce functions required
by the MapReduce framework, without having to think about distributed
programming. The interface allows users to test their models locally and then
train on big data with ElasticDL without changing their source code.
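
As a rough illustration of what such a model definition might look like (the
function names below are placeholders rather than ElasticDL's confirmed
conventions; the project's model zoo defines the real ones), the user writes
ordinary TensorFlow 2.x Keras code plus a loss and an optimizer, and nothing in
it refers to distributed execution:

```python
import tensorflow as tf


def custom_model():
    """Plain Keras model definition -- no distributed constructs."""
    inputs = tf.keras.Input(shape=(28, 28), name="image")
    x = tf.keras.layers.Flatten()(inputs)
    x = tf.keras.layers.Dense(128, activation="relu")(x)
    outputs = tf.keras.layers.Dense(10)(x)
    return tf.keras.Model(inputs=inputs, outputs=outputs)


def loss(labels, predictions):
    """Loss evaluated by the framework on each minibatch."""
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=tf.reshape(labels, [-1]), logits=predictions
        )
    )


def optimizer(lr=0.01):
    """Optimizer used for both local tests and cluster runs."""
    return tf.optimizers.SGD(lr)
```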
## Open Source Projects
[ElasticDL](https://github.com/sql-machine-learning/elasticdl)
## Speakers
