[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![PyPI Status Badge](https://badge.fury.io/py/elasticdl-client.svg)](https://pypi.org/project/elasticdl-client/)

- ElasticDL is a Kubernetes-native deep learning framework built on top of
- TensorFlow 2.0 that supports fault-tolerance and elastic scheduling.
+ ElasticDL is a Kubernetes-native deep learning framework
+ that supports fault-tolerance and elastic scheduling.

## Main Features

@@ -16,11 +16,11 @@ Through Kubernetes-native design, ElasticDL enables fault-tolerance and works
with the priority-based preemption of Kubernetes to achieve elastic scheduling
for deep learning tasks.
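
Elastic scheduling leans on plain Kubernetes priority and preemption: pods of a
higher-priority job can preempt pods of a lower-priority one. Below is a rough
sketch of how such priorities are expressed with the official Kubernetes Python
client; the class name, priority value, and image are placeholders for
illustration, not ElasticDL defaults.

```python
from kubernetes import client, config

config.load_kube_config()

# Hypothetical priority class for production training jobs; pods that use it
# may preempt pods of lower-priority (e.g. experimental) jobs.
high = client.V1PriorityClass(
    metadata=client.V1ObjectMeta(name="training-high"),
    value=100000,
    global_default=False,
    description="High-priority deep learning jobs",
)
client.SchedulingV1Api().create_priority_class(high)

# A pod spec then opts into the class by name (placeholder image).
pod_spec = client.V1PodSpec(
    priority_class_name="training-high",
    containers=[client.V1Container(name="worker", image="elasticdl:ci")],
)
```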

- ### TensorFlow 2.0 Eager Execution
+ ### Support TensorFlow and PyTorch

- A distributed deep learning framework needs to know local gradients before the
- model update. Eager Execution allows ElasticDL to do it without hacking into the
- graph execution process.
+ - TensorFlow Estimator.
+ - TensorFlow Keras.
+ - PyTorch.
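
As a concrete illustration of the Keras path, here is a minimal sketch of a
model definition in the style a model zoo entry might use. The function names
`custom_model`, `loss`, and `optimizer` are assumptions for illustration; the
tutorials linked under Quick Start show the exact interface ElasticDL expects.

```python
import tensorflow as tf


def custom_model():
    # A small fully-connected classifier for MNIST-sized inputs
    # (hypothetical example; real model zoo entries may differ).
    return tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])


def loss(labels, predictions):
    # Standard sparse cross-entropy on logits.
    return tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(
            labels, predictions, from_logits=True
        )
    )


def optimizer(lr=0.1):
    return tf.keras.optimizers.SGD(learning_rate=lr)
```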
### Minimalist Interface

@@ -37,30 +37,27 @@ elasticdl train \
--volume="host_path=/data,mount_path=/data"
```
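
The `--volume="host_path=/data,mount_path=/data"` flag above corresponds to a
hostPath volume plus a volume mount in the worker pod spec. The snippet below
sketches that mapping with the official Kubernetes Python client; it
illustrates the concept only and is not ElasticDL's actual submission code (the
container name and image are placeholders).

```python
from kubernetes import client

# Hypothetical translation of host_path=/data,mount_path=/data.
volume = client.V1Volume(
    name="data",
    host_path=client.V1HostPathVolumeSource(path="/data"),
)
mount = client.V1VolumeMount(name="data", mount_path="/data")

container = client.V1Container(
    name="worker",
    image="elasticdl:ci",
    volume_mounts=[mount],
)
pod_spec = client.V1PodSpec(containers=[container], volumes=[volume])
```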

- ### Integration with SQLFlow
-
- ElasticDL will be integrated seamlessly with SQLFlow to connect SQL to
- distributed deep learning tasks with ElasticDL.
-
- ```sql
- SELECT * FROM employee LABEL income INTO my_elasticdl_model
- ```
-
## Quick Start
Please check out our [step-by-step tutorial](docs/tutorials/get_started.md) for
running ElasticDL on a local laptop, an on-prem cluster, or a public cloud such
as Google Kubernetes Engine.

+ - [TensorFlow Estimator on Minikube](docs/tutorials/elasticdl_estimator.md)
+ - [TensorFlow Keras on Minikube](docs/tutorials/elasticdl_local.md)
+ - [PyTorch on Minikube](docs/tutorials/elasticdl_torch.md)

## Background

- TensorFlow has its native distributed computing feature that is
+ TensorFlow and PyTorch have native distributed computing features that are
fault-recoverable. In the case that some processes fail, the distributed
computing job would fail; however, we can restart the job and recover its status
from the most recent checkpoint files.

- ElasticDL, as an enhancement of TensorFlow's distributed training feature,
- supports fault-tolerance. In the case that some processes fail, the job would
+ ElasticDL supports fault-tolerance during distributed training.
+ In the case that some processes fail, the job would
go on running. Therefore, ElasticDL doesn't need to save checkpoints or recover
from checkpoints.
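
For contrast, the checkpoint-based recovery described above usually looks like
the following in TensorFlow: the job periodically saves checkpoints and, after
a restart, restores from the latest one. This is a generic sketch of the
standard tf.train.Checkpoint idiom, not ElasticDL code.

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

# Track model and optimizer state and keep the most recent checkpoints.
ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(ckpt, directory="/tmp/ckpt", max_to_keep=3)

# On (re)start, recover from the latest checkpoint if one exists.
ckpt.restore(manager.latest_checkpoint)

# During training, save periodically so a restarted job can resume.
for step in range(1000):
    # ... run one training step here ...
    if step % 100 == 0:
        manager.save(checkpoint_number=step)
```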

@@ -80,11 +77,11 @@ first job completes. In this case, the overall utilization is 100%.

The feature of elastic scheduling of ElasticDL comes from its Kubernetes-native
design -- it doesn't rely on Kubernetes extensions like Kubeflow to run
- TensorFlow programs; instead, the master process of an ElasticDL job calls
+ TensorFlow/PyTorch programs; instead, the master process of an ElasticDL job calls
Kubernetes API to start workers and parameter servers; it also watches events
like process/pod killing and reacts to such events to realize fault-tolerance.
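
As a sketch of the watch-and-react pattern described here, the following uses
the official Kubernetes Python client to stream pod events and respond to a
killed worker. It only illustrates the idea; the label selector and the
reaction are hypothetical, not ElasticDL's master implementation.

```python
from kubernetes import client, config, watch

# Use load_incluster_config() instead if the watcher runs inside the cluster.
config.load_kube_config()
v1 = client.CoreV1Api()

w = watch.Watch()
# Watch pods of one (hypothetical) ElasticDL job via a label selector.
for event in w.stream(v1.list_namespaced_pod,
                      namespace="default",
                      label_selector="elasticdl-job-name=mnist-train"):
    pod = event["object"]
    if event["type"] == "DELETED" or pod.status.phase == "Failed":
        # React to a killed/preempted worker, e.g. by relaunching it
        # and reassigning its unfinished tasks (details omitted).
        print(f"pod {pod.metadata.name} gone, relaunching a replacement")
```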
- In short, ElasticDL enhances TensorFlow with fault-tolerance and elastic
+ In short, ElasticDL enhances TensorFlow/PyTorch with fault-tolerance and elastic
scheduling when you have a Kubernetes cluster. We provide a tutorial
showing how to set up a Kubernetes cluster on Google Cloud and run ElasticDL
jobs there. We respect TensorFlow's native distributed computing feature, which