@@ -29,8 +29,12 @@ with Keras API, train the model distributedly with a command line.

``` bash
elasticdl train \
-  --model_def=mnist_functional_api.custom_model \
-  --training_data=/mnist/train --output=output
+  --image_name=elasticdl:mnist \
+  --model_zoo=model_zoo \
+  --model_def=mnist_functional_api.mnist_functional_api.custom_model \
+  --training_data=/data/mnist/train \
+  --job_name=test-mnist \
+  --volume="host_path=/data,mount_path=/data"
```

### Integration with SQLFlow
@@ -56,9 +60,9 @@ computing job would fail; however, we can restart the job and recover its status
from the most recent checkpoint files.

ElasticDL, as an enhancement of TensorFlow's distributed training feature,
-supports fault-tolerance. In the case that some processes fail, the job would go
-on running. Therefore, ElasticDL doesn't need to checkpoint nor recover from
-checkpoints.
+supports fault-tolerance. If some processes fail, the job keeps
+running. Therefore, ElasticDL doesn't need to save checkpoints or recover
+from checkpoints.

The feature of fault-tolerance makes ElasticDL work with the priority-based
preemption of Kubernetes to achieve elastic scheduling. When Kubernetes kills
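For contrast, the checkpoint-restart pattern that the paragraph above says ElasticDL avoids can be sketched in plain Python. This is a hypothetical minimal sketch, not ElasticDL or TensorFlow code; the helper names, the pickle-based format, and the single-file layout are assumptions made only for illustration:

```python
# Hypothetical sketch of the conventional checkpoint-restart pattern:
# a job periodically saves its state, and a restarted process resumes
# from the most recent checkpoint instead of from step 0.
import os
import pickle
import tempfile

def save_checkpoint(path, step, state):
    # Write to a temp file and rename atomically, so a crash mid-write
    # never leaves a corrupt checkpoint behind.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)

def restore_checkpoint(path):
    # Return (step, state); start from scratch if no checkpoint exists.
    if not os.path.exists(path):
        return 0, {}
    with open(path, "rb") as f:
        ckpt = pickle.load(f)
    return ckpt["step"], ckpt["state"]

def train(path, total_steps=100, ckpt_every=10):
    # On (re)start, recover progress from the latest checkpoint.
    step, state = restore_checkpoint(path)
    while step < total_steps:
        state["loss"] = 1.0 / (step + 1)  # stand-in for a real training step
        step += 1
        if step % ckpt_every == 0:
            save_checkpoint(path, step, state)
    return step, state
```

Because ElasticDL's fault tolerance keeps the job running when individual processes die, this save/restore cycle is unnecessary there; the sketch only shows what a restart-based recovery scheme has to do.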