Commit c2c0ae5

polish markdown file format (#2045)

* Rename cli.md to client.md
* Polish the format of overall.md
1 parent 658a79d commit c2c0ae5

File tree

2 files changed: +24 −9 lines

cli.md → client.md: file renamed without changes.

docs/designs/overall.md

Lines changed: 24 additions & 9 deletions
@@ -4,28 +4,43 @@

![architecture](../images/architecture.png)

ElasticDL uses the master-worker architecture. The master node plays the master
role in two aspects.

1. It's the master of the cluster. It manages the lifecycle of all the worker
   pods: it starts the worker pods, listens to pod events, and relaunches
   terminated worker pods if necessary.
1. It's the master of the model training/evaluation/prediction process. It
   partitions data into shards, generates and dispatches tasks to workers, and
   coordinates all the nodes to complete the training/evaluation/prediction
   job. (See the [distributed training](#Distributed-Training) section for
   details.)

The ElasticDL client is simple, just like a CLI command. The user enters an
ElasticDL command in the terminal to start a training/evaluation/prediction
job. The client parses the parameters, builds the Docker image that packages
the ElasticDL framework and the model code, pushes the image to the hub, and
then sends a request to the Kubernetes ApiServer to create the master pod.
After the master pod is created and started, it creates the other components
and drives the entire job.
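The client flow described above can be sketched as composing a few shell
commands. This is a minimal illustration, not the actual ElasticDL client: the
helper name, image tag, registry, and flags below are all hypothetical stand-ins
for what the real client derives from its parsed parameters.

```python
# Hypothetical sketch of the client-side flow: build, push, create master pod.

def build_client_commands(model_dir, image_name, registry, master_args):
    """Return the shell commands such a client would run, in order."""
    image = f"{registry}/{image_name}"
    return [
        # 1. Package the ElasticDL framework and the model code into an image.
        f"docker build -t {image} {model_dir}",
        # 2. Push the image so that Kubernetes nodes can pull it.
        f"docker push {image}",
        # 3. Ask the Kubernetes ApiServer to create the master pod; the
        #    master then creates the other components and drives the job.
        f"kubectl run elasticdl-master --image={image} -- {master_args}",
    ]

commands = build_client_commands(
    model_dir="./model_zoo/mnist",
    image_name="elasticdl:dev",
    registry="registry.example.com",
    master_args="--job_name=mnist-train --num_workers=2",
)
for cmd in commands:
    print(cmd)
```

In the real system these steps run inside one `elasticdl` command; the sketch
only makes the ordering (build, push, create pod) explicit.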
## Distributed Training

![distributed_training_sequence](../images/distributed_training_sequence.jpg)

Master

* Partition the training/evaluation data into multiple shards. (see
  [dynamic_data_sharding_design](dynamic_data_sharding.md))
* Generate the training/evaluation tasks from the data shards.
* Dispatch these tasks to different workers.
* Aggregate the gradients reported from the workers.
* Update the model variables and save the checkpoint if necessary.
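The master-side steps above can be sketched with in-memory stand-ins. The
`make_shards` helper, the `Master` class, and the plain gradient-descent update
are illustrative assumptions, not ElasticDL's actual API, which uses gRPC,
real datasets, and checkpoint files.

```python
# Toy sketch of the master: shard the data, dispatch tasks, aggregate
# gradients, and update the model variables.
from collections import deque

def make_shards(num_records, records_per_shard):
    """Partition record indices [0, num_records) into (start, end) shards."""
    return [(s, min(s + records_per_shard, num_records))
            for s in range(0, num_records, records_per_shard)]

class Master:
    def __init__(self, num_records, records_per_shard, num_params):
        # Each pending task corresponds to one data shard.
        self.todo = deque(make_shards(num_records, records_per_shard))
        self.model = [0.0] * num_params
        self.lr = 0.1

    def get_task(self):
        """Dispatch the next task to a requesting worker (None when done)."""
        return self.todo.popleft() if self.todo else None

    def report_gradient(self, grad):
        """Apply a worker's reported gradient to the model variables."""
        self.model = [w - self.lr * g for w, g in zip(self.model, grad)]

master = Master(num_records=10, records_per_shard=4, num_params=2)
print(master.get_task())  # → (0, 4), the first shard
```

Keeping the task queue on the master is what makes the system elastic: a shard
whose worker dies can simply be re-queued and dispatched to another worker.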

Worker

* Pull the task from the master. The task contains the index of this data
  shard.
* Read the data according to the data index message. (see
  [data_io_pipeline_design](data_io_pipeline.md))
* Run the training process using this data shard.
* Report the calculated gradients and task result to the master.
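The worker steps above can be sketched as a pull-compute-report loop. The
master stub, the in-memory "dataset", and the fake gradient computation are
stand-ins for ElasticDL's gRPC interface, data I/O pipeline, and real training
step.

```python
# Toy worker loop: pull a task, read the shard, "train", report the result.
from collections import deque

DATA = list(range(10))  # pretend dataset, indexed by record position

class MasterStub:
    """Stands in for the master's RPC interface."""
    def __init__(self, shards):
        self.todo = deque(shards)
        self.reports = []

    def get_task(self):
        return self.todo.popleft() if self.todo else None

    def report(self, grad, ok):
        self.reports.append((grad, ok))

def worker_loop(master):
    while True:
        task = master.get_task()        # 1. pull a task (a shard index)
        if task is None:
            break                       # no tasks left: this worker is done
        start, end = task
        batch = DATA[start:end]         # 2. read the data for this shard
        grad = sum(batch) / len(batch)  # 3. "train": a fake scalar gradient
        master.report(grad, ok=True)    # 4. report gradient + task result

master = MasterStub([(0, 5), (5, 10)])
worker_loop(master)
print(master.reports)  # → [(2.0, True), (7.0, True)]
```

Because workers only ever pull tasks and report results, adding or removing a
worker mid-job needs no coordination beyond the master's task queue.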
