Commit 1305c68

Author: Helin Wang
Commit message: fix according to comments
1 parent 0084edd commit 1305c68

1 file changed, +6 -6 lines changed

doc/design/dist/README.md

Lines changed: 6 additions & 6 deletions
@@ -4,12 +4,12 @@
 We want Paddle to support training on a general-purpose cluster. The same cluster runs Paddle, the Web server (e.g., Nginx), the log collector (e.g., fluentd), the distributed queue service (e.g., Kafka), and the log joiner and other data processors written using Storm, Spark, and Hadoop MapReduce, as illustrated in the following graph:

-![general purpose cluster](src/arch.png)
+<img src="src/arch.png"/>

 This poses new challenges for Paddle:

 - Paddle needs to be fault tolerant.
-- Input training data can be online data from real time logs or batched data from distributed file system.
+- Input training data can be online data from real time logs or batch data from distributed file system.
 - Users need a simple way to train models on Paddle cloud. Complexities such as job scheduling should be hidden from users.

 ## Training Job
@@ -22,7 +22,7 @@ A training job will be created once user asks Paddle cloud to train a model. The
 A training job has exactly one master process and typically multiple trainer processes and parameter server processes. Their relationship is illustrated in the following graph:

-![process collabration](src/paddle-on-kubernetes-invited-blog-model-sharding.png)
+<img src="src/paddle-on-kubernetes-invited-blog-model-sharding.png"/>

 ### Master Process

@@ -38,15 +38,15 @@ Master process will:
 The master process has three task queues to track training progress, as shown in the graph below:

-![task queues](src/paddle-task-queues.png)
+<img src="src/paddle-task-queues.png"/>

-- The todo queue holds tasks to be dispatched.
+- The todo queue holds tasks to be dispatched. When a job starts, the master process fills in the todo queue with all tasks.
 - The pending queue holds tasks that are currently being trained by trainers, and a mapping from trainers to their training tasks.
 - The done queue holds tasks that are already trained.

 A dataset will be sharded into tasks and dispatched by the master process. The life cycle of a single task is illustrated below:

-![task states](src/paddle-task-states.png)
+<img src="src/paddle-task-states.png"/>

 1. When a new pass of training starts, all tasks will be placed in the todo queue.
 1. The master process will dispatch a few tasks to each trainer at a time, put them in the pending queue, and wait for completion.
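
The queue handling described in this hunk can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Paddle's implementation: the `Master` class, its method names, the `batch_size` of 2 standing in for "a few tasks", and the shard and trainer names are all hypothetical.

```python
from collections import deque


class Master:
    """Toy model of the master's three task queues (all names are hypothetical)."""

    def __init__(self, tasks):
        self.all_tasks = list(tasks)
        self.todo = deque()   # tasks waiting to be dispatched
        self.pending = {}     # trainer id -> tasks that trainer is training
        self.done = []        # tasks that have already been trained

    def start_pass(self):
        # When a new pass starts, every task is placed in the todo queue.
        self.todo = deque(self.all_tasks)
        self.pending.clear()
        self.done.clear()

    def dispatch(self, trainer, batch_size=2):
        # Hand a few tasks to one trainer and record them as pending.
        # batch_size=2 is an assumed value; the design only says "a few".
        batch = []
        while self.todo and len(batch) < batch_size:
            batch.append(self.todo.popleft())
        if batch:
            self.pending.setdefault(trainer, []).extend(batch)
        return batch

    def complete(self, trainer, task):
        # A finished task leaves the pending queue and joins the done queue.
        self.pending[trainer].remove(task)
        if not self.pending[trainer]:
            del self.pending[trainer]
        self.done.append(task)


master = Master(["shard-0", "shard-1", "shard-2"])
master.start_pass()
first = master.dispatch("trainer-a")
master.complete("trainer-a", first[0])
```

Keeping the pending queue as a mapping from trainer to tasks mirrors the description of the pending queue above: the master can look up which tasks a given trainer currently holds.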
