
Commit 0084edd

Author: Helin Wang
Commit message: fix grammar
1 parent c5b01d0 commit 0084edd


doc/design/dist/README.md

Lines changed: 23 additions & 23 deletions
@@ -2,72 +2,72 @@
 
 ## Objective
 
-We want Paddle to support training on a general-purpose cluster. The cluster runs Paddle, the Web server (e.g., Nginx), the log collector (e.g., fluentd), the distributed queue service (e.g., Kafka), the log joiner and other data processors written using Storm, Spark, and Hadoop MapReduce on the same cluster. As illustrated in the following graph:
+We want Paddle to support training on the general-purpose cluster. The cluster runs Paddle, the Web server (e.g., Nginx), the log collector (e.g., fluentd), the distributed queue service (e.g., Kafka), the log joiner and other data processors written using Storm, Spark, and Hadoop MapReduce on the same cluster. As illustrated in the following graph:
 
 ![general purpose cluster](src/arch.png)
 
 This poses new challenges for Paddle,
 
-- Paddle need to be tault tolerant.
-- Input training data can be online data from realtime logs, or batched data from distributed file system.
-- User needs a simple way to train model on cloud. Complexities such as job scheduling should be hidden from user.
+- Paddle need to be fault tolerant.
+- Input training data can be online data from real time logs or batched data from distributed file system.
+- User needs a simple way to train model on Paddle cloud. Complexities such as job scheduling should be hidden from user.
 
 ## Training Job
 
-A training job will be created once user asks Paddle cloud to train a model. The training job is made up of different processes that collabratively consume data input and produce a trained model. There are three kind of processes:
+A training job will be created once user asks Paddle cloud to train a model. The training job is made up of different processes that collaboratively consume data and produce a trained model. There are three kinds of processes:
 
 - Master process
 - Trainer process
 - Parameter server process
 
-One training job will only have one master process, typicall multiple trainer and parameter server processes. Their relation is illustrated in the following graph:
+One training job will only have one master process, typically multiple trainer processes and parameter server processes. Their relation is illustrated in the following graph:
 
 ![process collabration](src/paddle-on-kubernetes-invited-blog-model-sharding.png)
 
 ### Master Process
 
 Master process will:
 
-- keep a list of alive trainers and a list of alive parameter servers and do *health check*,
-- if trainer is dead it will update task queue accordingly as mentioned in [task queue](#task-queue).
-- if a parameter server is dead or a new parameter server joins, it will broacast this information to all trainers.
-- dispatches tasks to trainers. A *task* is a unit of data that a trainer needs to train on, and
-- keep track of training progress on the dataset with *task queue*. Typically training will iterate on the dataset for a full pass until it goes into next pass.
+- Keep a list of alive trainers and a list of alive parameter servers and do health check.
+- If a trainer is dead it will update the task queue accordingly as mentioned in [task queue](#task-queue).
+- If a parameter server is dead or a new parameter server joins, it will broadcast this information to all trainers.
+- Dispatches tasks to trainers. A *task* is a unit of data that a trainer needs to train on.
+- Keep track of training progress on the dataset with *task queue*. A training job will iterate on the dataset for a full pass until it goes into next pass.
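 
As a rough illustration of the health-check bookkeeping described in the list above, the following Python sketch keeps a heartbeat timestamp per trainer and per parameter server and flags any process that misses a deadline. The names (`HealthTracker`, `heartbeat`, `reap_dead`) and the timeout value are assumptions made for this example; they are not from the design document or the Paddle implementation.

```python
import time

# Minimal sketch, not the actual Paddle master: track heartbeats and treat any
# process whose last heartbeat is older than `timeout` seconds as dead.
class HealthTracker:
    def __init__(self, timeout=10.0):
        self.timeout = timeout
        self.last_seen = {}  # (kind, process_id) -> last heartbeat time

    def heartbeat(self, kind, process_id):
        """Record a heartbeat from a 'trainer' or a 'pserver'."""
        self.last_seen[(kind, process_id)] = time.time()

    def reap_dead(self):
        """Return processes that missed the deadline and forget them."""
        now = time.time()
        dead = [key for key, t in self.last_seen.items() if now - t > self.timeout]
        for key in dead:
            del self.last_seen[key]
        return dead


# Hypothetical usage: on detecting a dead process, the master would requeue a
# dead trainer's tasks, or broadcast a parameter-server membership change.
tracker = HealthTracker(timeout=10.0)
tracker.heartbeat("trainer", 0)
tracker.heartbeat("pserver", 1)
for kind, pid in tracker.reap_dead():
    if kind == "trainer":
        pass  # move this trainer's pending tasks back to the todo queue
    else:
        pass  # broadcast the updated parameter-server list to all trainers
```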
 
 #### Task Queue
 
-Master process have three task queues to track training progress as shown in the graph below:
+Master process has three task queues to track training progress as shown in the graph below:
 
 ![task queues](src/paddle-task-queues.png)
 
-- Todo queue holds tasks to be dispatched.
-- Pending queue holds tasks that are currently training by trainers, and a mapping from trainers to their training tasks.
-- Done queue holds tasks that are already trained.
+- The todo queue holds tasks to be dispatched.
+- The pending queue holds tasks that are currently training by trainers, and a mapping from trainers to their training tasks.
+- the done queue holds tasks that are already trained.
 
 A dataset will be sharded into tasks and dispatched by the master process. The life cycle of a single task is illustrated below:
 
 ![task states](src/paddle-task-states.png)
 
 1. When a new pass of training starts, all tasks will be placed in the todo queue.
-1. The master process will dispatch few tasks to each trainer at a time, puts them in pending queue and waits for completion.
-1. The trainer will work on it's tasks and tell master once a task is completed. The master process will dispatch a new task to that trainer.
+1. The master process will dispatch few tasks to each trainer at a time, puts them in the pending queue and waits for completion.
+1. The trainer will work on it's tasks and tell the master process once a task is completed. The master process will dispatch a new task to that trainer.
 1. If a trainer is dead. the master process will move it's tasks back to the todo queue.
-1. The master will move completed task to the done queue. When todo queue is empty, master will start a new pass by moving all tasks in done queue to todo queue.
+1. The master process will move completed task to the done queue. When the todo queue is empty, the master process will start a new pass by moving all tasks in the done queue to todo queue.
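 
The three queues and the task life cycle described in the diff above can be sketched as follows. This is a minimal Python illustration that assumes a task is just a shard of record indices; the names (`TaskQueues`, `dispatch`, `task_done`, `trainer_dead`) are made up for the example and do not come from the design document or the actual master implementation.

```python
from collections import deque

# Minimal sketch of the master's task bookkeeping; illustrative only.
class TaskQueues:
    def __init__(self, tasks):
        self.todo = deque(tasks)  # tasks waiting to be dispatched
        self.pending = {}         # trainer_id -> set of tasks being trained
        self.done = []            # tasks already trained in this pass

    def dispatch(self, trainer_id, count=1):
        """Move up to `count` tasks from todo into the trainer's pending set."""
        tasks = [self.todo.popleft() for _ in range(min(count, len(self.todo)))]
        self.pending.setdefault(trainer_id, set()).update(tasks)
        return tasks

    def task_done(self, trainer_id, task):
        """A trainer reported completion: move the task to the done queue."""
        self.pending[trainer_id].discard(task)
        self.done.append(task)
        if not self.todo:
            # As described above: when the todo queue is empty, start a new
            # pass by moving the done queue back into the todo queue.
            self.todo.extend(self.done)
            self.done.clear()

    def trainer_dead(self, trainer_id):
        """Requeue everything a dead trainer was working on."""
        for task in self.pending.pop(trainer_id, set()):
            self.todo.append(task)


# Hypothetical usage: shard a dataset of 100 records into tasks of 25 records.
queues = TaskQueues(tasks=[(start, start + 25) for start in range(0, 100, 25)])
first = queues.dispatch(trainer_id=0, count=2)
queues.task_done(trainer_id=0, task=first[0])
queues.trainer_dead(trainer_id=0)  # its remaining task goes back to todo
```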
 
 ### Trainer Process
 
-Trainer process will train it's current tasks, tell parameter servers it's accumulated gradient, and download latest model from parameter servers.
+The trainer process will train its current tasks, tell parameter servers its accumulated gradient, and download the latest model from parameter servers.
 
-Trainer holds entire network model while each parameter server hold a shard of model. So trainer needs to communicate will all parameter servers.
+The trainer holds entire network model while each parameter server holds a shard of the model. So trainer needs to communicate will all parameter servers.
 
 Communication involves two parts:
 
-- upload accumulated gradient. Upload can be configured to happen every **n** mini-batches.
-- download new model. Download can be configured to happend every **m** mini-batches. **n** and **m** does not need to be equal.
+- Upload accumulated gradient. Upload can be configured to happen every **n** mini-batches.
+- Download new model. Download can be configured to happen every **m** mini-batches. **n** and **m** do not have to be equal.
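 
To make the upload/download cadence above concrete, here is a small Python sketch of a trainer loop that uploads its accumulated gradient every **n** mini-batches and downloads the latest model every **m** mini-batches. The `train_one_batch`, `send_gradient`, and `fetch_model` callables are placeholders standing in for whatever RPC the trainer would use to talk to the parameter servers; they are not a real Paddle API.

```python
# Illustrative only: upload every `n` mini-batches, download every `m`.
def trainer_loop(mini_batches, train_one_batch, send_gradient, fetch_model,
                 n=4, m=8):
    accumulated = None
    for step, batch in enumerate(mini_batches, start=1):
        gradient = train_one_batch(batch)
        accumulated = gradient if accumulated is None else accumulated + gradient

        if step % n == 0:
            # Each parameter server would receive only the gradient for its
            # shard; one possible sharding rule is sketched in the next section.
            send_gradient(accumulated)
            accumulated = None

        if step % m == 0:
            fetch_model()  # pull the latest parameters from all pservers


# Hypothetical usage with dummy scalar "gradients":
trainer_loop(
    mini_batches=range(16),
    train_one_batch=lambda batch: 1.0,
    send_gradient=lambda g: print("upload gradient", g),
    fetch_model=lambda: print("download model"),
    n=4,
    m=8,
)
```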
 
 ### Parameter Server Process
 
-Parameter server processes hold model together. Since model parameters are sharded and saved on different parameter servers. All parameter servers collabratively form the global view of trained model.
+Parameter server processes hold model together. Since model parameters are sharded and saved on different parameter servers. All parameter servers collabratively form the global view of a trained model.
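 
The paragraph above implies some rule for deciding which parameter server owns which parameter. One common choice, assumed here purely for illustration and not stated in the diff, is to hash the parameter name modulo the number of parameter servers; the union of the resulting shards is the full model.

```python
import zlib

# Sketch of one possible sharding rule (an assumption, not from the design doc):
# route each parameter to a pserver by hashing its name.
def shard_for(param_name, num_pservers):
    return zlib.crc32(param_name.encode("utf-8")) % num_pservers

def shard_parameters(param_names, num_pservers):
    """Group parameter names by the pserver that would own them."""
    shards = {i: [] for i in range(num_pservers)}
    for name in param_names:
        shards[shard_for(name, num_pservers)].append(name)
    return shards


# Hypothetical usage: together the shards cover every parameter exactly once,
# which is what "the global view of a trained model" refers to.
params = ["fc1.w", "fc1.b", "fc2.w", "fc2.b", "embedding"]
print(shard_parameters(params, num_pservers=2))
```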
 
 ## Fault Tolerant
