Commit d41b402

Author: Helin Wang
Commit message: fix according to comments
1 parent dd27552 · commit d41b402

File tree

3 files changed: 9 additions, 10 deletions


doc/design/dist/README.md

Lines changed: 9 additions & 10 deletions
@@ -2,25 +2,24 @@
 
 ## Objective
 
-We want Paddle to support training on the general-purpose cluster. The cluster runs Paddle, the web server (e.g., Nginx), the log collector (e.g., fluentd), the distributed queue service (e.g., Kafka), the log joiner and other data processors written using Storm, Spark, and Hadoop MapReduce on the same cluster. As illustrated in the following graph:
+In [this slides](https://www.slideshare.net/cxwangyi/paddlepaddle-a-complete-solution-for-businesses), we explained that we'd like PaddlePaddle running on general-purpose clusters like those managed by Kubernetes, so to address demands for AI from both Internet and non-Internet industries.
 
-<img src="src/arch.png"/>
+This poses technical challenges to PaddlePaddle:
 
-This poses new challenges for Paddle,
+1. Support fault-recovery.
+1. Support both offline and online training.
+1. [Serverless computing](https://en.wikipedia.org/wiki/Serverless_computing) of distributed training.
 
-- Paddle need to be fault tolerant.
-- Input training data can be online data from real time logs or batch data from distributed file system.
-- User needs a simple way to train model on Paddle cloud. Complexities such as job scheduling should be hidden from user.
 
 ## Training Job
 
 A training job will be created once user asks Paddle cloud to train a model. The training job is made up of different processes that collaboratively consume data and produce a trained model. There are three kinds of processes:
 
-- Master process
-- Trainer process
-- Parameter server process
+1. the *master process*, which dispatches tasks to
+1. one or more *trainer processes*, which run distributed training and synchronize gradients/models via
+1. one or more *parameter server processes*, where each holds a shard of the global model.
 
-One training job will only have one master process, typically multiple trainer processes and parameter server processes. Their relation is illustrated in the following graph:
+Their relation is illustrated in the following graph:
 
 <img src="src/paddle-model-sharding.png"/>
 
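The new process list sketches a master/trainer/parameter-server architecture in which each parameter server holds a shard of the global model. Below is a minimal, hypothetical sketch of that sharding idea in Python; the names (shard_of, ParameterServer, push_gradient) are illustrative assumptions, not PaddlePaddle's actual API.

# Hypothetical sketch of model sharding across parameter servers.
# Names and the update rule are assumptions for illustration, not PaddlePaddle's API.
import hashlib


def shard_of(param_name, num_pservers):
    """Map a parameter name to the index of the parameter server that owns it."""
    digest = hashlib.md5(param_name.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_pservers


class ParameterServer:
    """Holds one shard of the global model and applies gradient updates to it."""

    def __init__(self):
        self.params = {}  # parameter name -> current value for this shard

    def push_gradient(self, name, grad, learning_rate=0.01):
        # Plain SGD update on a parameter owned by this shard.
        self.params[name] = self.params.get(name, 0.0) - learning_rate * grad

    def pull(self, name):
        return self.params.get(name, 0.0)


# A trainer routes each parameter's gradient to the shard that owns it.
pservers = [ParameterServer() for _ in range(3)]
for name, grad in [("fc1.w", 0.5), ("fc1.b", -0.2), ("fc2.w", 0.1)]:
    pservers[shard_of(name, len(pservers))].push_gradient(name, grad)

Hashing parameter names spreads shards across servers without central coordination; the actual sharding and update scheme in the design may differ.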

doc/design/dist/src/arch.graffle: -3.74 KB (binary file not shown)

doc/design/dist/src/arch.png: -50.9 KB (binary file not shown)
