
Commit 8868ae9

Fix markdownlint errors of design docs (#2060)
* Update
* Update docs/design
1 parent 0fde2b1 commit 8868ae9

22 files changed: +1784 -845 lines changed

.markdownlint.yaml

Lines changed: 7 additions & 1 deletion
```diff
@@ -1,6 +1,12 @@
 ---
 default: true
 
+MD004:
+  style: dash
+
 MD013:
   tables: false
-  code_blocks": false
+  code_blocks: false
+
+MD046:
+  style: fenced
```

docs/designs/allreduce.md

Lines changed: 291 additions & 156 deletions
Large diffs are not rendered by default.

docs/designs/async_sgd.md

Lines changed: 125 additions & 43 deletions
# Design Doc: Asynchronous SGD

## Motivation

Parameter server (PS) based distributed training uses data parallelism to speed
up training.

We have implemented synchronous SGD in ElasticDL. When PS accumulates
`grads_to_wait` gradients from workers, PS averages these gradients and updates
the model with the averaged gradients. PS also maintains `model_version`, which
equals the number of model updates. Each worker has a local copy of the model.
Before a minibatch training step starts, if there is a new `model_version` on
PS, the worker will get the new model from PS to replace the local model. After
computing the gradients with a minibatch, the worker reports the gradients to
PS together with the local model version. When PS receives gradients from a
worker, it only accepts gradients whose model version is the same as the
current PS `model_version`.

This synchronous SGD ensures model consistency at the price of wasted and
blocked computation.

- Wasted computation: when a worker reports gradients with an outdated model
  version to PS, PS will reject these gradients. The worker has to get the
  current model from PS and reuse the minibatch data to train the model again.
- Blocked computation: PS has to use a lock for model updates with gradients
  and model reads by workers to ensure model consistency.

Asynchronous SGD can avoid the wasted and blocked computation mentioned above
with a relaxed model consistency.

- PS will accept all gradients from workers.
- PS does not use locks and supports concurrent model reads and updates.

## Asynchronous SGD

Let us recall how workers train the model in synchronous SGD. Below is the
pseudocode:

```python
for minibatch in training_data:
    accepted = False
    while not accepted:
        local_model, model_version = get_model_from_ps()
        gradients = compute_gradient(local_model, minibatch)
        accepted = report_gradient_to_ps(gradients, model_version)
```

In asynchronous SGD, each worker trains the model in nearly the same way as in
synchronous SGD. The only difference is that the worker does not need to
retrain any minibatch data, because PS accepts all gradients.

```python
for minibatch in training_data:
    local_model, model_version = get_model_from_ps()
    gradients = compute_gradient(local_model, minibatch)
    report_gradient_to_ps(gradients, model_version)
```

PS does not need locks in the `GetModel` and `ReportGradient` GRPC services for
asynchronous SGD.

```python
def GetModel():
    pb_model = Model()
    for variable in pb_model:
        assign_value(variable, current_variable_value_in_PS)
    return pb_model, PS_model_version

def ReportGradient(gradients, version):
    grad_var = zip(gradients, model_variables)
    optimizer.apply_gradient(grad_var)
    PS_model_version.atomic_add(1)
```

### Relaxed Model Consistency

PS can process multiple GRPC calls `GetModel` and `ReportGradient`
concurrently. Thus, there are two kinds of relaxed model consistency.

1. In `GetModel`, during the variable assign loop, a `ReportGradient` GRPC
   service may be running and updating the variables. Thus, variables in
   `local_model` in workers may contain values from different model versions.
   The `model_version` from `get_model_from_ps` is just an approximate model
   version.
2. There may be multiple `ReportGradient` calls running concurrently. Different
   model variables may apply these gradients in different orders.

Also, the concurrent updates to variables in `ReportGradient` may cause some
gradients to not be applied, because an update can be overwritten by other
concurrently running updates. TensorFlow optimizers have an argument
[`use_locking`](https://github.com/tensorflow/tensorflow/blob/ff441191277b7e758deb48e45249fee9e880f2c8/tensorflow/python/training/optimizer.py#L319).
If `use_locking` is `True`, TensorFlow will use a
[lock](https://github.com/tensorflow/tensorflow/blob/11e22c01eb801ff24200afcdce8a03a7cdd2ed3f/tensorflow/core/kernels/training_ops.cc#L528)
to prevent concurrent updates to variables.
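
For illustration only, here is a minimal sketch of enabling this flag on the
TF1-style optimizer that the links above refer to; the learning rate value is
arbitrary, and ElasticDL's actual optimizer construction may differ.

```python
import tensorflow as tf

# With use_locking=True, the optimizer's variable update ops acquire a lock,
# so concurrent ReportGradient calls cannot interleave on the same variable.
optimizer = tf.compat.v1.train.GradientDescentOptimizer(
    learning_rate=0.1, use_locking=True
)
```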

### Staleness in Asynchronous SGD

In `ReportGradient`, the argument `version` may be smaller than
`PS_model_version`. The staleness value is the difference between
`PS_model_version` and `version`:

```python
staleness = PS_model_version - version
```

According to some [research](https://arxiv.org/abs/1810.03264), this staleness
affects the training convergence, and large staleness may result in poor
training accuracy. The deeper the model, the greater the impact of the
staleness. Some optimizers, such as
[SGD](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/optimizers/SGD)
and
[Adagrad](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/optimizers/Adagrad),
are more robust to staleness, while others, such as optimizers with momentum,
handle staleness poorly.

[Staleness-aware asynchronous SGD](https://arxiv.org/abs/1511.05950) proposes a
method to modulate the learning rate by the staleness. If the staleness is not
0, this method modulates the learning rate used in the optimizer as:

```python
if staleness > 0:
    learning_rate_used = learning_rate / staleness
else:
    learning_rate_used = learning_rate
```

### Stale Synchronous Parallel (SSP)

In the pseudocode for the asynchronous SGD worker, the worker pulls the model
from PS in every minibatch step. The [stale synchronous parallel (SSP)
method](https://dl.acm.org/citation.cfm?id=2999748) uses the strategy that the
fastest worker can exceed the slowest one by at most a predefined staleness
threshold. SSP can reduce the number of `get_model_from_ps` calls. The worker
training process is:

```python
get_model_frequency = predefined_staleness_threshold
local_model, model_version = get_model_from_ps()
local_update_count = 0
# ... (unchanged lines omitted from this diff hunk)
    else:
        apply_gradient(local_model, gradients)
```

Although the original SSP method uses this strategy in synchronous SGD, we can
also adopt the SSP strategy in asynchronous SGD to reduce `get_model_from_ps`
calls. Note that in ElasticDL, local models only have non-embedding variables,
so in `apply_gradient(local_model, gradients)`, ElasticDL workers only update
non-embedding variables. Also, a worker can run `report_gradient_to_ps`
concurrently with `apply_gradient(local_model, gradients)` when it does not
need to `get_model_from_ps`.
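
To make the abbreviated pseudocode above concrete, below is one plausible shape
of the full SSP worker loop. It reuses the helper names from the pseudocode in
this doc (`get_model_from_ps`, `compute_gradient`, `report_gradient_to_ps`,
`apply_gradient`); the exact control flow of the omitted lines is an
assumption, not the actual ElasticDL implementation.

```python
get_model_frequency = predefined_staleness_threshold
local_model, model_version = get_model_from_ps()
local_update_count = 0

for minibatch in training_data:
    gradients = compute_gradient(local_model, minibatch)
    report_gradient_to_ps(gradients, model_version)
    local_update_count += 1
    if local_update_count >= get_model_frequency:
        # The staleness budget is used up: refresh the local copy from PS.
        local_model, model_version = get_model_from_ps()
        local_update_count = 0
    else:
        # Otherwise keep training on the locally updated non-embedding model.
        apply_gradient(local_model, gradients)
```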

## Support Asynchronous SGD in ElasticDL

### Change in PS

1. No need to use locks in `GetModel` and `_update_model` in
   [server.py](../../elasticdl/python/master/servicer.py).
2. No need to accumulate gradients in `ReportGradient` in
   [server.py](../../elasticdl/python/master/servicer.py). `ReportGradient`
   calls `_update_model` directly.
3. Users decide whether to disable concurrent variable updates by setting the
   `use_locking` argument in the optimizer.
4. To support [staleness-aware asynchronous
   SGD](https://arxiv.org/abs/1511.05950), PS needs to modulate the learning
   rate in the optimizer with the staleness value. PS may have multiple threads
   running concurrently for model updates with the same optimizer instance, so
   we cannot modify the learning rate in the optimizer instance. Instead, we
   may make the learning rate a callable method and use thread-local storage
   (`threading.local()`) to store the staleness. The callable method uses the
   staleness value to modulate the learning rate. The optimizer will call this
   callable method [when it reads the learning rate
   hyperparameter](https://github.com/tensorflow/tensorflow/blob/e4262fb2fbf1cb33aaea79ff81754d1e92e99af1/tensorflow/python/keras/optimizer_v2/optimizer_v2.py#L530).
   A sketch of this approach follows the list.
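
Below is a minimal sketch of the callable learning rate described in item 4,
assuming a Keras `optimizer_v2` optimizer that accepts a callable
`learning_rate`; the names `base_learning_rate`, `_staleness_store`, and
`report_gradient` are hypothetical and only illustrate the idea.

```python
import threading

import tensorflow as tf

# Each PS update thread records the staleness of the gradients it is applying.
_staleness_store = threading.local()

base_learning_rate = 0.1


def modulated_learning_rate():
    # Called by the optimizer whenever it reads its learning rate.
    staleness = getattr(_staleness_store, "staleness", 0)
    if staleness > 0:
        return base_learning_rate / staleness
    return base_learning_rate


# A single optimizer instance shared by all PS update threads.
optimizer = tf.keras.optimizers.SGD(learning_rate=modulated_learning_rate)


def report_gradient(gradients, version, ps_model_version, model_variables):
    # Record this thread's staleness, then apply the gradients.
    _staleness_store.staleness = ps_model_version - version
    optimizer.apply_gradients(zip(gradients, model_variables))
```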

### Change in Worker

1. No need to retrain with the minibatch data.
2. To support the SSP strategy, the worker pulls the model from PS every
   `get_model_frequency` minibatch steps. Also, the worker needs to update the
   local model with the computed gradients. Model pulls/updates do not include
   embedding variables, as we directly access the embedding vectors in the
   embedding service. A sketch of the local update is shown after this list.
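
As a sketch of the worker-side local update in item 2, the snippet below skips
embedding variables; `is_embedding_variable` (here a simple name check) and the
attributes of `local_model` are hypothetical and only illustrate the intent.

```python
def is_embedding_variable(variable):
    # Hypothetical check; the real criterion may differ in ElasticDL.
    return "embedding" in variable.name


def apply_gradient(local_model, gradients, learning_rate=0.1):
    # Update only non-embedding variables in the local copy; embedding vectors
    # are kept in the embedding service and are not updated locally.
    for variable, gradient in zip(local_model.variables, gradients):
        if is_embedding_variable(variable):
            continue
        variable.assign_sub(learning_rate * gradient)
```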

### Add Arguments for `elasticdl.train`

1. `--use_async, default=False, help="True for asynchronous SGD, False for
   synchronous SGD"`
2. `--lr_staleness_modulation, default=False, help="If True, master will
   modulate learning rate with staleness in asynchronous SGD"`
3. `--get_model_frequency, default=1, help="worker will get_model from PS every
   these steps."`
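
As an illustration of how these flags could be declared, here is a hedged
`argparse` sketch; the helper name is hypothetical, and the boolean flags are
simplified to `store_true` actions rather than the `default=False` values
proposed above.

```python
import argparse


def add_async_sgd_arguments(parser):
    """Hypothetical helper mirroring the three proposed flags."""
    parser.add_argument(
        "--use_async",
        action="store_true",
        help="Use asynchronous SGD instead of synchronous SGD",
    )
    parser.add_argument(
        "--lr_staleness_modulation",
        action="store_true",
        help="If set, master will modulate learning rate with staleness "
        "in asynchronous SGD",
    )
    parser.add_argument(
        "--get_model_frequency",
        type=int,
        default=1,
        help="Worker will get_model from PS every this many steps",
    )
    return parser


args = add_async_sgd_arguments(argparse.ArgumentParser()).parse_args([])
```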
