Commit b7529f3

offload++ tutorial/example (bigscience-workshop#283)
1 parent a8a684d commit b7529f3

4 files changed: +429, -0 lines changed
Lines changed: 81 additions & 0 deletions
# ZeRO-Offload++ Tutorials

This folder contains examples that demonstrate how to use the new ZeRO-Offload++ features.

ZeRO-Offload++ now supports the **Twin-Flow** feature.

## Twin-Flow

Instead of an all-or-nothing offloading strategy, **Twin-Flow** allows one portion of the data to run on the CPU and the other portion on the GPU simultaneously. Thus, we not only mitigate GPU memory pressure by offloading data to the CPU, but also utilize both CPU and GPU computation resources more efficiently.

![Twin-Flow-img](./twin-offload.png)

As shown in the figure above, when ZeRO-Offload is triggered, **Twin-Flow** lets the user set a new configuration argument called `ratio` (default value 1) to adjust the portion of parameter updates performed by the CPU optimizer. For example, with `ratio == 0.4`, 40% of the parameters are updated using CPUAdam on the CPU, while the remaining 60% are updated using FusedAdam on the GPU.
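
Below is a minimal sketch (not part of the tutorial) of the arithmetic behind this split; the 350M parameter count is a hypothetical example:

```
# Illustrative only: how `ratio` divides optimizer work between CPU and GPU.
total_params = 350_000_000              # hypothetical model size
ratio = 0.4                             # fraction of parameter updates done by CPUAdam on CPU

cpu_params = int(total_params * ratio)  # updated by CPUAdam on the CPU
gpu_params = total_params - cpu_params  # updated by FusedAdam on the GPU

print(f"CPU-side updates: {cpu_params:,} parameters ({ratio:.0%})")
print(f"GPU-side updates: {gpu_params:,} parameters ({1 - ratio:.0%})")
```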

## How to use

**Twin-Flow** can currently be used at ZeRO stage 3 with offloading. Below we provide two tutorial examples on how to use **Twin-Flow**.

### DeepSpeed Toy Example

Here is a toy example for using **Twin-Flow** inside the DeepSpeed repo.

Under the `/tests/small_model_debugging/` folder, run

```
deepspeed partial_offload_test.py --zero 3
```

### GPT Model Training in Megatron-DeepSpeed

To enable **Twin-Flow** here, we need to add two flags to the Megatron configuration as follows:

#### Megatron Configurations

```
--no-pipeline-parallel \
--cpu-optimizer \
```

These flags have already been added to `ds_pretrain_gpt_350M.sh`.

#### DeepSpeed Configurations

On the DeepSpeed side, we need to add the following configuration:

```
"offload_optimizer": {
    "device": "cpu",
    "pin_memory": true,
    "ratio": 0.3
}
```

Basically, we first need to enable CPU offload. The user can then adjust the portion of parameter updates performed on the CPU through `ratio`. Its default value is 1, which means all parameter updates happen on the CPU. The config example above with `"ratio": 0.3` means that 30% of the parameter updates happen on the CPU, while the other 70% happen on the GPU.
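
As a minimal end-to-end sketch (assuming a recent DeepSpeed release; the placeholder model and batch size are illustrative, not from this commit), such a config can be passed to `deepspeed.initialize` as a Python dict:

```
# Minimal sketch: wiring a Twin-Flow style offload config into deepspeed.initialize.
# The model and batch size are placeholders for illustration.
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)

ds_config = {
    "train_batch_size": 8,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True,
            "ratio": 0.3,   # 30% of parameter updates on CPU, 70% on GPU
        },
    },
    "fp16": {"enabled": True},
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```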

#### Tuning suggestion on ratio

To get the best performance, we recommend setting `ratio` as low as possible without causing a GPU out-of-memory error.
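
One way to gauge the headroom is to check peak GPU memory after a few warm-up steps; this is a minimal sketch of that idea (an illustration, not part of the tutorial):

```
# Illustrative only: inspect peak GPU memory after a few training steps to decide
# whether `ratio` can be lowered further on the next run.
import torch

peak_gib = torch.cuda.max_memory_allocated() / 1024**3
total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"peak GPU memory: {peak_gib:.1f} / {total_gib:.1f} GiB")
# Plenty of headroom -> try a smaller `ratio` (more updates on the GPU).
# Close to the limit or OOM -> raise `ratio` to push more updates back to the CPU.
```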

One additional config on the DeepSpeed side is

```
"prescale_gradients": false,
```

mainly because ZeRO stage 3 does not currently support prescaled gradients.

All of the above configs have been added to `ds_config_gpt_TEMPLATE.json`.

#### End-to-end Training

To run a sample training of a GPT-350M model using Megatron-DeepSpeed, simply run:

```
bash ds_pretrain_gpt_350M.sh
```

The training now runs with **Twin-Flow**. Enjoy!

## On-going optimizations

We have some other features inside ZeRO-Offload++ coming soon. Stay tuned!

* Removing unnecessary D2H memcpy in ZeRO-Offload
* On-the-fly fp16-to-fp32 data casting inside CPUAdam
Lines changed: 32 additions & 0 deletions
{
  "train_batch_size": CONFIG_BATCH_SIZE,
  "train_micro_batch_size_per_gpu": CONFIG_MBSIZE,
  "steps_per_print": LOG_INTERVAL,

  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true,
      "ratio": 0.3
    }
  },

  "gradient_clipping": 1.0,
  "prescale_gradients": false,

  "fp16": {
    "enabled": CONFIG_FP16_ENABLED,
    "loss_scale": 0,
    "loss_scale_window": 500,
    "hysteresis": 2,
    "min_loss_scale": 1,
    "initial_scale_power": 11
  },

  "bf16": {
    "enabled": CONFIG_BF16_ENABLED
  },

  "wall_clock_breakdown": false
}
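
The uppercase placeholders (`CONFIG_BATCH_SIZE`, `CONFIG_MBSIZE`, `LOG_INTERVAL`, `CONFIG_FP16_ENABLED`, `CONFIG_BF16_ENABLED`) are substituted by the launch script before training. As an illustration only (the helper and example values below are hypothetical, not part of this commit), a minimal Python equivalent of that substitution could look like:

```
# Hypothetical helper: fill the template's placeholders and write a concrete config.
substitutions = {
    "CONFIG_BATCH_SIZE": "256",
    "CONFIG_MBSIZE": "4",
    "LOG_INTERVAL": "10",
    "CONFIG_FP16_ENABLED": "true",
    "CONFIG_BF16_ENABLED": "false",
}

with open("ds_config_gpt_TEMPLATE.json") as f:
    text = f.read()

for placeholder, value in substitutions.items():
    text = text.replace(placeholder, value)

with open("ds_config_gpt_350M.json", "w") as f:  # output name is illustrative
    f.write(text)
```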
