The following instructions refer to Dolly v1 and still need to be updated for v2 training.
* Add the `dolly` repo to Databricks (under Repos click Add Repo, enter `https://github.com/databrickslabs/dolly.git`, then click Create Repo).
* Start a `12.2 LTS ML (includes Apache Spark 3.3.2, GPU, Scala 2.12)` single-node cluster with a node type that has 8 A100 GPUs (e.g. `Standard_ND96asr_v4` or `p4d.24xlarge`). These instance types may not be available in all regions, or may be difficult to provision. In Databricks, you must select the GPU runtime first, and unselect "Use Photon", for these instance types to appear (where supported). A scripted alternative to the repo and cluster setup is sketched after this list.
* Open the `train_dolly` notebook in the Repo (which is the `train_dolly.py` file in the GitHub `dolly` repo), attach it to your GPU cluster, and run all cells. When training finishes, the notebook will save the model under `/dbfs/dolly_training`.
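
The Repos and cluster setup above can also be scripted. Below is a minimal sketch using the Databricks SDK for Python (`databricks-sdk`), not part of this repo; the repo path, runtime version key, node type, and autotermination value are assumptions to adapt to your workspace and cloud.

```
# Minimal sketch (not the repo's own tooling): scripted version of the Repos and
# cluster setup above, using the Databricks SDK for Python (pip install databricks-sdk).
# The repo path, runtime version key, and node type below are assumptions.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # picks up credentials from env vars or ~/.databrickscfg

# Clone the dolly repo into Databricks Repos.
w.repos.create(
    url="https://github.com/databrickslabs/dolly.git",
    provider="gitHub",
    path="/Repos/someone@example.com/dolly",  # hypothetical destination path
)

# Create a single-node GPU cluster on the 12.2 LTS ML GPU runtime.
cluster = w.clusters.create(
    cluster_name="dolly-training",
    spark_version="12.2.x-gpu-ml-scala2.12",  # verify with w.clusters.spark_versions()
    node_type_id="Standard_ND96asr_v4",       # Azure; e.g. p4d.24xlarge on AWS
    num_workers=0,                            # single node
    spark_conf={
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    custom_tags={"ResourceClass": "SingleNode"},
    autotermination_minutes=120,
).result()                                    # block until the cluster is running
print(cluster.cluster_id)
```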
## Training on Other Instances
A100 instance types are not available in all cloud regions, or can be hard to provision. Training is possible on other GPU instance types for smaller Dolly model sizes, with small modifications to reduce memory usage. Training will take longer on these instances. These modifications are not necessarily optimal, but are simple to make.
### A10 GPUs
Training the 12B param model is not recommended on A10s.

To train the 6.9B param model on A10 instances (ex: `g5.24xlarge`, 4 x A10 24GB; `Standard_NV72ads_A10_v5`, 2 x A10), make the following changes:
- Modify the deepspeed config file `ds_z3_bf16_config.json` to configure optimizer offload. Within the `"zero_optimization"` section, add the following (a sketch of the full resulting section appears after this list):
```
"offload_optimizer": {
  "device": "cpu",
  "pin_memory": true
},
```
- Set the `num_gpus` widget in `train_dolly` to the number of GPUs in your instance, such as 2 or 4, before running
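
For orientation, after this addition the `"zero_optimization"` section of the config would look roughly like the sketch below. Only the `offload_optimizer` block is the change described above; the other fields are illustrative of a typical ZeRO stage 3 setup, not a copy of the repo's `ds_z3_bf16_config.json`.

```
"zero_optimization": {
  "stage": 3,
  "offload_optimizer": {
    "device": "cpu",
    "pin_memory": true
  },
  "overlap_comm": true,
  "contiguous_gradients": true
}
```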
With 4 A10s, an epoch completes in about 7 hours.
To train the 2.8B param model:
- Instead, simply set `per-device-train-batch-size` and `per-device-eval-batch-size` to 2 in the `train_dolly.py` invocation of `deepspeed`, as in the sketch below
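
For orientation, the `deepspeed` launch inside `train_dolly.py` is a notebook shell command of roughly the following shape. This is a sketch, not a copy of the notebook: apart from the two batch-size options named above and the standard `--num_gpus` launcher flag, the flag names, placeholders, and values shown are assumptions.

```
!deepspeed --num_gpus={num_gpus} \
    --module training.trainer \
    --deepspeed {deepspeed_config} \
    --per-device-train-batch-size 2 \
    --per-device-eval-batch-size 2 \
    ...
```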
### V100 GPUs
To run on V100 instances with 32GB of GPU memory (ex: `p3dn.24xlarge` or `Standard_ND40rs_v2`), follow the instructions above, and additionally:
- Modify `training/trainer.py` to disable `bf16` and enable `fp16` in `TrainingArguments` (a fuller sketch appears after this list):
```
...
fp16=True,
bf16=False,
...
```
- Set the `num_gpus` widget in `train_dolly` to the number of GPUs in your instance, typically 8
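
V100 (Volta) GPUs have no bfloat16 hardware support, which is why `fp16` is used instead. For context, the `fp16`/`bf16` switch sits inside the `TrainingArguments` construction in `training/trainer.py`, roughly as in the sketch below; apart from that switch, the argument values shown are illustrative placeholders, not the repo's actual settings.

```
# Sketch of the fp16/bf16 switch inside TrainingArguments (training/trainer.py).
# Only the fp16/bf16 lines are the change described above; the other values are
# illustrative placeholders, not the repo's actual arguments.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/local_disk0/dolly_training",   # placeholder output path
    per_device_train_batch_size=8,              # placeholder batch sizes
    per_device_eval_batch_size=8,
    fp16=True,   # V100s lack bfloat16 hardware, so train in float16
    bf16=False,  # bf16 requires Ampere or newer GPUs (e.g. A100)
    deepspeed="config/ds_z3_bf16_config.json",  # assumed path to the deepspeed config
)
```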
With 8 V100s, an epoch completes in about 3.5 hours. Note that the resulting model may be slightly different when trained with `fp16` versus `bf16`.