
Commit dcef926

Suggest bfloat16 and add generation notes for A10, V100 (#95)
I propose we explicitly show loading in bf16 rather than fp32 in the generation example and in the generation code, which helps avoid OOM in many usages. I also add notes on getting generation to work on A10 and V100 GPUs with 8-bit weights.
1 parent 3ecb8d9 commit dcef926
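For a rough sense of why the dtype choice matters for the 12B model, here is a back-of-the-envelope, weight-only estimate (my numbers, not part of the commit; the parameter count is approximate and activations, KV cache, and framework overhead are ignored):

```
# Rough, weight-only memory estimates for a ~12B-parameter model
# (illustrative, not measured).
params = 12e9
for dtype, bytes_per_param in [("fp32", 4), ("bf16", 2), ("int8", 1)]:
    print(f"{dtype}: ~{params * bytes_per_param / 1e9:.0f} GB of weights")
# fp32: ~48 GB  -> too large for a single 40 GB A100
# bf16: ~24 GB  -> fits a 40 GB A100, but not a 24 GB A10
# int8: ~12 GB  -> why 8-bit loading is suggested for a single A10
```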

File tree: 2 files changed, 27 additions and 15 deletions


README.md

Lines changed: 25 additions & 14 deletions
@@ -51,12 +51,13 @@ maximize the potential of all individuals and organizations.
 
 If you'd like to simply test the model without training, the model is available on Hugging Face as [databricks/dolly-v2-12b](https://huggingface.co/databricks/dolly-v2-12b).
 
-To use the model with the `transformers` library on a machine with GPUs:
+To use the model with the `transformers` library on a machine with A100 GPUs:
 
 ```
 from transformers import pipeline
+import torch
 
-instruct_pipeline = pipeline(model="databricks/dolly-v2-12b", trust_remote_code=True, device_map="auto")
+instruct_pipeline = pipeline(model="databricks/dolly-v2-12b", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
 ```
 
 You can then use the pipeline to answer instructions:
@@ -65,28 +66,38 @@ You can then use the pipeline to answer instructions:
 instruct_pipeline("Explain to me the difference between nuclear fission and fusion.")
 ```
 
-To reduce memory usage you can load the model with `bfloat16`:
+### Generating on Other Instances
 
-```
-import torch
-from transformers import pipeline
+A100 instance types are not available in all cloud regions, or can be hard to provision. Inference is possible on other GPU instance types.
 
-instruct_pipeline = pipeline(model="databricks/dolly-v2-12b", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
-```
+#### A10 GPUs
+
+The 6.9B and 2.8B param models should work as-is.
+
+To generate using the 12B param model on A10s (ex: `g5.4xlarge`, 1 x A10 24GB), it's necessary to load and run generating using 8-bit weights, which impacts the results slightly:
+
+- Also install `bitsandbytes`
+- Add `model_kwargs={'load_in_8bit': True}` to the `pipeline()` command shown above
+
+#### V100 GPUs
+
+When using V100s (ex: `p3.2xlarge`, 1 x V100 16GB, `NC6s_v3`), in all cases, set `torch_dtype=torch.float16` in `pipeline()` instead.
+
+Otherwise, follow the steps above. The 12B param model may not function well in 8-bit on V100s.
 
 ## Getting Started with Training
 
-* Add the `dolly` repo to Databricks (under Repos click Add Repo, enter `https://github.com/databrickslabs/dolly.git`, then click Create Repo).
-* Start a `12.2 LTS ML (includes Apache Spark 3.3.2, GPU, Scala 2.12)` single-node cluster with node type having 8 A100 GPUs (e.g. `Standard_ND96asr_v4` or `p4d.24xlarge`). Note that these instance types may not be available in all regions, or may be difficult to provision. In Databricks, note that you must select the GPU runtime first, and unselect "Use Photon", for these instance types to appear (where supported).
-* Open the `train_dolly` notebook in the Repo (which is the `train_dolly.py` file in the Github `dolly` repo), attach to your GPU cluster, and run all cells. When training finishes, the notebook will save the model under `/dbfs/dolly_training`.
+- Add the `dolly` repo to Databricks (under Repos click Add Repo, enter `https://github.com/databrickslabs/dolly.git`, then click Create Repo).
+- Start a `12.2 LTS ML (includes Apache Spark 3.3.2, GPU, Scala 2.12)` single-node cluster with node type having 8 A100 GPUs (e.g. `Standard_ND96asr_v4` or `p4d.24xlarge`). Note that these instance types may not be available in all regions, or may be difficult to provision. In Databricks, note that you must select the GPU runtime first, and unselect "Use Photon", for these instance types to appear (where supported).
+- Open the `train_dolly` notebook in the Repo (which is the `train_dolly.py` file in the Github `dolly` repo), attach to your GPU cluster, and run all cells. When training finishes, the notebook will save the model under `/dbfs/dolly_training`.
 
-## Training on Other Instances
+### Training on Other Instances
 
 A100 instance types are not available in all cloud regions, or can be hard to provision. Training is possible on other GPU instance types,
 for smaller Dolly model sizes, and with small modifications to reduce memory usage.
 These modifications are not optimal, but are simple to make.
 
-### A10 GPUs
+#### A10 GPUs
 
 Training the 12B param model is not recommended on A10s.
 
@@ -105,7 +116,7 @@ To train the 2.8B param model:
 
 - Instead, simply set `per-device-train-batch-size` and `per-device-eval-batch-size` to 2 in the `train_dolly.py` invocation of `deepspeed`
 
-### V100 GPUs
+#### V100 GPUs
 
 To run on V100 instances with 32GB of GPU memory (ex: `p3dn.24xlarge` or `Standard_ND40rs_v2`), follow instructions above, and add:
 
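The A10 and V100 generation notes in the README changes above are prose-only; a minimal sketch of what the resulting `pipeline()` calls could look like, mirroring the README's description rather than additional tested code (assumes `bitsandbytes` is installed for the 8-bit path):

```
import torch
from transformers import pipeline

# A10 (ex: g5.4xlarge, 1 x A10 24GB): per the README note, load the 12B model
# in 8-bit; install bitsandbytes first.
instruct_pipeline = pipeline(
    model="databricks/dolly-v2-12b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    model_kwargs={"load_in_8bit": True},
)

# V100 (ex: p3.2xlarge, 1 x V100 16GB): V100s have no native bfloat16 support,
# so per the README pass torch_dtype=torch.float16 instead of bfloat16 above.
# The README also cautions that the 12B model may not function well in 8-bit
# on V100s, so the smaller Dolly models are the safer choice there.
```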

training/generate.py

Lines changed: 2 additions & 1 deletion
@@ -1,6 +1,7 @@
 import logging
 import re
 from typing import List, Tuple
+import torch
 
 import numpy as np
 from transformers import (
@@ -34,7 +35,7 @@ def load_model_tokenizer_for_generate(
     """
     tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path, padding_side="left")
     model = AutoModelForCausalLM.from_pretrained(
-        pretrained_model_name_or_path, device_map="auto", trust_remote_code=True
+        pretrained_model_name_or_path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True
     )
     return model, tokenizer
 
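A quick, hypothetical usage check for the updated loader (assumes the repo's `training` package is importable from the repo root; printing a parameter's dtype is just a way to confirm the bfloat16 load took effect):

```
import torch
from training.generate import load_model_tokenizer_for_generate  # repo-layout assumption

model, tokenizer = load_model_tokenizer_for_generate("databricks/dolly-v2-12b")

# With this change the weights should load in bfloat16 rather than float32.
print(next(model.parameters()).dtype)  # expected: torch.bfloat16
```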
