
Commit 905e58a

Dolly V2 Updates (#88)
This updates training to use the [`databricks-dolly-15k`](https://github.com/databrickslabs/dolly/tree/master/data) dataset. It also includes improvements to text generation and example notebooks.

Key Changes:

* The `train_dolly.py` notebook now uses Pythia models as the input models and fine-tunes using the [`databricks-dolly-15k`](https://github.com/databrickslabs/dolly/tree/master/data) dataset.
* Added `InstructionTextGenerationPipeline` for text generation. This is derived from the code in the model repo, [instruct_pipeline.py](https://huggingface.co/databricks/dolly-v2-12b/blob/main/instruct_pipeline.py). It has been improved so that it is compatible with the `TextGenerationPipeline` from the `transformers` library. Some code, such as that in `_forward`, was copied from that pipeline to help with compatibility. The biggest change relative to the current `instruct_pipeline.py` version is that it returns a list of dicts per instruction, rather than just a dict. It also now has a `return_full_text` option. Both of these changes make it usable with `langchain`.
* `generate_response` is now a wrapper around `InstructionTextGenerationPipeline`, as the code was all moved there.
* `trainer.py` now uses the local `databricks-dolly-15k.jsonl` dataset. A `text` column is constructed from the instruction, context, and response.

Minor Changes:

* Added an `experiment_id` widget to help keep track of different models that are fine-tuned.
* Added more options to the CLI for configuring training.

Additional Changes:

* Added a `generation.py` example notebook that uses `generate_response` on a couple of instructions.
* Added a `langchain.py` example notebook that uses `HuggingFacePipeline` from `langchain` and `InstructionTextGenerationPipeline` to test instructions both with and without context.
* Added a `pipeline.py` example notebook that uses `InstructionTextGenerationPipeline` to generate multiple samples per instruction.
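A minimal sketch of the new return contract, using only names that appear in this commit (the model choice and the surrounding setup are illustrative, not prescribed by the diff):

```python
from training.generate import (
    InstructionTextGenerationPipeline,
    load_model_tokenizer_for_generate,
)

model, tokenizer = load_model_tokenizer_for_generate("databricks/dolly-v2-3b")
pipe = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)

# The pipeline now returns a list of dicts per instruction (one dict per
# generated sequence), matching transformers' TextGenerationPipeline.
results = pipe("Explain to me the difference between nuclear fission and fusion.")
print(results[0]["generated_text"])
```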
1 parent 3ea242c commit 905e58a

File tree

8 files changed: +560 −112 lines changed

examples/generation.py

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
# Databricks notebook source
# MAGIC %md
# MAGIC ## Generation Example
# MAGIC
# MAGIC This takes a pretrained Dolly model, either from Hugging Face or from a local path, and runs generation with it
# MAGIC using the code from this repo.
# MAGIC
# MAGIC The model to load for generation is controlled by `input_model`. The default options are the pretrained
# MAGIC Dolly models shared on Hugging Face. Alternatively, the path to a local model that has been trained using the
# MAGIC `train_dolly` notebook can also be used.

# COMMAND ----------

# MAGIC %pip install -r ../requirements.txt

# COMMAND ----------

# MAGIC %load_ext autoreload
# MAGIC %autoreload 2

default_model = "databricks/dolly-v2-3b"

suggested_models = [
    "databricks/dolly-v1-6b",
    "databricks/dolly-v2-3b",
    "databricks/dolly-v2-7b",
    "databricks/dolly-v2-12b",
]

dbutils.widgets.combobox("input_model", default_model, suggested_models, "input_model")

# COMMAND ----------

from training.generate import generate_response, load_model_tokenizer_for_generate

input_model = dbutils.widgets.get("input_model")

model, tokenizer = load_model_tokenizer_for_generate(input_model)

# COMMAND ----------

# Examples from https://www.databricks.com/blog/2023/03/24/hello-dolly-democratizing-magic-chatgpt-open-models.html
instructions = [
    "Explain to me the difference between nuclear fission and fusion.",
    "Give me a list of 5 science fiction books I should read next.",
]

# Use the model to generate responses for each of the instructions above.
for instruction in instructions:
    response = generate_response(instruction, model=model, tokenizer=tokenizer)
    if response:
        print(f"Instruction: {instruction}\n\n{response}\n\n-----------\n")

examples/langchain.py

Lines changed: 92 additions & 0 deletions
@@ -0,0 +1,92 @@
# Databricks notebook source
# MAGIC %md
# MAGIC ## Langchain Example
# MAGIC
# MAGIC This takes a pretrained Dolly model, either from Hugging Face or from a local path, and uses langchain
# MAGIC to run generation.
# MAGIC
# MAGIC The model to load for generation is controlled by `input_model`. The default options are the pretrained
# MAGIC Dolly models shared on Hugging Face. Alternatively, the path to a local model that has been trained using the
# MAGIC `train_dolly` notebook can also be used.

# COMMAND ----------

# MAGIC %pip install -r ../requirements.txt

# COMMAND ----------

# MAGIC %load_ext autoreload
# MAGIC %autoreload 2

# COMMAND ----------

default_model = "databricks/dolly-v2-3b"

suggested_models = [
    "databricks/dolly-v1-6b",
    "databricks/dolly-v2-3b",
    "databricks/dolly-v2-7b",
    "databricks/dolly-v2-12b",
]

dbutils.widgets.combobox("input_model", default_model, suggested_models, "input_model")

# COMMAND ----------

from training.generate import InstructionTextGenerationPipeline, load_model_tokenizer_for_generate

input_model = dbutils.widgets.get("input_model")

model, tokenizer = load_model_tokenizer_for_generate(input_model)

# COMMAND ----------

from langchain import PromptTemplate, LLMChain
from langchain.llms import HuggingFacePipeline

# template for an instruction with no input
prompt = PromptTemplate(
    input_variables=["instruction"],
    template="{instruction}")

# template for an instruction with input
prompt_with_context = PromptTemplate(
    input_variables=["instruction", "context"],
    template="{instruction}\n\nInput:\n{context}")

hf_pipeline = HuggingFacePipeline(
    pipeline=InstructionTextGenerationPipeline(
        # Return the full text, because this is what the HuggingFacePipeline expects.
        model=model, tokenizer=tokenizer, return_full_text=True, task="text-generation"))

llm_chain = LLMChain(llm=hf_pipeline, prompt=prompt)
llm_context_chain = LLMChain(llm=hf_pipeline, prompt=prompt_with_context)

# COMMAND ----------

# Examples from https://www.databricks.com/blog/2023/03/24/hello-dolly-democratizing-magic-chatgpt-open-models.html
instructions = [
    "Explain to me the difference between nuclear fission and fusion.",
    "Give me a list of 5 science fiction books I should read next.",
]

# Use the model to generate responses for each of the instructions above.
for instruction in instructions:
    response = llm_chain.predict(instruction=instruction)
    print(f"Instruction: {instruction}\n\n{response}\n\n-----------\n")

# COMMAND ----------

context = (
    """George Washington (February 22, 1732[b] – December 14, 1799) was an American military officer, statesman, """
    """and Founding Father who served as the first president of the United States from 1789 to 1797. Appointed by """
    """the Continental Congress as commander of the Continental Army, Washington led Patriot forces to victory in """
    """the American Revolutionary War and served as president of the Constitutional Convention of 1787, which """
    """created and ratified the Constitution of the United States and the American federal government. Washington """
    """has been called the "Father of his Country" for his manifold leadership in the nation's founding."""
)

instruction = "When did George Washington serve as president of the Constitutional Convention?"

response = llm_context_chain.predict(instruction=instruction, context=context)
print(f"Instruction: {instruction}\n\nContext:\n{context}\n\nResponse:\n{response}\n\n-----------\n")

examples/pipeline.py

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
# Databricks notebook source
# MAGIC %md
# MAGIC ## Pipeline Example
# MAGIC
# MAGIC This takes a pretrained Dolly model, either from Hugging Face or from a local path, and uses the pipeline from
# MAGIC this repo to perform generation.
# MAGIC
# MAGIC The model to load for generation is controlled by `input_model`. The default options are the pretrained
# MAGIC Dolly models shared on Hugging Face. Alternatively, the path to a local model that has been trained using the
# MAGIC `train_dolly` notebook can also be used.

# COMMAND ----------

# MAGIC %pip install -r ../requirements.txt

# COMMAND ----------

# MAGIC %load_ext autoreload
# MAGIC %autoreload 2

# COMMAND ----------

default_model = "databricks/dolly-v2-3b"

suggested_models = [
    "databricks/dolly-v1-6b",
    "databricks/dolly-v2-3b",
    "databricks/dolly-v2-7b",
    "databricks/dolly-v2-12b",
]

dbutils.widgets.combobox("input_model", default_model, suggested_models, "input_model")

# COMMAND ----------

from training.generate import InstructionTextGenerationPipeline, load_model_tokenizer_for_generate

input_model = dbutils.widgets.get("input_model")

model, tokenizer = load_model_tokenizer_for_generate(input_model)

# COMMAND ----------

generation_pipeline = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)

# Examples from https://www.databricks.com/blog/2023/03/24/hello-dolly-democratizing-magic-chatgpt-open-models.html
instructions = [
    "Explain to me the difference between nuclear fission and fusion.",
    "Give me a list of 5 science fiction books I should read next.",
]

# Use the model to generate responses for each of the instructions above.
for instruction in instructions:
    results = generation_pipeline(instruction, num_return_sequences=2)

    print(f"Instruction: {instruction}\n")
    for i, res in enumerate(results, 1):
        text = res["generated_text"]
        print(f"Sample #{i}:\n{text}\n")
    print("-----------\n")

requirements.txt

Lines changed: 1 addition & 1 deletion
@@ -3,4 +3,4 @@ click==8.0.3
 datasets==2.8.0
 deepspeed==0.8.0
 transformers[torch]==4.25.1
-watchdog==2.1.9
+langchain>=0.0.139

train_dolly.py

Lines changed: 32 additions & 8 deletions
@@ -2,8 +2,11 @@
 # MAGIC %md
 # MAGIC ## Train Dolly
 # MAGIC
-# MAGIC This fine-tunes the [GPT-J 6B](https://huggingface.co/EleutherAI/gpt-j-6B) model on
-# MAGIC the [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) dataset.
+# MAGIC This fine-tunes EleutherAI Pythia models
+# MAGIC (e.g. [pythia-2.8b](https://huggingface.co/EleutherAI/pythia-2.8b),
+# MAGIC [pythia-6.9b](https://huggingface.co/EleutherAI/pythia-6.9b), or
+# MAGIC [pythia-12b](https://huggingface.co/EleutherAI/pythia-12b)) on
+# MAGIC the [databricks-dolly-15k](https://github.com/databrickslabs/dolly/tree/master/data) dataset.
 # MAGIC
 # MAGIC ```
 # MAGIC Licensed under the Apache License, Version 2.0 (the "License");
@@ -19,8 +22,10 @@
 # MAGIC limitations under the License.
 # MAGIC ```
 # MAGIC
-# MAGIC Please note that while GPT-J 6B is [Apache 2.0 licensed](https://huggingface.co/EleutherAI/gpt-j-6B),
-# MAGIC the Alpaca dataset is licensed under [Creative Commons NonCommercial (CC BY-NC 4.0)](https://huggingface.co/datasets/tatsu-lab/alpaca).
+# MAGIC The EleutherAI Pythia models are [Apache 2.0 licensed](https://huggingface.co/EleutherAI/pythia-12b) and
+# MAGIC the [databricks-dolly-15k](https://github.com/databrickslabs/dolly/tree/master/data) dataset is licensed under the terms
+# MAGIC of the [Creative Commons Attribution-ShareAlike 3.0 Unported License](https://creativecommons.org/licenses/by-sa/3.0/legalcode),
+# MAGIC which means it can be used for either academic or commercial purposes.

 # COMMAND ----------

@@ -55,12 +60,16 @@
 # COMMAND ----------

 import os
+import re
 from datetime import datetime
+from training.consts import DEFAULT_INPUT_MODEL, SUGGESTED_INPUT_MODELS
 from training.trainer import load_training_dataset, load_tokenizer

+dbutils.widgets.combobox("input_model", DEFAULT_INPUT_MODEL, SUGGESTED_INPUT_MODELS, "input_model")
 dbutils.widgets.text("num_gpus", "", "num_gpus")
 dbutils.widgets.text("local_training_root", "", "local_training_root")
 dbutils.widgets.text("dbfs_output_root", "", "dbfs_output_root")
+dbutils.widgets.text("experiment_id", "", "experiment_id")

 # COMMAND ----------

@@ -72,6 +81,14 @@

 timestamp = datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
 model_name = "dolly"
+
+experiment_id = dbutils.widgets.get("experiment_id")
+input_model = dbutils.widgets.get("input_model")
+
+if experiment_id:
+    experiment_id = re.sub(r"\s+", "_", experiment_id.strip())
+    model_name = f"{model_name}__{experiment_id}"
+
 checkpoint_dir_name = f"{model_name}__{timestamp}"

 root_path = os.getcwd()
@@ -122,13 +139,20 @@

 # MAGIC !deepspeed {num_gpus_flag} \
 # MAGIC     --module training.trainer \
+# MAGIC     --input-model {input_model} \
 # MAGIC     --deepspeed {deepspeed_config} \
-# MAGIC     --epochs 1 \
+# MAGIC     --epochs 2 \
 # MAGIC     --local-output-dir {local_output_dir} \
 # MAGIC     --dbfs-output-dir {dbfs_output_dir} \
-# MAGIC     --per-device-train-batch-size 8 \
-# MAGIC     --per-device-eval-batch-size 8 \
-# MAGIC     --lr 1e-5
+# MAGIC     --per-device-train-batch-size 6 \
+# MAGIC     --per-device-eval-batch-size 6 \
+# MAGIC     --logging-steps 10 \
+# MAGIC     --save-steps 200 \
+# MAGIC     --save-total-limit 20 \
+# MAGIC     --eval-steps 50 \
+# MAGIC     --warmup-steps 50 \
+# MAGIC     --test-size 200 \
+# MAGIC     --lr 5e-6

 # COMMAND ----------
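The `experiment_id` handling added above folds a sanitized widget value into the checkpoint directory name. A worked example of the naming scheme, using the same logic as the diff (the widget value is illustrative):

```python
import re
from datetime import datetime

model_name = "dolly"
experiment_id = "  lr sweep 5e-6 "  # example widget input

if experiment_id:
    # Trim and replace runs of whitespace, exactly as train_dolly.py does.
    experiment_id = re.sub(r"\s+", "_", experiment_id.strip())
    model_name = f"{model_name}__{experiment_id}"  # "dolly__lr_sweep_5e-6"

timestamp = datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
checkpoint_dir_name = f"{model_name}__{timestamp}"
print(checkpoint_dir_name)  # e.g. "dolly__lr_sweep_5e-6__2023-04-14T09:15:00"
```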
training/consts.py

Lines changed: 67 additions & 11 deletions
@@ -1,18 +1,74 @@
-DEFAULT_TRAINING_DATASET = "tatsu-lab/alpaca"
-DEFAULT_INPUT_MODEL = "EleutherAI/gpt-j-6B"
-END_KEY = "### End"
+DEFAULT_INPUT_MODEL = "EleutherAI/pythia-6.9b"
+SUGGESTED_INPUT_MODELS = [
+    "EleutherAI/pythia-2.8b",
+    "EleutherAI/pythia-6.9b",
+    "EleutherAI/pythia-12b",
+    "EleutherAI/gpt-j-6B",
+]
+INTRO_BLURB = (
+    "Below is an instruction that describes a task. Write a response that appropriately completes the request."
+)
 INSTRUCTION_KEY = "### Instruction:"
-RESPONSE_KEY_NL = f"### Response:\n"
+INPUT_KEY = "Input:"
+RESPONSE_KEY = "### Response:"
+END_KEY = "### End"
+RESPONSE_KEY_NL = f"{RESPONSE_KEY}\n"
 DEFAULT_SEED = 42

-# The format of the instruction the model has been trained on.
-PROMPT_FORMAT = """%s
+# This is a training prompt that does not contain an input string. The instruction by itself has enough information
+# to respond. For example, the instruction might ask for the year a historic figure was born.
+PROMPT_NO_INPUT_FORMAT = """{intro}

-%s
+{instruction_key}
 {instruction}

-%s""" % (
-    "Below is an instruction that describes a task. Write a response that appropriately completes the request.",
-    INSTRUCTION_KEY,
-    RESPONSE_KEY_NL,
+{response_key}
+{response}
+
+{end_key}""".format(
+    intro=INTRO_BLURB,
+    instruction_key=INSTRUCTION_KEY,
+    instruction="{instruction}",
+    response_key=RESPONSE_KEY,
+    response="{response}",
+    end_key=END_KEY,
 )
+
+# This is a training prompt that contains an input string that serves as context for the instruction. For example,
+# the input might be a passage from Wikipedia and the instruction is to extract some information from it.
+PROMPT_WITH_INPUT_FORMAT = """{intro}
+
+{instruction_key}
+{instruction}
+
+{input_key}
+{input}
+
+{response_key}
+{response}
+
+{end_key}""".format(
+    intro=INTRO_BLURB,
+    instruction_key=INSTRUCTION_KEY,
+    instruction="{instruction}",
+    input_key=INPUT_KEY,
+    input="{input}",
+    response_key=RESPONSE_KEY,
+    response="{response}",
+    end_key=END_KEY,
+)
+
+# This is the prompt that is used for generating responses using an already trained model. It ends with the response
+# key, where the job of the model is to provide the completion that follows it (i.e. the response itself).
+PROMPT_FOR_GENERATION_FORMAT = """{intro}
+
+{instruction_key}
+{instruction}
+
+{response_key}
+""".format(
+    intro=INTRO_BLURB,
+    instruction_key=INSTRUCTION_KEY,
+    instruction="{instruction}",
+    response_key=RESPONSE_KEY,
+)
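These templates are presumably what `trainer.py` uses to build the `text` column described in the commit message (constructed from instruction, context, and response). A usage illustration with a made-up record; note that `{instruction}`, `{input}`, and `{response}` survive the first `.format` call as literal placeholders, so they can be filled per record:

```python
from training.consts import PROMPT_NO_INPUT_FORMAT, PROMPT_WITH_INPUT_FORMAT

# Record without context: instruction and response only.
print(PROMPT_NO_INPUT_FORMAT.format(
    instruction="Name the first president of the United States.",
    response="George Washington.",
))

# Record with context: the "input" slot carries the supporting passage.
print(PROMPT_WITH_INPUT_FORMAT.format(
    instruction="When did George Washington serve as president of the Constitutional Convention?",
    input="Washington served as president of the Constitutional Convention of 1787.",
    response="1787.",
))
```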
