Commit 34c0bd2

Merge pull request #123 from srowen/Dataset15k
Reference HF dataset by default, now that it's live
2 parents 3fd1286 + 255c149 commit 34c0bd2

3 files changed: +6 -5 lines changed

README.md

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 
 Databricks’ [Dolly](https://huggingface.co/databricks/dolly-v2-12b) is an instruction-following large language model trained on the Databricks machine learning platform
 that is licensed for commercial use. Based on `pythia-12b`, Dolly is trained on ~15k instruction/response fine tuning records
-[`databricks-dolly-15k`](https://github.com/databrickslabs/dolly/tree/master/data) generated
+[`databricks-dolly-15k`](https://huggingface.co/datasets/databricks/databricks-dolly-15k) generated
 by Databricks employees in capability domains from the InstructGPT paper, including brainstorming, classification, closed QA, generation,
 information extraction, open QA and summarization. `dolly-v2-12b` is not a state-of-the-art model, but does exhibit surprisingly
 high quality instruction following behavior not characteristic of the foundation model on which it is based.

data/README.md

Lines changed: 2 additions & 0 deletions
@@ -4,6 +4,8 @@ Blog post: [Free Dolly: Introducing the World's First Truly Open Instruction-Tun
 
 `databricks-dolly-15k` is an open source dataset of instruction-following records used in training [databricks/dolly-v2-12b](https://huggingface.co/databricks/dolly-v2-12b) that was generated by thousands of Databricks employees in several of the behavioral categories outlined in the [InstructGPT](https://arxiv.org/abs/2203.02155) paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.
 
+It is also available on Hugging Face Datasets as [`databricks/databricks-dolly-15k`](https://huggingface.co/datasets/databricks/databricks-dolly-15k).
+
 This dataset can be used for any purpose, whether academic or commercial, under the terms of the [Creative Commons Attribution-ShareAlike 3.0 Unported License](https://creativecommons.org/licenses/by-sa/3.0/legalcode).
 
 Supported Tasks:

training/trainer.py

Lines changed: 3 additions & 4 deletions
@@ -42,7 +42,6 @@
 
 logger = logging.getLogger(__name__)
 ROOT_PATH = Path(__file__).parent.parent
-DATABRICKS_DOLLY_15K_PATH = ROOT_PATH / "data" / "databricks-dolly-15k.jsonl"
 
 
 class DataCollatorForCompletionOnlyLM(DataCollatorForLanguageModeling):
@@ -85,9 +84,9 @@ def preprocess_batch(batch: Dict[str, List], tokenizer: AutoTokenizer, max_lengt
     )
 
 
-def load_training_dataset() -> Dataset:
-    logger.info(f"Loading dataset from {DATABRICKS_DOLLY_15K_PATH}")
-    dataset = load_dataset("json", data_files=str(DATABRICKS_DOLLY_15K_PATH))["train"]
+def load_training_dataset(path_or_dataset: str = "databricks/databricks-dolly-15k") -> Dataset:
+    logger.info(f"Loading dataset from {path_or_dataset}")
+    dataset = load_dataset(path_or_dataset)["train"]
     logger.info("Found %d rows", dataset.num_rows)
 
     def _add_text(rec):
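
Note on the `load_training_dataset` change: the new code passes `path_or_dataset` straight to `load_dataset`, whereas the removed code used the `json` builder with `data_files` to read the local JSONL file. A caller who still wants to train from a local copy may need to keep that older call shape. The following is a minimal sketch of one way to accept either form, not the approach taken in this commit. The helper names `resolve_load_args` and `load_training_split` are hypothetical, and the file-suffix check is an assumption about how a local file would be passed in.

```python
from typing import Any, Dict, Tuple

# Hub dataset ID introduced as the default in this commit.
DEFAULT_DATASET = "databricks/databricks-dolly-15k"


def resolve_load_args(path_or_dataset: str = DEFAULT_DATASET) -> Tuple[str, Dict[str, Any]]:
    """Hypothetical helper: choose load_dataset() arguments for a Hub ID or a local JSON file."""
    if path_or_dataset.endswith((".json", ".jsonl")):
        # Assumption: a .json/.jsonl suffix means a local file, so fall back to the
        # generic "json" builder with data_files, as the removed code did.
        return "json", {"data_files": path_or_dataset}
    # Anything else is passed through unchanged, as in the new code.
    return path_or_dataset, {}


def load_training_split(path_or_dataset: str = DEFAULT_DATASET):
    """Hypothetical wrapper: load the "train" split for either input form."""
    # Imported lazily so resolve_load_args() can be used without `datasets` installed.
    from datasets import load_dataset

    name, kwargs = resolve_load_args(path_or_dataset)
    return load_dataset(name, **kwargs)["train"]
```

With this sketch, `load_training_split()` would fetch the dataset from the Hugging Face Hub (which needs network access), while `load_training_split("data/databricks-dolly-15k.jsonl")` would read a local file through the `json` builder.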
