Merge pull request #123 from srowen/Dataset15k

srowen · web-flow · commit 34c0bd289ca4 · 2023-04-21T18:31:02.000-05:00
Reference HF dataset by default, now that it's live
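
A minimal sketch of how a caller might keep using a local copy of the data after this change. It is not part of the PR: `resolve_load_args` and `load_training_dataset_sketch` are hypothetical names, and the extension check is an assumption about how local files would be passed. The idea is that a local `.jsonl` file is routed through the `json` builder with `data_files`, as the removed code did, while anything else is treated as a Hub dataset id.

```python
from typing import Any, Dict, Tuple

# Default introduced by this PR: the dataset id on the Hugging Face Hub.
DEFAULT_DATASET = "databricks/databricks-dolly-15k"


def resolve_load_args(path_or_dataset: str = DEFAULT_DATASET) -> Tuple[str, Dict[str, Any]]:
    """Return (name, kwargs) to pass to datasets.load_dataset.

    Hypothetical helper: local JSON / JSON Lines files go through the
    "json" builder; every other value is passed through unchanged.
    """
    if path_or_dataset.endswith((".json", ".jsonl")):
        return "json", {"data_files": path_or_dataset}
    return path_or_dataset, {}


def load_training_dataset_sketch(path_or_dataset: str = DEFAULT_DATASET):
    """Load the "train" split for either a Hub id or a local JSON file."""
    # Imported lazily so resolve_load_args can be used without `datasets` installed.
    from datasets import load_dataset

    name, kwargs = resolve_load_args(path_or_dataset)
    return load_dataset(name, **kwargs)["train"]
```

With no argument, the sketch resolves to the Hub dataset; passing something like `"data/databricks-dolly-15k.jsonl"` falls back to the local-file form used before this change.
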
diff --git a/README.md b/README.md
@@ -2,7 +2,7 @@
 
 Databricks’ [Dolly](https://huggingface.co/databricks/dolly-v2-12b) is an instruction-following large language model trained on the Databricks machine learning platform
 that is licensed for commercial use. Based on `pythia-12b`, Dolly is trained on ~15k instruction/response fine tuning records
-[`databricks-dolly-15k`](https://github.com/databrickslabs/dolly/tree/master/data) generated
+[`databricks-dolly-15k`](https://huggingface.co/datasets/databricks/databricks-dolly-15k) generated
 by Databricks employees in capability domains from the InstructGPT paper, including brainstorming, classification, closed QA, generation,
 information extraction, open QA and summarization. `dolly-v2-12b` is not a state-of-the-art model, but does exhibit surprisingly
 high quality instruction following behavior not characteristic of the foundation model on which it is based.
diff --git a/data/README.md b/data/README.md
@@ -4,6 +4,8 @@ Blog post: [Free Dolly: Introducing the World's First Truly Open Instruction-Tun
 
 `databricks-dolly-15k` is an open source dataset of instruction-following records used in training [databricks/dolly-v2-12b](https://huggingface.co/databricks/dolly-v2-12b) that was generated by thousands of Databricks employees in several of the behavioral categories outlined in the [InstructGPT](https://arxiv.org/abs/2203.02155) paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.
 
+It is also available on Hugging Face Datasets as [`databricks/databricks-dolly-15k`](https://huggingface.co/datasets/databricks/databricks-dolly-15k).
+
 This dataset can be used for any purpose, whether academic or commercial,  under the terms of the [Creative Commons Attribution-ShareAlike 3.0 Unported License](https://creativecommons.org/licenses/by-sa/3.0/legalcode).
 
 Supported Tasks:
diff --git a/training/trainer.py b/training/trainer.py
@@ -42,7 +42,6 @@
 
 logger = logging.getLogger(__name__)
 ROOT_PATH = Path(__file__).parent.parent
-DATABRICKS_DOLLY_15K_PATH = ROOT_PATH / "data" / "databricks-dolly-15k.jsonl"
 
 
 class DataCollatorForCompletionOnlyLM(DataCollatorForLanguageModeling):
@@ -85,9 +84,9 @@ def preprocess_batch(batch: Dict[str, List], tokenizer: AutoTokenizer, max_lengt
     )
 
 
-def load_training_dataset() -> Dataset:
-    logger.info(f"Loading dataset from {DATABRICKS_DOLLY_15K_PATH}")
-    dataset = load_dataset("json", data_files=str(DATABRICKS_DOLLY_15K_PATH))["train"]
+def load_training_dataset(path_or_dataset: str = "databricks/databricks-dolly-15k") -> Dataset:
+    logger.info(f"Loading dataset from {path_or_dataset}")
+    dataset = load_dataset(path_or_dataset)["train"]
     logger.info("Found %d rows", dataset.num_rows)
 
     def _add_text(rec):