Merged
8 changes: 5 additions & 3 deletions CONTRIBUTING.md
@@ -7,12 +7,14 @@ There are `TODOs` that enhance the reproducibility and accuracy of datasets

## Adding a dataset

-To add a dataset (see `datasets/yeast-osmotic-stress` as an example of a dataset):
+See `datasets/diseases` as an example of a dataset. Datasets take some form of raw data from an online service and convert it into usable datasets
+with associated gold standards for SPRAS to run on.
+
+To add a dataset:
1. Check that your dataset provider isn't already added (some of these datasets act as providers for multiple datasets)
1. Create a new folder under `datasets/<your-dataset>`
1. Add a `raw` folder containing your data
1. Add an attached Snakefile that converts your `raw` data to `processed` data.
-   - Make sure to use `uv` here. See `yeast-osmotic-stress`'s Snakefile for an example.
+   - Make sure to use `uv` here. See the `diseases` Snakefile for an example.
1. Add your Snakefile to the top-level `run_snakemake.sh` file.
1. Add your datasets to the appropriate `configs`
- If your dataset has gold standards, make sure to include them here.
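The steps above can be sketched as a minimal dataset Snakefile. All rule, file, and script names here are hypothetical; the real `datasets/diseases` Snakefile is the authoritative example.

```snakemake
# Hypothetical datasets/<your-dataset>/Snakefile sketch; file and script
# names are illustrative, not taken from this repository.
rule all:
    input:
        "processed/node-prizes.txt"

# Convert the raw download into a processed SPRAS input, running the
# conversion script through uv as the guide above asks.
rule process:
    input:
        "raw/associations.tsv"
    output:
        "processed/node-prizes.txt"
    shell:
        "uv run scripts/process.py {input} {output}"
```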
13 changes: 11 additions & 2 deletions README.md
@@ -26,16 +26,24 @@ To run the benchmarking pipeline, use:
snakemake --cores 1 --configfile configs/dmmm.yaml --show-failed-logs -s spras/Snakefile
```

+To run an individual dataset pipeline, run the respective `Snakefile` in the dataset directory using [uv](https://docs.astral.sh/uv/):
+
+```sh
+cd datasets/[dataset]
+uv run snakemake --cores 1
+```

+> [!NOTE]
+> Each of the dataset categories (at the time of writing, DMMM and PRA) is split into a separate configuration file.
+> Run each one as needed.

## Organization

-There are four primary folders in this repository:
+There are five primary folders in this repository:

```
.
+├── cache
├── configs
├── datasets
├── spras
@@ -44,7 +52,8 @@ There are four primary folders in this repository:

`spras` is the cloned submodule of [SPRAS](https://github.com/reed-compbio/spras), `web` is an
[astro](https://astro.build/) app which generates the `spras-benchmarking` [output](https://reed-compbio.github.io/spras-benchmarking/),
-`configs` is the YAML file used to talk to SPRAS, and `datasets` contains the raw data.
+`configs` is the YAML file used to talk to SPRAS, and `datasets` contains the raw data. `cache` is a utility for `datasets` that provides a convenient
+way to fetch online files for further processing.
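The fetch-and-cache pattern described above can be sketched roughly like this; the function name and cache layout are assumptions for illustration, not this repository's actual API.

```python
import urllib.request
from pathlib import Path

def fetch_cached(url: str, cache_dir: str = "cache") -> Path:
    """Download `url` into `cache_dir` once; later calls reuse the local copy."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    # Name the cached file after the last path segment of the URL.
    target = cache / url.rsplit("/", 1)[-1]
    if not target.exists():
        urllib.request.urlretrieve(url, target)
    return target
```

A dataset Snakefile can then treat the returned path as a stable local input, so re-runs never hit the network.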

The workflow runs as follows:

2 changes: 1 addition & 1 deletion datasets/diseases/README.md
@@ -56,6 +56,6 @@ By our count, we have 41 diseases that pass these filters, and have 10 or more h
- Retain the DO-gene associations for the 41 diseases from the gold standard dataset. (We discussed a version 2 where we also run DO-gene associations for diseases _not_ in the validation set; that's a later project).

**C. SPRAS Inputs**:
-- Use the STRING-DB interactome (there is a benchmark file for the DISEASES database with STRINGv9.1, but we might want to use the most recent STRING version).
+- Use the STRING-DB interactome (there is a benchmark file for the DISEASES database with STRINGv9.1, but we use the most recent STRING version).
- Each of the 41 diseases will be a separate node prizes dataset. For each disease, convert the snp_w scores into prizes and make a `node-prizes.txt` file.
- Each of the 41 diseases will have a validation dataset, comprising the high-confidence disease-gene pairs from the DISEASES text mining and/or knowledge channels. They have a score (a 4 or a 5), but I assumed we would consider them all "high confidence" and thus a gene set.
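The prize-conversion step described above could be sketched as follows; the column names and the direct use of `snp_w` as the prize are assumptions, and the repository's own scripts (which may rescale or aggregate the scores) are authoritative.

```python
import pandas as pd

def write_node_prizes(assoc: pd.DataFrame, out_path: str) -> None:
    """Write a tab-separated node-prizes file from gene/score associations.

    Assumes columns `gene` and `snp_w`; uses snp_w directly as the prize,
    though a real pipeline might transform the scores first.
    """
    # Keep one prize per gene (first occurrence wins) and relabel columns
    # to the NODEID/prize form SPRAS node-prize inputs use.
    prizes = assoc[["gene", "snp_w"]].drop_duplicates("gene")
    prizes.to_csv(out_path, sep="\t", index=False, header=["NODEID", "prize"])
```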
2 changes: 1 addition & 1 deletion datasets/diseases/scripts/gold_standard.py
@@ -59,7 +59,7 @@ def main():
GS_count_threshold = {k: v for (k, v) in GS_score_count.items() if (v > 10)}
GS_combined_threshold = GS_score_threshold.loc[GS_score_threshold["diseaseName"].isin(list(GS_count_threshold.keys()))]

-# Mapping ENSG IDs to STRING IDs through the STRING aliases file
+# Mapping ENSG IDs to ENSP IDs through the STRING aliases file
# given our ENSG and ENSP (non one-to-one!) mapping `string_aliases`,

# NOTE: the STRING API call to map genes to proteins
4 changes: 2 additions & 2 deletions datasets/diseases/scripts/inputs.py
@@ -22,8 +22,8 @@ def main():

tiga_do = tiga.merge(human_do, left_on="trait", right_on="label", how="inner", validate="many_to_one")

-# Mapping ENSG IDs to STRING IDs through the STRING aliases file
-# given our ENSG and ENSP (non one-to-one!) mapping `string_aliases`,
+# Mapping ENSG IDs to ENSP IDs through the STRING aliases file
+# given our ENSG and ENSP (non one-to-one!) mapping `string_aliases`.
string_aliases = pd.read_csv(diseases_path / "raw" / "9606.protein.aliases.txt", sep="\t", usecols=["#string_protein_id", "alias"])
string_aliases.columns = ["str_id", "ENSP"]
string_aliases = string_aliases.drop_duplicates()
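The non one-to-one mapping the comment warns about can be illustrated with a small merge; the rows below are made up, standing in for the real tab-separated `9606.protein.aliases.txt`.

```python
import pandas as pd

# Toy rows standing in for the STRING aliases file: the alias column can
# hold ENSG IDs, and the ENSG-to-protein mapping is not one-to-one.
string_aliases = pd.DataFrame({
    "str_id": ["9606.ENSP0001", "9606.ENSP0002", "9606.ENSP0002"],
    "alias": ["ENSG0010", "ENSG0010", "ENSG0020"],
}).drop_duplicates()

genes = pd.DataFrame({"ENSG": ["ENSG0010", "ENSG0020"], "snp_w": [0.9, 0.4]})

# Because one ENSG can alias several STRING proteins, this inner merge can
# fan out rows: two genes become three mapped rows here, which is why the
# real script cannot use validate="one_to_one".
mapped = genes.merge(string_aliases, left_on="ENSG", right_on="alias", how="inner")
```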