diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 55a4046..38b13d9 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -7,12 +7,14 @@ There are `TODOs` that better enhance the reproducability and accuracy of datase
 
 ## Adding a dataset
 
-To add a dataset (see `datasets/yeast-osmotic-stress` as an example of a dataset):
+See `datasets/diseases` as an example of a dataset. Each dataset takes raw data from an online service and converts it into processed data,
+with associated gold standards, for SPRAS to run on.
+
+To add a dataset:
 1. Check that your dataset provider isn't already added (some of these datasets act as providers for multiple datasets)
 1. Create a new folder under `datasets/`
-1. Add a `raw` folder containing your data
 1. Add an attached Snakefile that converts your `raw` data to `processed` data.
-   - Make sure to use `uv` here. See `yeast-osmotic-stress`'s Snakefile for an example.
+   - Make sure to use `uv` here. See the `diseases` Snakefile for an example.
 1. Add your Snakefile to the top-level `run_snakemake.sh` file.
-1. Add your datasets to the appropiate `configs`
+1. Add your datasets to the appropriate `configs`
    - If your dataset has gold standards, make sure to include them here.
diff --git a/README.md b/README.md
index 9e06a9d..c98180a 100644
--- a/README.md
+++ b/README.md
@@ -26,16 +26,24 @@ To run the benchmarking pipeline, use:
 snakemake --cores 1 --configfile configs/dmmm.yaml --show-failed-logs -s spras/Snakefile
 ```
 
+To run an individual dataset pipeline, run the respective `Snakefile` in the dataset directory using [uv](https://docs.astral.sh/uv/):
+
+```sh
+cd datasets/[dataset]
+uv run snakemake --cores 1
+```
+
 > [!NOTE]
-> Each one of the dataset categories (at the time of writing, DMMM and PRA) are split into different configuration files.
-> Run each one as you would want.
+> Each of the dataset categories (at the time of writing, DMMM and PRA) is split into a different configuration file.
+> Run each one as needed.
 
 ## Organization
 
-There are four primary folders in this repository:
+There are five primary folders in this repository:
 
 ```
 .
+├── cache
 ├── configs
 ├── datasets
 ├── spras
@@ -44,7 +52,8 @@ There are four primary folders in this repository:
 
 `spras` is the cloned submodule of [SPRAS](https://github.com/reed-compbio/spras),
 `web` is an [astro](https://astro.build/) app which generates the `spras-benchmarking` [output](https://reed-compbio.github.io/spras-benchmarking/),
-`configs` is the YAML file used to talk to SPRAS, and `datasets` contains the raw data.
+`configs` contains the YAML files used to talk to SPRAS, and `datasets` contains the raw data. `cache` is a utility for `datasets` which provides
+a convenient way to fetch online files for further processing.
 
-The workflow runs as so:
+The workflow runs as follows:
diff --git a/datasets/diseases/README.md b/datasets/diseases/README.md
index 5526cd0..9ee4855 100644
--- a/datasets/diseases/README.md
+++ b/datasets/diseases/README.md
@@ -56,6 +56,6 @@ By our count, we have 41 diseases that pass these filters, and have 10 or more h
 - Retain the DO-gene associations for the 41 diseases from the gold standard dataset. (We discussed a version 2 where we also run DO-gene associations for diseases _not_ in the validation set; that's a later project).
 
 **C. SPRAS Inputs**:
-- Use the STRING-DB interactome (there is a benchmark file for the DISEASES database with STRINGv9.1, but we use the most recent STRING version).
+- Use the STRING-DB interactome (there is a benchmark file for the DISEASES database with STRINGv9.1, but we use the most recent STRING version).
 - Each of the 41 diseases will be a separate node prizes dataset. For each disease, convert the snp_w scores into prizes and make a `node-prizes.txt` file.
-- Each of the 41 diseases will have a validation dataset, comprising of the high confidence diseases-gene pairs from the DISEASES text mining and/or knowledge channels. They have a score (a 4 or a 5), but I assumed we would consider them all "high confidence" and thus a gene set.
+- Each of the 41 diseases will have a validation dataset, comprising the high-confidence disease-gene pairs from the DISEASES text mining and/or knowledge channels. They have a score (a 4 or a 5), but I assumed we would consider them all "high confidence" and thus a gene set.
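The node-prizes step in the diseases README above ("convert the snp_w scores into prizes and make a `node-prizes.txt` file") could be sketched roughly as below. The inline rows, the identity prize transform, and the `NODEID`/`prize` header names are assumptions for illustration, not the pipeline's actual code:

```python
import pandas as pd

# Hypothetical per-disease association table; in the real pipeline these
# rows come from the processed TIGA/DISEASES data, not from inline literals.
assoc = pd.DataFrame({
    "protein": ["9606.ENSP00000269305", "9606.ENSP00000344456"],
    "snp_w": [12.4, 3.1],
})

# Convert snp_w scores into prizes (identity transform here; the real
# conversion may rescale or threshold the scores).
prizes = assoc.rename(columns={"protein": "NODEID", "snp_w": "prize"})

# Write one tab-separated node-prizes file per disease.
prizes.to_csv("node-prizes.txt", sep="\t", index=False)
```

Since each of the 41 diseases becomes its own node-prizes dataset, a loop over diseases would emit one such file per disease.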
diff --git a/datasets/diseases/scripts/gold_standard.py b/datasets/diseases/scripts/gold_standard.py
index a60b20c..846eaba 100644
--- a/datasets/diseases/scripts/gold_standard.py
+++ b/datasets/diseases/scripts/gold_standard.py
@@ -59,7 +59,7 @@ def main():
     GS_count_threshold = {k: v for (k, v) in GS_score_count.items() if (v > 10)}
     GS_combined_threshold = GS_score_threshold.loc[GS_score_threshold["diseaseName"].isin(list(GS_count_threshold.keys()))]
 
-    # Mapping ENSG IDs to STRING IDs through the STRING aliases file
+    # Mapping ENSG IDs to ENSP IDs through the STRING aliases file
     # given our ENSG and ENSP (non one-to-one!) mapping `string_aliases`,
 
     # NOTE: the STRING API call to map genes to proteins
diff --git a/datasets/diseases/scripts/inputs.py b/datasets/diseases/scripts/inputs.py
index ba35396..8dc6214 100644
--- a/datasets/diseases/scripts/inputs.py
+++ b/datasets/diseases/scripts/inputs.py
@@ -22,8 +22,8 @@ def main():
 
     tiga_do = tiga.merge(human_do, left_on="trait", right_on="label", how="inner", validate="many_to_one")
 
-    # Mapping ENSG IDs to STRING IDs through the STRING aliases file
-    # given our ENSG and ENSP (non one-to-one!) mapping `string_aliases`,
+    # Mapping ENSG IDs to ENSP IDs through the STRING aliases file
+    # given our ENSG and ENSP (non one-to-one!) mapping `string_aliases`.
     string_aliases = pd.read_csv(diseases_path / "raw" / "9606.protein.aliases.txt", sep="\t", usecols=["#string_protein_id", "alias"])
     string_aliases.columns = ["str_id", "ENSP"]
     string_aliases = string_aliases.drop_duplicates()
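The ENSG-to-ENSP mapping that both corrected comments describe can be sketched as follows. The inline `string_aliases` frame is a stand-in for `9606.protein.aliases.txt`, and `genes` is a hypothetical input table; as the comments warn, the mapping is not one-to-one, so an inner merge can drop genes or fan rows out:

```python
import pandas as pd

# Stand-in rows for 9606.protein.aliases.txt ("#string_protein_id", "alias");
# a STRING protein (ENSP) can carry many aliases, including ENSG gene IDs.
string_aliases = pd.DataFrame({
    "str_id": ["9606.ENSP00000269305", "9606.ENSP00000269305", "9606.ENSP00000344456"],
    "alias": ["ENSG00000141510", "TP53", "ENSG00000157764"],
})

# Keep only ENSG aliases, since those are the gene IDs we need to translate.
ensg_map = string_aliases[string_aliases["alias"].str.startswith("ENSG")].drop_duplicates()

# Hypothetical gene-level scores to be carried over to STRING protein IDs.
genes = pd.DataFrame({"ENSG": ["ENSG00000141510", "ENSG00000157764"], "score": [0.9, 0.4]})

# Inner merge: genes without an ENSG alias are dropped; genes matching
# several proteins fan out into several rows.
mapped = genes.merge(ensg_map, left_on="ENSG", right_on="alias", how="inner")
```

Passing `validate=` to `merge` (as `inputs.py` already does for `tiga_do`) is one way to catch unexpected duplication from this non-one-to-one mapping early.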