Merged
8 changes: 5 additions & 3 deletions CONTRIBUTING.md
@@ -7,12 +7,14 @@ There are `TODOs` that enhance the reproducibility and accuracy of datasets

## Adding a dataset

-To add a dataset (see `datasets/yeast-osmotic-stress` as an example of a dataset):
+See `datasets/diseases` as an example of a dataset. Datasets take some form of raw data from an online service and convert it into usable datasets
+with associated gold standards for SPRAS to run on.
+
+To add a dataset:
1. Check that your dataset provider isn't already added (some of these datasets act as providers for multiple datasets)
1. Create a new folder under `datasets/<your-dataset>`
1. Add a `raw` folder containing your data
1. Add an attached Snakefile that converts your `raw` data to `processed` data.
-   - Make sure to use `uv` here. See `yeast-osmotic-stress`'s Snakefile for an example.
+   - Make sure to use `uv` here. See the `diseases` Snakefile for an example.
1. Add your Snakefile to the top-level `run_snakemake.sh` file.
1. Add your datasets to the appropriate `configs`
- If your dataset has gold standards, make sure to include them here.
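The steps above can be sketched as a minimal dataset Snakefile. All rule, file, and script names here are hypothetical; the real `datasets/diseases` Snakefile is the authoritative example.

```snakemake
# Hypothetical datasets/<your-dataset>/Snakefile sketch; file and script
# names are illustrative, not taken from this repository.
rule all:
    input:
        "processed/node-prizes.txt"

# Convert the raw download into a processed SPRAS input, running the
# conversion script through uv as the guide above asks.
rule process:
    input:
        "raw/associations.tsv"
    output:
        "processed/node-prizes.txt"
    shell:
        "uv run scripts/process.py {input} {output}"
```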
13 changes: 11 additions & 2 deletions README.md
@@ -26,16 +26,24 @@ To run the benchmarking pipeline, use:
snakemake --cores 1 --configfile configs/dmmm.yaml --show-failed-logs -s spras/Snakefile
```

+To run an individual dataset pipeline, run the respective `Snakefile` in the dataset directory using [uv](https://docs.astral.sh/uv/):
+
+```sh
+cd datasets/[dataset]
+uv run snakemake --cores 1
+```

+> [!NOTE]
+> Each of the dataset categories (at the time of writing, DMMM and PRA) is split into a separate configuration file.
+> Run each one as needed.

## Organization

-There are four primary folders in this repository:
+There are five primary folders in this repository:

```
.
+├── cache
├── configs
├── datasets
├── spras
@@ -44,7 +52,8 @@ There are four primary folders in this repository:

`spras` is the cloned submodule of [SPRAS](https://github.com/reed-compbio/spras), `web` is an
[astro](https://astro.build/) app which generates the `spras-benchmarking` [output](https://reed-compbio.github.io/spras-benchmarking/),
-`configs` is the YAML file used to talk to SPRAS, and `datasets` contains the raw data.
+`configs` is the YAML file used to talk to SPRAS, and `datasets` contains the raw data. `cache` is a utility for `datasets` that provides a convenient
+way to fetch online files for further processing.
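The fetch-and-cache pattern described above can be sketched roughly like this; the function name and cache layout are assumptions for illustration, not this repository's actual API.

```python
import urllib.request
from pathlib import Path

def fetch_cached(url: str, cache_dir: str = "cache") -> Path:
    """Download `url` into `cache_dir` once; later calls reuse the local copy."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    # Name the cached file after the last path segment of the URL.
    target = cache / url.rsplit("/", 1)[-1]
    if not target.exists():
        urllib.request.urlretrieve(url, target)
    return target
```

A dataset Snakefile can then treat the returned path as a stable local input, so re-runs never hit the network.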

The workflow runs as follows:

2 changes: 1 addition & 1 deletion datasets/diseases/README.md
@@ -56,6 +56,6 @@ By our count, we have 41 diseases that pass these filters, and have 10 or more h
- Retain the DO-gene associations for the 41 diseases from the gold standard dataset. (We discussed a version 2 where we also run DO-gene associations for diseases _not_ in the validation set; that's a later project).

**C. SPRAS Inputs**:
-- Use the STRING-DB interactome (there is a benchmark file for the DISEASES database with STRINGv9.1, but we might want to use the most recent STRING version).
+- Use the STRING-DB interactome (there is a benchmark file for the DISEASES database with STRINGv9.1, but we use the most recent STRING version).
- Each of the 41 diseases will be a separate node prizes dataset. For each disease, convert the snp_w scores into prizes and make a `node-prizes.txt` file.
- Each of the 41 diseases will have a validation dataset, comprising the high-confidence disease-gene pairs from the DISEASES text mining and/or knowledge channels. They have a score (a 4 or a 5), but I assumed we would consider them all "high confidence" and thus a gene set.
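The prize-conversion step described above could be sketched as follows; the column names and the direct use of `snp_w` as the prize are assumptions, and the repository's own scripts (which may rescale or aggregate the scores) are authoritative.

```python
import pandas as pd

def write_node_prizes(assoc: pd.DataFrame, out_path: str) -> None:
    """Write a tab-separated node-prizes file from gene/score associations.

    Assumes columns `gene` and `snp_w`; uses snp_w directly as the prize,
    though a real pipeline might transform the scores first.
    """
    # Keep one prize per gene (first occurrence wins) and relabel columns
    # to the NODEID/prize form SPRAS node-prize inputs use.
    prizes = assoc[["gene", "snp_w"]].drop_duplicates("gene")
    prizes.to_csv(out_path, sep="\t", index=False, header=["NODEID", "prize"])
```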
2 changes: 1 addition & 1 deletion datasets/diseases/scripts/gold_standard.py
@@ -59,7 +59,7 @@ def main():
GS_count_threshold = {k: v for (k, v) in GS_score_count.items() if (v > 10)}
GS_combined_threshold = GS_score_threshold.loc[GS_score_threshold["diseaseName"].isin(list(GS_count_threshold.keys()))]

-# Mapping ENSG IDs to STRING IDs through the STRING aliases file
+# Mapping ENSG IDs to ENSP IDs through the STRING aliases file
# given our ENSG and ENSP (non one-to-one!) mapping `string_aliases`,

# NOTE: the STRING API call to map genes to proteins
4 changes: 2 additions & 2 deletions datasets/diseases/scripts/inputs.py
@@ -22,8 +22,8 @@ def main():

tiga_do = tiga.merge(human_do, left_on="trait", right_on="label", how="inner", validate="many_to_one")

-# Mapping ENSG IDs to STRING IDs through the STRING aliases file
-# given our ENSG and ENSP (non one-to-one!) mapping `string_aliases`,
+# Mapping ENSG IDs to ENSP IDs through the STRING aliases file
+# given our ENSG and ENSP (non one-to-one!) mapping `string_aliases`.
string_aliases = pd.read_csv(diseases_path / "raw" / "9606.protein.aliases.txt", sep="\t", usecols=["#string_protein_id", "alias"])
string_aliases.columns = ["str_id", "ENSP"]
string_aliases = string_aliases.drop_duplicates()
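The non one-to-one mapping the comment warns about can be illustrated with a small merge; the rows below are made up, standing in for the real tab-separated `9606.protein.aliases.txt`.

```python
import pandas as pd

# Toy rows standing in for the STRING aliases file: the alias column can
# hold ENSG IDs, and the ENSG-to-protein mapping is not one-to-one.
string_aliases = pd.DataFrame({
    "str_id": ["9606.ENSP0001", "9606.ENSP0002", "9606.ENSP0002"],
    "alias": ["ENSG0010", "ENSG0010", "ENSG0020"],
}).drop_duplicates()

genes = pd.DataFrame({"ENSG": ["ENSG0010", "ENSG0020"], "snp_w": [0.9, 0.4]})

# Because one ENSG can alias several STRING proteins, this inner merge can
# fan out rows: two genes become three mapped rows here, which is why the
# real script cannot use validate="one_to_one".
mapped = genes.merge(string_aliases, left_on="ENSG", right_on="alias", how="inner")
```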