Commit 2ff30da

docs: on readme

I'm also going to change the way I feed pathways into the Snakefile, since it's weird locally.

1 parent 009a00b · commit 2ff30da

File tree

3 files changed: +24 −25 lines


datasets/synthetic_data/README.md

Lines changed: 14 additions & 24 deletions
@@ -1,19 +1,23 @@
 # Synthetic Data

+_Synthetic Data_ is a generic dataset label for a class of synthetic pathways provided by [PathwayCommons](https://www.pathwaycommons.org/).
+Currently, we only use [PANTHER](https://pantherdb.org/) pathways from PathwayCommons, specifically enumerated in `./pathways.jsonc`.
+
+This entire workflow can also be done with `uv run snakemake --cores 1` inside this directory, like any other dataset.
+
+## Workflow
+
 ## PANTHER Pathway Fetching

-This dataset has a kind of 'sub'-dataset, which is a separate Snakemake rule
-used for generating the pathway files and their associated metadata to be used inside this one.
+PANTHER pathways are fetched from a single OWL file containing a bundled collection of all pathways. Since the OWL file that
+PathwayCommons provides is over 10 GB, we have a separate Snakemake workflow, located under `./panther_pathways`, that trims the OWL file
+down to only the PANTHER pathways.

-Located under `./panther_pathways`, it provides TODO.
+Inside `scripts/fetch_pathway.py`, we use this intermediately generated (and cached!) OWL file to individually generate the associated OWL and
+SIF files for each pathway.

-## Download New PANTHER Pathways
-1. Visit [Pathway Commons](https://www.pathwaycommons.org/).
-2. Search for the desired pathway (e.g., "signaling") and filter the results by the **PANTHER pathway** data source.
-   Example: [Search for "Signaling" filtered by PANTHER pathway](https://apps.pathwaycommons.org/search?datasource=panther&q=Signaling&type=Pathway)
-3. Click on the desired pathway and download the **Extended SIF** version of the pathway.
-4. In the `raw/pathway-data/` folder, create a new subfolder named after the pathway you downloaded.
-5. Move the downloaded Extended SIF file to this new folder (as a `.txt` file). Rename the file to match the subfolder name exactly.
+We also have `./util/parse_pc_pathways.py`, which takes the `pathways.txt` provided by PathwayCommons and lets us map the
+human-readable pathway names in `pathways.jsonc` to [identifiers.org](https://identifiers.org/) identifiers.

 ## Sources and Targets

@@ -23,10 +27,6 @@ are silico human surfaceomes receptors.
 [Targets](https://guolab.wchscu.cn/AnimalTFDB4//#/), or `Homo_sapiens_TF.tsv` (see the [original paper](https://doi.org/10.1093/nar/gkac907)),
 are human transcription factors.

-## Steps to Generate SPRAS-Compatible Pathways
-
-This entire workflow can also be done with `uv run snakemake --cores 1` inside this directory.
-
 ### 1. Process PANTHER Pathways

 1. Open `Snakefile` and add the name of any new pathways to the `pathways` entry.

@@ -53,13 +53,3 @@ Each subfolder will include the following three files:
 - `<pathway_name>_gs_nodes.txt`
 - `<pathway_name>_node_prizes.txt`

-# Pilot Data
-For the pilot data, use the pathway list `["Wnt_signaling", "JAK_STAT_signaling", "Interferon_gamma_signaling", "FGF_signaling", "Ras"]` in both:
-- the list in `combine.py`
-- the list in `overlap_analytics.py`
-
-Make sure the pathways in this list are also added to:
-- the `pathways` vector in `ProcessPantherPathway.R`
-- the list in `panther_spras_formatting.py`
-
-**Once you’ve updated the pathway lists in all relevant scripts, run all the steps above to generate the Pilot dataset.**
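The name-to-identifier mapping that `./util/parse_pc_pathways.py` performs can be sketched roughly as follows. This is a hypothetical sketch, not the repo's code: the column names `URI` and `DISPLAY_NAME` are assumptions about the layout of the PathwayCommons `pathways.txt`, which should be checked against the real file's header.

```python
import csv
import io

def build_pathway_index(pathways_txt: str) -> dict[str, str]:
    # Map human-readable pathway names to identifiers.org URIs.
    # Assumes a tab-separated file with URI and DISPLAY_NAME columns
    # (hypothetical column names; check the real pathways.txt header).
    reader = csv.DictReader(io.StringIO(pathways_txt), delimiter="\t")
    return {row["DISPLAY_NAME"]: row["URI"] for row in reader}

# Tiny illustrative sample, not real PathwayCommons output:
sample = (
    "URI\tDISPLAY_NAME\n"
    "http://identifiers.org/panther.pathway/P00057\tWnt signaling pathway\n"
)
print(build_pathway_index(sample)["Wnt signaling pathway"])
# → http://identifiers.org/panther.pathway/P00057
```

With an index like this, the human-readable names in `pathways.jsonc` become simple dictionary lookups.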

datasets/synthetic_data/Snakefile

Lines changed: 1 addition & 1 deletion
@@ -46,7 +46,7 @@ rule process_tfs:

 rule process_panther_pathway:
     input:
-        "raw/pathway-data/{pathway}.txt",
+        "intermediate/pathway-data/{pathway}.txt",
         "raw/human-interactome/table_S3_surfaceome.xlsx",
         "raw/human-interactome/Homo_sapiens_TF_Uniprot.tsv"
     output:
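The one-line change above moves the per-pathway input from `raw/` (hand-downloaded files) to `intermediate/` (files produced by the `panther_pathways` workflow). A hypothetical helper, purely to illustrate the new layout:

```python
def pathway_input(pathway: str) -> str:
    # Hypothetical helper mirroring the updated rule: per-pathway files
    # now live under intermediate/, generated by the panther_pathways
    # workflow, instead of being hand-downloaded into raw/.
    return f"intermediate/pathway-data/{pathway}.txt"

print(pathway_input("Wnt_signaling"))
# → intermediate/pathway-data/Wnt_signaling.txt
```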
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+# panther_pathways
+
+PathwayCommons provides the multi-GB file `pc-biopax.owl`. We need to extract specific pathways from this file.
+PaxTools, instead of streaming this XML file, opts to load the entire file into memory. Since this is infeasible
+in any cheap CI system, we instead make this a separate workflow: it takes `pc-biopax.owl`, along with
+all PANTHER pathways (TODO: this can be generalized), and generates a new OWL file that contains all PANTHER pathways.
+
+Then, instead of extracting files from the large OWL file above, we use this smaller OWL file in the `../` dataset,
+where we then split pathways individually.
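The streaming approach that motivates this workflow (as opposed to PaxTools' load-everything behavior) can be sketched with `xml.etree.ElementTree.iterparse`. This is a rough sketch under stated assumptions, not the repo's implementation: it keeps top-level elements whose serialization mentions a keyword, whereas a real BioPAX trimmer must also follow `rdf:resource` references so each kept pathway stays self-contained.

```python
import xml.etree.ElementTree as ET

def trim_owl(in_path: str, out_path: str, keyword: str = "panther") -> int:
    # Stream the document so a multi-GB OWL file never loads fully into
    # memory. Sketch only: real BioPAX extraction must also follow object
    # references, which this keyword filter ignores.
    kept = 0
    context = ET.iterparse(in_path, events=("start", "end"))
    _, root = next(context)           # consume the root's "start" event
    depth = 1
    with open(out_path, "wb") as out:
        out.write(b"<trimmed>\n")     # placeholder wrapper element
        for event, elem in context:
            if event == "start":
                depth += 1
                continue
            depth -= 1
            if depth == 1:            # a direct child of the root just closed
                blob = ET.tostring(elem)
                if keyword.encode() in blob.lower():
                    out.write(blob)
                    kept += 1
                root.remove(elem)     # free memory as we go
        out.write(b"</trimmed>\n")
    return kept
```

The key point is the `root.remove(elem)` after each top-level element: memory use stays proportional to one element, not the whole file, which is what makes the trim feasible on a cheap CI runner.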

0 commit comments
