11# Synthetic Data
22
3+ _Synthetic Data_ is a generic dataset label for a class of synthetic pathways provided by [PathwayCommons](https://www.pathwaycommons.org/).
4+ Currently, we only use [PANTHER](https://pantherdb.org/) pathways from PathwayCommons, specifically those enumerated in `./pathways.jsonc`.
5+
6+ This entire workflow can also be run with `uv run snakemake --cores 1` inside this directory, as with any other dataset.
7+
8+ ## Workflow
9+
310## PANTHER Pathway Fetching
411
5- This dataset has a kind of 'sub'-dataset, which is a separate Snakemake rule
6- used for generating the pathway files and their associated metadata to be used inside this one.
12+ PANTHER pathways are fetched from a single OWL file containing a bundled collection of all pathways. Since the OWL file that
13+ PathwayCommons provides is over 10 GB, we have a separate Snakemake workflow, located under `./panther_pathways`, that trims the OWL file down
14+ to only the pathways from PANTHER.
715
8- Located under `./panther_pathways`, it provides TODO.
16+ Inside `scripts/fetch_pathway.py`, we use this intermediate (and cached) OWL file to generate a separate OWL file and
17+ SIF file for each pathway.
918
10- ## Download New PANTHER Pathways
11- 1. Visit [Pathway Commons](https://www.pathwaycommons.org/).
12- 2. Search for the desired pathway (e.g., "signaling") and filter the results by the **PANTHER pathway** data source.
13- Example: [Search for "Signaling" filtered by PANTHER pathway](https://apps.pathwaycommons.org/search?datasource=panther&q=Signaling&type=Pathway)
14- 3. Click on the desired pathway and download the **Extended SIF** version of the pathway.
15- 4. In the `raw/pathway-data/` folder, create a new subfolder named after the pathway you downloaded.
16- 5. Move the downloaded Extended SIF file to this new folder (as a `.txt` file). Rename the file to match the subfolder name exactly.
19+ The `./util/parse_pc_pathways.py` script takes the `pathways.txt` file provided by PathwayCommons and maps the
20+ human-readable pathway names in `pathways.jsonc` to [identifiers.org](https://identifiers.org/) identifiers.
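
The name-to-identifier mapping described here could look roughly like the sketch below. It is not the actual `parse_pc_pathways.py`: it assumes `pathways.txt` is a tab-separated table with a header row, and the column names `PATHWAY_URI` and `DISPLAY_NAME`, the helper names `build_name_index` and `resolve_names`, and the sample rows are all hypothetical placeholders.

```python
import csv
import io


def build_name_index(lines, name_col="DISPLAY_NAME", uri_col="PATHWAY_URI"):
    """Map each pathway display name to its identifier.

    `lines` is any iterable of tab-separated text lines with a header row.
    The default column names are assumptions, not a documented format.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    index = {}
    for row in reader:
        # Keep the first identifier seen for a name; later duplicates are ignored.
        index.setdefault(row[name_col], row[uri_col])
    return index


def resolve_names(names, index):
    """Return identifiers for the requested names, failing loudly on unknown ones."""
    missing = [name for name in names if name not in index]
    if missing:
        raise KeyError(f"No identifier found for: {', '.join(missing)}")
    return [index[name] for name in names]


# Placeholder rows for illustration only; not real PathwayCommons data.
SAMPLE = (
    "PATHWAY_URI\tDISPLAY_NAME\n"
    "example:pathway-1\tExample pathway A\n"
    "example:pathway-2\tExample pathway B\n"
)

index = build_name_index(io.StringIO(SAMPLE))
identifiers = resolve_names(["Example pathway B"], index)
```

Raising on unknown names, rather than skipping them silently, means a misspelled entry in the pathway list would surface as soon as the mapping runs.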
1721
1822## Sources and Targets
1923
@@ -23,10 +27,6 @@ are silico human surfaceomes receptors.
2327[Targets](https://guolab.wchscu.cn/AnimalTFDB4//#/), or `Homo_sapiens_TF.tsv`, (see [original paper](https://doi.org/10.1093/nar/gkac907))
2428are human transcription factors.
2529
26- ## Steps to Generate SPRAS-Compatible Pathways
27-
28- This entire workflow can also be done with ` uv run snakemake --cores 1 ` inside this directory.
29-
3030### 1. Process PANTHER Pathways
3131
32321. Open `Snakefile` and add the name of any new pathways to the `pathways` entry.
@@ -53,13 +53,3 @@ Each subfolder will include the following three files:
5353- `<pathway_name>_gs_nodes.txt`
5454- `<pathway_name>_node_prizes.txt`
5555
56- # Pilot Data
57- For the pilot data, use the list `["Wnt_signaling", "JAK_STAT_signaling", "Interferon_gamma_signaling", "FGF_signaling", "Ras"]` in both:
58- - the list in `combine.py`
59- - the list in `overlap_analytics.py`
60-
61- Make sure these pathways in the list are also added `["Wnt_signaling", "JAK_STAT_signaling", "Interferon_gamma_signaling", "FGF_signaling", "Ras"]` to:
62- - the `pathways` vector in `ProcessPantherPathway.R`
63- - the list in `panther_spras_formatting.py`
64-
65- **Once you’ve updated the pathway lists in all relevant scripts, run all the steps above to generate the Pilot dataset.**