
Commit a7ed939

feat: unify workflow

1 parent b4e37dd commit a7ed939


10 files changed (+56, -87 lines)


datasets/synthetic_data/.gitignore

Lines changed: 1 addition & 0 deletions

```diff
@@ -1,3 +1,4 @@
 /intermediate
 /processed
 /raw
+/thresholded
```

datasets/synthetic_data/README.md

Lines changed: 5 additions & 29 deletions

````diff
@@ -7,17 +7,20 @@ This entire workflow can also be done with `uv run snakemake --cores 1` inside t
 
 ## Workflow
 
+The workflow follows these steps in order:
+
 ## PANTHER Pathway Fetching
 
 PANTHER pathways are fetched from a singular OWL file containing a bundled collection of all pathways. Since the OWL file that
-PathwayCommons provides is over 10gb, we have a separate Snakemake workflow, located nuder `./panther_pathways`, that trims down the OWL file
+PathwayCommons provides is over 10gb, we have a separate Snakemake workflow, located under `./panther_pathways`, that trims down the OWL file
 to only contain pathways from PANTHER.
 
 Inside `scripts/fetch_pathway.py`, we use this intermediately-generated (and cached!) OWL file to individually generate associated OWL and
 SIF files for each pathway.
 
 We have a `./util/parse_pc_pathways.py`, which takes a `pathways.txt` provided by PathwayCommons, and allows us to map the
-human-readable pathway names in `pathways.jsonc` into [identifiers.org](https://identifiers.org/) identifiers.
+human-readable pathway names into [identifiers.org](https://identifiers.org/) identifiers, which we later trim down
+with our provided list of pathway names in `pathways.jsonc` using `list_curated_pathways.py`.
 
 ## Sources and Targets
 
@@ -26,30 +29,3 @@ are silico human surfaceomes receptors.
 
 [Targets]( https://guolab.wchscu.cn/AnimalTFDB4//#/), or `Homo_sapiens_TF.tsv`, (see [original paper](https://doi.org/10.1093/nar/gkac907))
 are human transcription factors.
-
-### 1. Process PANTHER Pathways
-
-1. Open `Snakefile` and add the name of any new pathways to the `pathways` entry.
-2. Run the command:
-```sh
-uv run scripts/process_panther_pathway.py <pathway>
-```
-3. This will create five new files in the respective `pathway` subfolder of the `pathway-data/` directory:
-- `edges.txt`
-- `nodes.txt`
-- `prizes-100.txt`
-- `sources.txt`
-- `targets.txt`
-
-### 2. Convert Pathways to SPRAS-Compatible Format
-1. In `panther_spras_formatting.py`, add the name of any new pathways to the `pathway_dirs` list on **line 8**.
-2. From the synthetic_data/ directory, run the command:
-```
-python scripts/panther_spras_formatting.py
-```
-3. This will create a new folder named `spras-compatible-pathway-data`, containing subfolders for each PANTHER pathway in SPRAS-compatible format.
-Each subfolder will include the following three files:
-- `<pathway_name>_gs_edges.txt`
-- `<pathway_name>_gs_nodes.txt`
-- `<pathway_name>_node_prizes.txt`
-
````

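The name-to-identifier mapping the README describes (map curated human-readable names to identifiers.org URIs, failing if a name is not uniquely resolvable) can be sketched as follows. This is a minimal, self-contained illustration: the row data, the `NAME` column, and the URIs are hypothetical stand-ins for PathwayCommons' `pathways.txt` (only the `PATHWAY_URI` column appears in this commit's code).

```python
import json

# Hypothetical rows standing in for PathwayCommons' pathways.txt;
# the real file is tab-separated and its name column may be labelled differently.
pc_rows = [
    {"NAME": "Wnt signaling pathway", "PATHWAY_URI": "https://identifiers.org/panther.pathway:P00057"},
    {"NAME": "FGF signaling pathway", "PATHWAY_URI": "https://identifiers.org/panther.pathway:P00021"},
    {"NAME": "Some uncurated pathway", "PATHWAY_URI": "https://identifiers.org/panther.pathway:P99999"},
]
curated = ["Wnt signaling pathway", "FGF signaling pathway"]  # stands in for pathways.jsonc

# Keep only curated names, failing loudly if a name is missing or ambiguous.
mapping = {}
for name in curated:
    matches = [row["PATHWAY_URI"] for row in pc_rows if row["NAME"] == name]
    if len(matches) != 1:
        raise RuntimeError(f"{name} references {len(matches)} pathways, when we need to uniquely get one!")
    mapping[name] = matches[0]

print(json.dumps(mapping, indent=4))
```

The uncurated row is silently dropped, which is the "trim down" step the README mentions.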
datasets/synthetic_data/Snakefile

Lines changed: 20 additions & 19 deletions

```diff
@@ -16,6 +16,7 @@ rule all:
 
 produce_fetch_rules({
     "raw/9606.protein.links.full.v12.0.txt": FetchConfig(["STRING", "9606", "9606.protein.links.full.txt.gz"], uncompress=True),
+    "raw/9606.protein.aliases.txt": FetchConfig(["STRING", "9606", "9606.protein.aliases.txt.gz"], uncompress=True),
     "raw/human-interactome/table_S3_surfaceome.xlsx": ["Surfaceome", "table_S3_surfaceome.xlsx"],
     "raw/human-interactome/Homo_sapiens_TF.tsv": ["TranscriptionFactors", "Homo_sapiens_TF.tsv"],
     "raw/human-interactome/HUMAN_9606_idmapping_selected.tsv": FetchConfig(["UniProt", "9606", "HUMAN_9606_idmapping_selected.tab.gz"], uncompress=True),
@@ -26,12 +27,10 @@ produce_fetch_rules({
 
 rule interactome:
     input:
+        "raw/human-interactome/HUMAN_9606_idmapping_selected.tsv",
         "raw/9606.protein.links.full.v12.0.txt",
         "raw/9606.protein.aliases.txt"
-    output:
-        "processed/proteins_missing_aliases.csv",
-        "processed/removed_edges.txt",
-        "processed/interactome.tsv"
+    output: "processed/interactome.tsv"
     shell:
         "uv run scripts/interactome.py"
 
@@ -46,7 +45,7 @@ rule process_tfs:
 
 rule process_panther_pathway:
     input:
-        "intermediate/pathway-data/{pathway}.txt",
+        "intermediate/pathway-pc-data/{pathway}.sif",
         "raw/human-interactome/table_S3_surfaceome.xlsx",
         "raw/human-interactome/Homo_sapiens_TF_Uniprot.tsv"
     output:
@@ -56,7 +55,7 @@ rule process_panther_pathway:
         "intermediate/{pathway}/targets.txt",
         "intermediate/{pathway}/prizes.txt"
     shell:
-        "uv run scripts/process_panther_pathway.py {wildcards.pathway}"
+        'uv run scripts/process_panther_pathway.py "{wildcards.pathway}"'
 
 rule make_spras_compatible:
     input:
@@ -70,7 +69,7 @@ rule make_spras_compatible:
        "processed/{pathway}/{pathway}_gs_edges.txt",
        "processed/{pathway}/{pathway}_gs_nodes.txt"
     shell:
-        "uv run scripts/panther_spras_formatting.py {wildcards.pathway}"
+        'uv run scripts/panther_spras_formatting.py "{wildcards.pathway}"'
 
 rule threshold:
     input:
@@ -80,23 +79,25 @@ rule threshold:
        expand("thresholded/{threshold}/{{pathway}}/interactome.txt", threshold=thresholds),
        expand("thresholded/{threshold}/{{pathway}}/gold_standard_edges.txt", threshold=thresholds)
     shell:
-        "uv run scripts/sampling.py {wildcards.pathway}"
+        'uv run scripts/sampling.py "{wildcards.pathway}"'
 
 rule make_pathway_map:
     input:
        "raw/pathways.txt"
     output:
-        "processed/pathway_id_mapping.tsv"
+        "intermediate/curated_pathways_id_mapping.json"
     shell:
        "uv run scripts/list_curated_pathways.py"
 
-for pathway in pathways:
-    rule:
-        input:
-            "processed/pathway_id_mapping.tsv",
-            "raw/pc-panther-biopax.owl"
-        output:
-            "intermediate/pathway-data/{pathway}.owl",
-            "intermediate/pathway-data/{pathway}.sif"
-        shell:
-            f"uv run scripts/fetch_pathway.py {pathway}"
+rule process_pathways:
+    input:
+        "intermediate/curated_pathways_id_mapping.json",
+        "raw/pc-panther-biopax.owl"
+    params:
+        # A little trick from https://stackoverflow.com/a/71327709/7589775
+        pathway=lambda wildcards: wildcards.get("pathway")
+    output:
+        "intermediate/pathway-pc-data/{pathway}.owl",
+        "intermediate/pathway-pc-data/{pathway}.sif"
+    shell:
+        'uv run scripts/fetch_pathway.py "{params.pathway}"'
```

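The repeated quoting change in this Snakefile (`{wildcards.pathway}` becoming `"{wildcards.pathway}"`) matters because PANTHER pathway names contain spaces. A minimal Python sketch of the failure mode, using the stdlib `shlex` module to emulate shell word splitting:

```python
import shlex

pathway = "Wnt signaling pathway"  # PANTHER pathway names contain spaces

unquoted = f"uv run scripts/fetch_pathway.py {pathway}"
quoted = f'uv run scripts/fetch_pathway.py "{pathway}"'

# Without quotes, the shell splits the name into three separate arguments:
assert shlex.split(unquoted)[3:] == ["Wnt", "signaling", "pathway"]
# With quotes, it survives as a single argv entry:
assert shlex.split(quoted)[3] == "Wnt signaling pathway"
```

The same reasoning explains the `params` indirection in `rule process_pathways`: the wildcard is exposed as a param so the shell command can wrap it in double quotes.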
datasets/synthetic_data/pathways.jsonc

Lines changed: 1 addition & 2 deletions

```diff
@@ -17,9 +17,8 @@
     "Hedgehog signaling pathway",
     "FGF signaling pathway",
     "FAS signaling pathway",
-    // This is actually the Endothelin signaling pathway.
     // TODO: report to PathwayCommons: see https://apps.pathwaycommons.org/pathways?uri=https%3A%2F%2Fidentifiers.org%2Fpanther.pathway%3AP00019.
-    "untitled",
+    // We want to add the Endothelin signaling pathway, but it is currently labelled under "untitled."
     "EGF receptor signaling pathway",
     "Cadherin signaling pathway",
     "Apoptosis signaling pathway",
```

datasets/synthetic_data/scripts/fetch_pathway.py

Lines changed: 4 additions & 4 deletions

```diff
@@ -1,7 +1,7 @@
 import argparse
+import json
 from pathlib import Path
 
-import pandas
 from paxtools.fetch import fetch
 from paxtools.sif import toSIF
 
@@ -18,10 +18,10 @@ def parser():
 
 def main():
     args = parser().parse_args()
-    curated_pathways_df = pandas.read_csv(synthetic_directory / "intermediate" / "curated_pathways.tsv", sep="\t")
-    associated_id = curated_pathways_df.loc[curated_pathways_df["Name"] == args.pathway_name].reset_index(drop=True).loc[0]["ID"]
+    curated_pathways_df = json.loads((synthetic_directory / "intermediate" / "curated_pathways_id_mapping.json").read_text())
+    associated_id = curated_pathways_df[args.pathway_name]
 
-    pathway_data_dir = synthetic_directory / "intermediate" / "pathway-data"
+    pathway_data_dir = synthetic_directory / "intermediate" / "pathway-pc-data"
     pathway_data_dir.mkdir(exist_ok=True, parents=True)
 
     fetch(
```

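The change above swaps a pandas `DataFrame.loc` filter (plus a `reset_index` round trip) for a plain dict lookup on the JSON mapping. A small sketch of the new access pattern; the URI here is a hypothetical stand-in for an entry in `curated_pathways_id_mapping.json`:

```python
import json

# A minimal stand-in for intermediate/curated_pathways_id_mapping.json
serialized = json.dumps({
    "FGF signaling pathway": "https://identifiers.org/panther.pathway:P00021",
}, indent=4)

# The lookup is now a single dict indexing operation; an unknown pathway
# name raises KeyError immediately instead of failing inside pandas.
mapping = json.loads(serialized)
associated_id = mapping["FGF signaling pathway"]
print(associated_id)
```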
datasets/synthetic_data/scripts/list_curated_pathways.py

Lines changed: 2 additions & 4 deletions

```diff
@@ -1,3 +1,4 @@
+import json
 from pathlib import Path
 from jsonc_parser.parser import JsoncParser
 import pandas
@@ -21,10 +22,7 @@ def main():
         if selected_pathways_count != 1:
             raise RuntimeError(f"{pathway} references {selected_pathways_count} pathways, when we need to uniquely get one!")
         pathway_mapping[pathway] = selected_pathways["PATHWAY_URI"].loc[0]
-    curated_pathway_df = pandas.DataFrame(pathway_mapping.items())
-    curated_pathway_df.columns = ["Name", "ID"]
-    (synthetic_directory / "intermediate").mkdir(exist_ok=True)
-    curated_pathway_df.to_csv(synthetic_directory / "intermediate" / "curated_pathways.tsv", index=False, sep="\t")
+    (synthetic_directory / "intermediate" / "curated_pathways_id_mapping.json").write_text(json.dumps(pathway_mapping, indent=4))
 
 
 if __name__ == "__main__":
```

datasets/synthetic_data/scripts/panther_spras_formatting.py

Lines changed: 4 additions & 4 deletions

```diff
@@ -1,11 +1,11 @@
 import pandas as pd
 from pathlib import Path
-from .util.parser import parser
+from datasets.synthetic_data.scripts.util.parser import parser
 
-current_directory = Path(__file__).parent.resolve()
+synthetic_directory = Path(__file__).parent.parent.resolve()
 
-spras_compatible_dir = Path(current_directory, "..", "processed")
-directory = Path(current_directory, "..", "intermediate")
+spras_compatible_dir = synthetic_directory / "processed"
+directory = synthetic_directory / "intermediate"
 
 directed = [
     "controls-state-change-of",
```

datasets/synthetic_data/scripts/process_panther_pathway.py

Lines changed: 7 additions & 15 deletions

```diff
@@ -1,12 +1,13 @@
-import argparse
 import io
 import pandas as pd
 from pathlib import Path
 
-current_directory = Path(__file__).parent.resolve()
+from datasets.synthetic_data.scripts.util.parser import parser
 
-data_directory = current_directory / ".." / "raw" / "pathway-data"
-interactome_folder = current_directory / ".." / "raw" / "human-interactome"
+synthetic_directory = Path(__file__).parent.parent.resolve()
+
+data_directory = synthetic_directory / "intermediate" / "pathway-pc-data"
+interactome_folder = synthetic_directory / "raw" / "human-interactome"
 
 
 def process_pathway(file: Path, folder: Path):
@@ -65,18 +66,9 @@ def process_pathway(file: Path, folder: Path):
     scores["active"] = "true"
     scores.to_csv(folder / "prizes.txt", sep="\t", index=False)
 
-
-def parser():
-    parser = argparse.ArgumentParser(prog="PANTHER pathway parser")
-
-    parser.add_argument("pathway", choices=[file.stem for file in data_directory.iterdir()])
-
-    return parser
-
-
 if __name__ == "__main__":
     pathway = parser().parse_args().pathway
-    pathway_file = data_directory / Path(pathway).with_suffix(".txt")
-    intermediate_folder = current_directory / ".." / "intermediate" / pathway
+    pathway_file = data_directory / Path(pathway).with_suffix(".sif")
+    intermediate_folder = synthetic_directory / "intermediate" / pathway
     intermediate_folder.mkdir(parents=True, exist_ok=True)
     process_pathway(pathway_file, intermediate_folder)
```

datasets/synthetic_data/scripts/sampling.py

Lines changed: 8 additions & 8 deletions

```diff
@@ -4,9 +4,9 @@
 from typing import OrderedDict, NamedTuple
 from tools.sample import attempt_sample
 from tools.trim import trim_data_file
-from .util.parser import parser
+from datasets.synthetic_data.scripts.util.parser import parser
 
-current_directory = Path(__file__).parent.resolve()
+synthetic_directory = Path(__file__).parent.parent.resolve()
 
 
 # From SPRAS. TODO: import once SPRAS uses pixi
@@ -22,7 +22,7 @@ def convert_undirected_to_directed(df: pandas.DataFrame) -> pandas.DataFrame:
 
 def count_weights() -> OrderedDict[int, int]:
     """Returns an ordered map (lowest to highest weight) from the weight to the number of elements the weight has"""
-    weight_counts = pandas.read_csv(current_directory / ".." / "processed" / "weight-counts.tsv", sep="\t")
+    weight_counts = pandas.read_csv(synthetic_directory / "processed" / "weight-counts.tsv", sep="\t")
     return collections.OrderedDict(sorted({int(k * 1000): int(v) for k, v in dict(weight_counts.values).items()}.items()))
 
 
@@ -32,7 +32,7 @@ def read_pathway(pathway_name: str) -> pandas.DataFrame:
     with columns Interactor1 -> Interactor2.
     """
     pathway_df = pandas.read_csv(
-        current_directory / ".." / "processed" / pathway_name / f"{pathway_name}_gs_edges.txt",
+        synthetic_directory / "processed" / pathway_name / f"{pathway_name}_gs_edges.txt",
         sep="\t",
         names=["Interactor1", "Interactor2", "Weight", "Direction"],
     )
@@ -48,7 +48,7 @@ class SourcesTargets(NamedTuple):
 
 def get_node_data(pathway_name: str) -> pandas.DataFrame:
     return pandas.read_csv(
-        current_directory / ".." / "processed" / pathway_name / f"{pathway_name}_node_prizes.txt", sep="\t", usecols=["NODEID", "sources", "targets"]
+        synthetic_directory / "processed" / pathway_name / f"{pathway_name}_node_prizes.txt", sep="\t", usecols=["NODEID", "sources", "targets"]
     )
 
 
@@ -66,7 +66,7 @@ def main():
     pathway_name = parser().parse_args().pathway
     print("Reading interactome...")
     interactome_df = pandas.read_csv(
-        current_directory / ".." / "processed" / "interactome.tsv",
+        synthetic_directory / "processed" / "interactome.tsv",
         header=None,
         sep="\t",
         names=["Interactor1", "Interactor2", "Weight", "Direction"],
@@ -83,7 +83,7 @@ def main():
 
     # TODO: isolate percentage constant (this currently builds up 0%, 10%, ..., 100%)
     for percentage in map(lambda x: (x + 1) / 10, range(10)):
-        output_directory = current_directory / ".." / "thresholded" / str(percentage) / pathway_name
+        output_directory = synthetic_directory / "thresholded" / str(percentage) / pathway_name
         output_interactome = output_directory / "interactome.txt"
         output_gold_standard = output_directory / "gold_standard_edges.txt"
 
@@ -107,7 +107,7 @@ def main():
         print(f"Attempt number {attempt_number}")
 
         # We're done sampling:
-        (output_directory / "attempt-number.txt").write_text(attempt_number)
+        (output_directory / "attempt-number.txt").write_text(str(attempt_number))
         # we need to trim our data file as well.
         trim_data_file(data_df=node_data_df, gold_standard_df=pathway_df).to_csv(output_directory / "node_prizes.tsv", sep="\t", index=False)
 
```

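The final change in `sampling.py` wraps `attempt_number` in `str()` because `Path.write_text` accepts only `str` and raises `TypeError` for an `int`. A self-contained sketch of the bug and the fix (the counter value is a stand-in):

```python
import tempfile
from pathlib import Path

attempt_number = 3  # stands in for the sampling loop's counter

with tempfile.TemporaryDirectory() as tmp:
    marker = Path(tmp) / "attempt-number.txt"

    # Path.write_text requires a str; passing an int raises TypeError.
    try:
        marker.write_text(attempt_number)
        raise AssertionError("expected TypeError")
    except TypeError:
        pass

    marker.write_text(str(attempt_number))  # the fix applied in this commit
    assert marker.read_text() == "3"
```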
Lines changed: 4 additions & 2 deletions

```diff
@@ -1,12 +1,14 @@
 import argparse
 from pathlib import Path
 
-scripts_directory = Path(__file__).parent.resolve()
+from jsonc_parser.parser import JsoncParser
+
+synthetic_directory = Path(__file__).parent.parent.parent.resolve()
 
 
 def parser():
     parser = argparse.ArgumentParser(prog="PANTHER pathway parser")
 
-    parser.add_argument("pathway", choices=[file.stem for file in (scripts_directory / ".." / "raw" / "pathway-data").iterdir()])
+    parser.add_argument("pathway", choices=JsoncParser.parse_file(synthetic_directory / "pathways.jsonc"))
 
     return parser
```

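The shared parser now derives its valid `choices` from `pathways.jsonc` instead of listing files on disk. A sketch of the argparse pattern with a stand-in list (the real code gets this list from `JsoncParser.parse_file`; no `jsonc_parser` dependency is needed here):

```python
import argparse

# Stand-in for the list parsed from pathways.jsonc
curated_pathways = ["Wnt signaling pathway", "FGF signaling pathway"]

parser = argparse.ArgumentParser(prog="PANTHER pathway parser")
# `choices` makes argparse reject any pathway name not in the curated list
parser.add_argument("pathway", choices=curated_pathways)

args = parser.parse_args(["Wnt signaling pathway"])
print(args.pathway)  # prints "Wnt signaling pathway"
```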