Commit c1ec044

Merge pull request #41 from openproblems-bio/jalil
small structural changes to metrics

2 parents 1eacd3b + e9066b0, commit c1ec044


52 files changed (+496, -1588 lines)

docs/source/dataset.rst

Lines changed: 5 additions & 5 deletions

@@ -1,7 +1,7 @@
 Datasets
 ========
-In this section, we explain how to access datasets without installing geneRNIB. The available datasets include **OPSCA, Nakatake, Replogle, Adamson, Norman, Xaira_HCT116, Xaira_HEK293T** and **ParseBioscience**.
-It should be noted that three datasets of **Xaira_HCT116, Xaira_HEK293T** and **ParseBioscience** are not added to the manuscript yet.
+Here, we explain how to access datasets without installing geneRNIB. The available datasets include **OPSCA, Nakatake, Replogle, Adamson, Norman, Xaira_HCT116, Xaira_HEK293T** and **ParseBioscience**.
+It should be noted that three datasets of **Xaira_HCT116, Xaira_HEK293T** and **ParseBioscience** are not added to the initial manuscript yet.
 All datasets provide RNA data, while the `OPSCA` dataset also includes ATAC data.
 The perturbation signature of these datasets are given below.
 You need `awscli` to download the datasets. If you don't have it installed, you can download it from [here](https://aws.amazon.com/cli/). You do not need to sign in to download the datasets.
@@ -36,8 +36,8 @@ Downloading the extended datasets
 -----------------------------
 
 Beyond the core datasets, extended datasets include single cell data of large perturbation datasets such as Replogle, Xaira, and Parse bioscience.
-The previous version was subsetted to smaller number of perturbations for computational efficiency.
-Additionally, pseudobulked versions of all other datasets are available, representing the combined inference and evaluation datasets.
+The previous version were pseudobulked for computational efficiency.
+Additionally, full pseudobulked versions of all other datasets are available, representing the combined inference and evaluation datasets.
 These files are used for the `positive control` method, which incorporates all variations within a dataset.
 
 To download the extended datasets, use:
@@ -59,7 +59,7 @@ We have not provided raw data for a few recent datasets due to very large file s
 
 Downloading the GRN models
 ---------------------------------------------
-To download the GRN models used in geneRNIB so far, run:
+To download the GRN models used in geneRNIB, run:
 
 .. code-block:: bash
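Since the datasets sit in a public S3 bucket, the `awscli` instructions above come down to an `aws s3 sync` call. A hedged sketch follows: the bucket prefix is assumed from `scripts/sync_resources.sh` elsewhere in this commit, and `--no-sign-request` is the standard AWS CLI flag for anonymous access. The snippet only builds and prints the command, since the actual transfer needs network access.

```shell
# Sketch only: constructs the download command rather than running it.
# SRC is assumed from scripts/sync_resources.sh in this same commit;
# --no-sign-request enables anonymous (not-signed-in) download.
SRC="s3://openproblems-data/resources/grn/grn_benchmark"
DEST="resources/grn_benchmark"
CMD="aws s3 sync $SRC $DEST --no-sign-request"
echo "$CMD"   # copy-paste this output to run the actual download
```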
(new file — name hidden in this view)

Lines changed: 16 additions & 0 deletions

@@ -0,0 +1,16 @@
+#!/bin/bash
+#SBATCH --job-name=experiments
+#SBATCH --output=logs/%j.out
+#SBATCH --error=logs/%j.err
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=2
+#SBATCH --time=20:00:00
+#SBATCH --mem=1000GB
+#SBATCH --partition=cpu
+#SBATCH --mail-type=END,FAIL
+#SBATCH --mail-user=jalil.nourisa@gmail.com
+
+
+set -e
+
+python src/stability_analysis/pseudobulk/bulk_vs_sc/script.py

scripts/prior/run_consensus.sh

Lines changed: 1 addition & 1 deletion

@@ -54,7 +54,7 @@ for dataset in "${datasets[@]}"; do
         --models_dir "$models_dir" \
         --ws_consensus "resources/grn_benchmark/prior/ws_consensus_${dataset}.csv" \
         --tf_all "resources/grn_benchmark/prior/tf_all.csv" \
-        --evaluation_data_sc "resources/grn_benchmark/evaluation_data/${dataset}_sc.h5ad" \
+        --evaluation_data_sc "resources/processed_data/${dataset}_evaluation_sc.h5ad" \
         --models "${models[@]}"
 done
 

scripts/run_all.sh

Lines changed: 7 additions & 5 deletions

@@ -1,21 +1,23 @@
 set -e
 
 datasets=('replogle') #'replogle' 'op' 'nakatake' 'adamson' 'norman'
-run_local=false # set to true to run locally, false to run on AWS
+run_local=true # set to true to run locally, false to run on AWS
 
-run_grn_inference=true
-run_grn_evaluation=false
+run_grn_inference=false
+run_grn_evaluation=true
 run_download=false
 
+
 for dataset in "${datasets[@]}"; do
+
     if [ "$run_grn_inference" = true ]; then
         echo "Running GRN inference for dataset: $dataset"
         if [ "$run_local" = true ]; then
             echo "Running locally"
         else
             echo "Running on AWS"
         fi
-        bash scripts/run_grn_inference.sh $dataset $run_local
+        bash scripts/run_grn_inference.sh --dataset=$dataset --run_local=$run_local
 
     fi
 
@@ -33,7 +35,7 @@ for dataset in "${datasets[@]}"; do
     fi
 
     echo "Running GRN evaluation for dataset: $dataset"
-    bash scripts/run_grn_evaluation.sh --dataset=$dataset --run_local=$run_local --build_images=false
+    bash scripts/run_grn_evaluation.sh --dataset=$dataset --run_local=$run_local --build_images=false
     fi
 
     if [ "$run_download" = true ]; then
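The change above switches the `run_grn_inference.sh` call from positional arguments to `--key=value` flags. A minimal, self-contained sketch of the receiving side, using the same `${arg#*=}` expansion as the updated scripts (`parse_args` is an illustrative name, not from the repository):

```shell
#!/bin/bash
set -e

# Parse --dataset=... / --run_local=... flags in the style of the updated scripts.
parse_args() {
  DATASET=""
  RUN_LOCAL=false
  for arg in "$@"; do
    case $arg in
      --dataset=*)   DATASET="${arg#*=}" ;;    # strip everything up to and including '='
      --run_local=*) RUN_LOCAL="${arg#*=}" ;;
      *) echo "Unknown argument: $arg" >&2; exit 1 ;;
    esac
  done
}

parse_args --dataset=replogle --run_local=true
echo "$DATASET $RUN_LOCAL"
```

Note that the `shift` calls in the repository's version are harmless but unnecessary here: `for arg in "$@"` iterates over a saved copy of the argument list, so shifting does not affect the loop.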

scripts/run_grn_evaluation.sh

Lines changed: 1 addition & 0 deletions

@@ -12,6 +12,7 @@ PREDICTION="none"
 SAVE_DIR="none"
 BUILD_IMAGES=true
 
+
 # Parse arguments
 for arg in "$@"; do
     case $arg in

scripts/run_grn_inference.sh

Lines changed: 59 additions & 27 deletions

@@ -1,31 +1,54 @@
 #!/bin/bash
 
-# --------------------------
-# Dataset-specific availability:
-#   ws_distance: only for [norman, adamson, replogle]
-#   scprint: only for [opsca, replogle, norman] (uses different inference data)
-#   scenicplus, scglue, granie, figr, celloracle: only for [opsca]
-# --------------------------
 set -e
+
 # --- Settings ---
-test=false
-DATASET="${1:-replogle}"
-echo "DATASET is: $DATASET"
-RUN_ID="${DATASET}_inference"
-run_local="${2:-false}"
+RUN_TEST=false
 num_workers=10
 apply_tf_methods=true
 layer='lognorm'
+RUN_LOCAL=false
+# Parse arguments
+for arg in "$@"; do
+    case $arg in
+        --dataset=*)
+            DATASET="${arg#*=}"
+            shift
+            ;;
+        --test_run=*)
+            RUN_TEST="${arg#*=}"
+            shift
+            ;;
+        --run_local=*)
+            RUN_LOCAL="${arg#*=}"
+            shift
+            ;;
+        *)
+            echo "Unknown argument: $arg"
+            exit 1
+            ;;
+    esac
+done
+if [ -z "${DATASET:-}" ]; then
+    echo "Error: DATASET must be provided. Use --dataset=<dataset_name>."
+    exit 1
+fi
+
+
+echo "DATASET is: $DATASET"
+RUN_ID="${DATASET}_inference"
 
 # --- Directories ---
-resources_folder=$([ "$test" = true ] && echo "resources_test" || echo "resources")
-if [ "$run_local" = true ]; then
+resources_folder=$([ "$RUN_TEST" = true ] && echo "resources_test" || echo "resources")
+if [ "$RUN_LOCAL" = true ]; then
     resources_dir="./${resources_folder}/"
 else
     resources_dir="s3://openproblems-data/${resources_folder}/grn"
 fi
 
 publish_dir="${resources_dir}/results/${DATASET}"
+
+
 params_dir="./params"
 param_file="${params_dir}/${RUN_ID}.yaml"
 param_local="${params_dir}/${RUN_ID}_param_local.yaml"
@@ -41,7 +64,7 @@ echo "Local param file: $param_local"
 > "$param_local"
 > "$param_file"
 
-if [ "$run_local" = true ]; then
+if [ "$RUN_LOCAL" = true ]; then
     cat >> "$param_local" << HERE
 param_list:
 HERE
@@ -51,17 +74,31 @@ fi
 append_entry() {
     local dataset="$1"
     local methods="$2"
+    local use_train_sc=false
+
+    # check if third argument is non-empty (or truthy)
+    if [ -n "$3" ]; then
+        use_train_sc=true
+    fi
 
     if [[ "$dataset" =~ ^(norman|nakatake|adamson)$ ]]; then
         layer_='X_norm'
     else
-        layer_=$layer
+        layer_="$layer"
+    fi
+
+    if [ "$use_train_sc" = true ]; then
+        rna_file="${resources_dir}/extended_data/${dataset}_train_sc.h5ad"
+        group_id="${dataset}_sc_train"
+    else
+        rna_file="${resources_dir}/grn_benchmark/inference_data/${dataset}_rna.h5ad"
+        group_id="${dataset}"
     fi
-
+
     cat >> "$param_local" << HERE
-  - id: ${dataset}
+  - id: ${group_id}
     method_ids: $methods
-    rna: ${resources_dir}/grn_benchmark/inference_data/${dataset}_rna.h5ad
+    rna: $rna_file
     rna_all: ${resources_dir}/extended_data/${dataset}_bulk.h5ad
     tf_all: ${resources_dir}/grn_benchmark/prior/tf_all.csv
     layer: $layer_
@@ -76,17 +113,12 @@ HERE
 fi
 }
 
-# --------- COMBINATIONS TO ADD ----------
-if [[ "$DATASET" == "op" ]]; then
-    methods="[pearson_corr, negative_control, positive_control, portia, ppcor, scenic, scprint, grnboost, scenicplus, scglue, granie, figr, celloracle]"
-else
-    methods="[pearson_corr, negative_control, positive_control, scprint]"
-fi
-
-append_entry "$DATASET" "$methods"
+# Example usage:
+append_entry "$DATASET" "[pearson_corr, negative_control, positive_control]"
+# append_entry "$DATASET" "[scprint]" "true"
 
 # --- Final configuration ---
-if [ "$run_local" = true ]; then
+if [ "$RUN_LOCAL" = true ]; then
     cat >> "$param_local" << HERE
 output_state: "state.yaml"
 publish_dir: "$publish_dir"
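The reworked `append_entry` above templates a YAML param list by appending heredoc blocks to a file. A stripped-down, self-contained sketch of that pattern (the file name, ids, and paths are illustrative, not the benchmark's real configuration):

```shell
#!/bin/bash
set -e

param_local="param_local.yaml"   # illustrative file name
: > "$param_local"               # truncate, like `> "$param_local"` in the script

# Append one YAML entry per call; the unquoted HERE delimiter lets the shell
# expand ${group_id} and ${rna_file} inside the block.
append_entry() {
  local group_id="$1"
  local rna_file="$2"
  cat >> "$param_local" << HERE
  - id: ${group_id}
    rna: ${rna_file}
HERE
}

echo "param_list:" >> "$param_local"
append_entry "replogle" "inference_data/replogle_rna.h5ad"               # default entry
append_entry "replogle_sc_train" "extended_data/replogle_train_sc.h5ad"  # train_sc variant

cat "$param_local"
```

Quoting the delimiter instead (`<< 'HERE'`) would suppress expansion and emit the `${...}` text verbatim, which is why the script leaves it unquoted.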

scripts/run_process_data.sh

Lines changed: 1 addition & 1 deletion

@@ -19,5 +19,5 @@ set -e
 
 # python src/process_data/opsca/script.py
 # python src/process_data/replogle/script.py #--run_test #--run_test
-python src/process_data/xaira/script.py #--run_test
+python src/process_data/xaira/script.py #--run_test
 # python src/process_data/parse_bioscience/script.py #--run_test

scripts/sync_resources.sh

Lines changed: 1 addition & 1 deletion

@@ -18,4 +18,4 @@ set -e
 # aws s3 sync s3://openproblems-data/resources/grn/grn_models resources/grn_models --delete
 # aws s3 sync resources_test/ s3://openproblems-data/resources_test/grn/ --delete
 aws s3 sync resources/grn_benchmark/ s3://openproblems-data/resources/grn/grn_benchmark --delete
-# aws s3 sync resources/extended_data/ s3://openproblems-data/resources/grn/extended_data --delete
+aws s3 sync resources/extended_data/ s3://openproblems-data/resources/grn/extended_data --delete

src/control_methods/negative_control/config.vsh.yaml

Lines changed: 1 addition & 1 deletion

@@ -21,4 +21,4 @@ runners:
   - type: executable
   - type: nextflow
     directives:
-      label: [ midtime, highmem, highcpu ]
+      label: [ midtime, lowmem, highcpu ]

src/control_methods/pearson_corr/config.vsh.yaml

Lines changed: 1 addition & 1 deletion

@@ -31,4 +31,4 @@ runners:
   - type: executable
   - type: nextflow
     directives:
-      label: [midtime, veryhighmem, midcpu]
+      label: [midtime, midmem, midcpu]
