Commit c1ec044

Merge pull request #41 from openproblems-bio/jalil
small structural changes to metrics

2 parents 1eacd3b + e9066b0, commit c1ec044


52 files changed (+496, -1588 lines)

docs/source/dataset.rst

Lines changed: 5 additions & 5 deletions

@@ -1,7 +1,7 @@
 Datasets
 ========
-In this section, we explain how to access datasets without installing geneRNIB. The available datasets include **OPSCA, Nakatake, Replogle, Adamson, Norman, Xaira_HCT116, Xaira_HEK293T** and **ParseBioscience**.
-It should be noted that three datasets of **Xaira_HCT116, Xaira_HEK293T** and **ParseBioscience** are not added to the manuscript yet.
+Here, we explain how to access datasets without installing geneRNIB. The available datasets include **OPSCA, Nakatake, Replogle, Adamson, Norman, Xaira_HCT116, Xaira_HEK293T** and **ParseBioscience**.
+It should be noted that three datasets of **Xaira_HCT116, Xaira_HEK293T** and **ParseBioscience** are not added to the initial manuscript yet.
 All datasets provide RNA data, while the `OPSCA` dataset also includes ATAC data.
 The perturbation signature of these datasets are given below.
 You need `awscli` to download the datasets. If you don't have it installed, you can download it from [here](https://aws.amazon.com/cli/). You do not need to sign in to download the datasets.
@@ -36,8 +36,8 @@ Downloading the extended datasets
 -----------------------------
 
 Beyond the core datasets, extended datasets include single cell data of large perturbation datasets such as Replogle, Xaira, and Parse bioscience.
-The previous version was subsetted to smaller number of perturbations for computational efficiency.
-Additionally, pseudobulked versions of all other datasets are available, representing the combined inference and evaluation datasets.
+The previous version were pseudobulked for computational efficiency.
+Additionally, full pseudobulked versions of all other datasets are available, representing the combined inference and evaluation datasets.
 These files are used for the `positive control` method, which incorporates all variations within a dataset.
 
 To download the extended datasets, use:
@@ -59,7 +59,7 @@ We have not provided raw data for a few recent datasets due to very large file s
 
 Downloading the GRN models
 ---------------------------------------------
-To download the GRN models used in geneRNIB so far, run:
+To download the GRN models used in geneRNIB, run:
 
 .. code-block:: bash
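Since the datasets sit in a public S3 bucket, the `awscli` instructions above come down to an `aws s3 sync` call. A hedged sketch follows: the bucket prefix is assumed from `scripts/sync_resources.sh` elsewhere in this commit, and `--no-sign-request` is the standard AWS CLI flag for anonymous access. The snippet only builds and prints the command, since the actual transfer needs network access.

```shell
# Sketch only: constructs the download command rather than running it.
# SRC is assumed from scripts/sync_resources.sh in this same commit;
# --no-sign-request enables anonymous (not-signed-in) download.
SRC="s3://openproblems-data/resources/grn/grn_benchmark"
DEST="resources/grn_benchmark"
CMD="aws s3 sync $SRC $DEST --no-sign-request"
echo "$CMD"   # copy-paste this output to run the actual download
```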
(new file — name hidden in this view)

Lines changed: 16 additions & 0 deletions

@@ -0,0 +1,16 @@
+#!/bin/bash
+#SBATCH --job-name=experiments
+#SBATCH --output=logs/%j.out
+#SBATCH --error=logs/%j.err
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=2
+#SBATCH --time=20:00:00
+#SBATCH --mem=1000GB
+#SBATCH --partition=cpu
+#SBATCH --mail-type=END,FAIL
+#SBATCH --mail-user=jalil.nourisa@gmail.com
+
+
+set -e
+
+python src/stability_analysis/pseudobulk/bulk_vs_sc/script.py

scripts/prior/run_consensus.sh

Lines changed: 1 addition & 1 deletion

@@ -54,7 +54,7 @@ for dataset in "${datasets[@]}"; do
         --models_dir "$models_dir" \
         --ws_consensus "resources/grn_benchmark/prior/ws_consensus_${dataset}.csv" \
         --tf_all "resources/grn_benchmark/prior/tf_all.csv" \
-        --evaluation_data_sc "resources/grn_benchmark/evaluation_data/${dataset}_sc.h5ad" \
+        --evaluation_data_sc "resources/processed_data/${dataset}_evaluation_sc.h5ad" \
         --models "${models[@]}"
 done
 

scripts/run_all.sh

Lines changed: 7 additions & 5 deletions

@@ -1,21 +1,23 @@
 set -e
 
 datasets=('replogle') #'replogle' 'op' 'nakatake' 'adamson' 'norman'
-run_local=false # set to true to run locally, false to run on AWS
+run_local=true # set to true to run locally, false to run on AWS
 
-run_grn_inference=true
-run_grn_evaluation=false
+run_grn_inference=false
+run_grn_evaluation=true
 run_download=false
 
+
 for dataset in "${datasets[@]}"; do
+
     if [ "$run_grn_inference" = true ]; then
         echo "Running GRN inference for dataset: $dataset"
         if [ "$run_local" = true ]; then
             echo "Running locally"
         else
             echo "Running on AWS"
         fi
-        bash scripts/run_grn_inference.sh $dataset $run_local
+        bash scripts/run_grn_inference.sh --dataset=$dataset --run_local=$run_local
 
     fi
 
@@ -33,7 +35,7 @@ for dataset in "${datasets[@]}"; do
     fi
 
     echo "Running GRN evaluation for dataset: $dataset"
-    bash scripts/run_grn_evaluation.sh --dataset=$dataset --run_local=$run_local --build_images=false
+    bash scripts/run_grn_evaluation.sh --dataset=$dataset --run_local=$run_local --build_images=false
     fi
 
     if [ "$run_download" = true ]; then
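The change above switches the `run_grn_inference.sh` call from positional arguments to `--key=value` flags. A minimal, self-contained sketch of the receiving side, using the same `${arg#*=}` expansion as the updated scripts (`parse_args` is an illustrative name, not from the repository):

```shell
#!/bin/bash
set -e

# Parse --dataset=... / --run_local=... flags in the style of the updated scripts.
parse_args() {
  DATASET=""
  RUN_LOCAL=false
  for arg in "$@"; do
    case $arg in
      --dataset=*)   DATASET="${arg#*=}" ;;    # strip everything up to and including '='
      --run_local=*) RUN_LOCAL="${arg#*=}" ;;
      *) echo "Unknown argument: $arg" >&2; exit 1 ;;
    esac
  done
}

parse_args --dataset=replogle --run_local=true
echo "$DATASET $RUN_LOCAL"
```

Note that the `shift` calls in the repository's version are harmless but unnecessary here: `for arg in "$@"` iterates over a saved copy of the argument list, so shifting does not affect the loop.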

scripts/run_grn_evaluation.sh

Lines changed: 1 addition & 0 deletions

@@ -12,6 +12,7 @@ PREDICTION="none"
 SAVE_DIR="none"
 BUILD_IMAGES=true
 
+
 # Parse arguments
 for arg in "$@"; do
     case $arg in

scripts/run_grn_inference.sh

Lines changed: 59 additions & 27 deletions

@@ -1,31 +1,54 @@
 #!/bin/bash
 
-# --------------------------
-# Dataset-specific availability:
-#   ws_distance: only for [norman, adamson, replogle]
-#   scprint: only for [opsca, replogle, norman] (uses different inference data)
-#   scenicplus, scglue, granie, figr, celloracle: only for [opsca]
-# --------------------------
 set -e
+
 # --- Settings ---
-test=false
-DATASET="${1:-replogle}"
-echo "DATASET is: $DATASET"
-RUN_ID="${DATASET}_inference"
-run_local="${2:-false}"
+RUN_TEST=false
 num_workers=10
 apply_tf_methods=true
 layer='lognorm'
+RUN_LOCAL=false
+# Parse arguments
+for arg in "$@"; do
+    case $arg in
+        --dataset=*)
+            DATASET="${arg#*=}"
+            shift
+            ;;
+        --test_run=*)
+            RUN_TEST="${arg#*=}"
+            shift
+            ;;
+        --run_local=*)
+            RUN_LOCAL="${arg#*=}"
+            shift
+            ;;
+        *)
+            echo "Unknown argument: $arg"
+            exit 1
+            ;;
+    esac
+done
+if [ -z "${DATASET:-}" ]; then
+    echo "Error: DATASET must be provided. Use --dataset=<dataset_name>."
+    exit 1
+fi
+
+
+echo "DATASET is: $DATASET"
+RUN_ID="${DATASET}_inference"
 
 # --- Directories ---
-resources_folder=$([ "$test" = true ] && echo "resources_test" || echo "resources")
-if [ "$run_local" = true ]; then
+resources_folder=$([ "$RUN_TEST" = true ] && echo "resources_test" || echo "resources")
+if [ "$RUN_LOCAL" = true ]; then
     resources_dir="./${resources_folder}/"
 else
     resources_dir="s3://openproblems-data/${resources_folder}/grn"
 fi
 
 publish_dir="${resources_dir}/results/${DATASET}"
+
+
 params_dir="./params"
 param_file="${params_dir}/${RUN_ID}.yaml"
 param_local="${params_dir}/${RUN_ID}_param_local.yaml"
@@ -41,7 +64,7 @@ echo "Local param file: $param_local"
 > "$param_local"
 > "$param_file"
 
-if [ "$run_local" = true ]; then
+if [ "$RUN_LOCAL" = true ]; then
     cat >> "$param_local" << HERE
 param_list:
 HERE
@@ -51,17 +74,31 @@ fi
 append_entry() {
     local dataset="$1"
     local methods="$2"
+    local use_train_sc=false
+
+    # check if third argument is non-empty (or truthy)
+    if [ -n "$3" ]; then
+        use_train_sc=true
+    fi
 
     if [[ "$dataset" =~ ^(norman|nakatake|adamson)$ ]]; then
         layer_='X_norm'
     else
-        layer_=$layer
+        layer_="$layer"
+    fi
+
+    if [ "$use_train_sc" = true ]; then
+        rna_file="${resources_dir}/extended_data/${dataset}_train_sc.h5ad"
+        group_id="${dataset}_sc_train"
+    else
+        rna_file="${resources_dir}/grn_benchmark/inference_data/${dataset}_rna.h5ad"
+        group_id="${dataset}"
     fi
-
+
     cat >> "$param_local" << HERE
-  - id: ${dataset}
+  - id: ${group_id}
     method_ids: $methods
-    rna: ${resources_dir}/grn_benchmark/inference_data/${dataset}_rna.h5ad
+    rna: $rna_file
     rna_all: ${resources_dir}/extended_data/${dataset}_bulk.h5ad
     tf_all: ${resources_dir}/grn_benchmark/prior/tf_all.csv
     layer: $layer_
@@ -76,17 +113,12 @@ HERE
 fi
 }
 
-# --------- COMBINATIONS TO ADD ----------
-if [[ "$DATASET" == "op" ]]; then
-    methods="[pearson_corr, negative_control, positive_control, portia, ppcor, scenic, scprint, grnboost, scenicplus, scglue, granie, figr, celloracle]"
-else
-    methods="[pearson_corr, negative_control, positive_control, scprint]"
-fi
-
-append_entry "$DATASET" "$methods"
+# Example usage:
+append_entry "$DATASET" "[pearson_corr, negative_control, positive_control]"
+# append_entry "$DATASET" "[scprint]" "true"
 
 # --- Final configuration ---
-if [ "$run_local" = true ]; then
+if [ "$RUN_LOCAL" = true ]; then
     cat >> "$param_local" << HERE
 output_state: "state.yaml"
 publish_dir: "$publish_dir"
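The reworked `append_entry` above templates a YAML param list by appending heredoc blocks to a file. A stripped-down, self-contained sketch of that pattern (the file name, ids, and paths are illustrative, not the benchmark's real configuration):

```shell
#!/bin/bash
set -e

param_local="param_local.yaml"   # illustrative file name
: > "$param_local"               # truncate, like `> "$param_local"` in the script

# Append one YAML entry per call; the unquoted HERE delimiter lets the shell
# expand ${group_id} and ${rna_file} inside the block.
append_entry() {
  local group_id="$1"
  local rna_file="$2"
  cat >> "$param_local" << HERE
  - id: ${group_id}
    rna: ${rna_file}
HERE
}

echo "param_list:" >> "$param_local"
append_entry "replogle" "inference_data/replogle_rna.h5ad"               # default entry
append_entry "replogle_sc_train" "extended_data/replogle_train_sc.h5ad"  # train_sc variant

cat "$param_local"
```

Quoting the delimiter instead (`<< 'HERE'`) would suppress expansion and emit the `${...}` text verbatim, which is why the script leaves it unquoted.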

scripts/run_process_data.sh

Lines changed: 1 addition & 1 deletion

@@ -19,5 +19,5 @@ set -e
 
 # python src/process_data/opsca/script.py
 # python src/process_data/replogle/script.py #--run_test #--run_test
-python src/process_data/xaira/script.py #--run_test
+python src/process_data/xaira/script.py #--run_test
 # python src/process_data/parse_bioscience/script.py #--run_test

scripts/sync_resources.sh

Lines changed: 1 addition & 1 deletion

@@ -18,4 +18,4 @@ set -e
 # aws s3 sync s3://openproblems-data/resources/grn/grn_models resources/grn_models --delete
 # aws s3 sync resources_test/ s3://openproblems-data/resources_test/grn/ --delete
 aws s3 sync resources/grn_benchmark/ s3://openproblems-data/resources/grn/grn_benchmark --delete
-# aws s3 sync resources/extended_data/ s3://openproblems-data/resources/grn/extended_data --delete
+aws s3 sync resources/extended_data/ s3://openproblems-data/resources/grn/extended_data --delete

src/control_methods/negative_control/config.vsh.yaml

Lines changed: 1 addition & 1 deletion

@@ -21,4 +21,4 @@ runners:
   - type: executable
   - type: nextflow
     directives:
-      label: [ midtime, highmem, highcpu ]
+      label: [ midtime, lowmem, highcpu ]

src/control_methods/pearson_corr/config.vsh.yaml

Lines changed: 1 addition & 1 deletion

@@ -31,4 +31,4 @@ runners:
   - type: executable
   - type: nextflow
     directives:
-      label: [midtime, veryhighmem, midcpu]
+      label: [midtime, midmem, midcpu]
