🎨 Updates

mpielies · mpielies · commit 36b42d08c7e0 · 2025-03-27T23:46:46.000+01:00
- Readme updated
- Revisions notebook cleaned, POLR2A integration domains set to default.
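
As a rough illustration of what the new POLR2A integration defaults mean, the sketch below sums a binned signal from `upstream_bins` before a gene's TSS to `downstream_bins` after it. It assumes one value per 1 kbp bin and a window anchored at the TSS bin. The function name `integrate_around_tss` and its handling of strand and window edges are hypothetical and not taken from the notebook.

```python
def integrate_around_tss(signal, tss_bin, strand="+", upstream_bins=2, downstream_bins=3):
    """Sum a binned signal around the TSS bin (hypothetical helper, not the notebook's code)."""
    if strand == "+":
        # Half-open range: the TSS bin itself counts as the first downstream bin.
        start, stop = tss_bin - upstream_bins, tss_bin + downstream_bins
    else:
        # On the minus strand, upstream lies at higher coordinates, so mirror the range.
        start, stop = tss_bin - downstream_bins + 1, tss_bin + upstream_bins + 1
    # Clip to the predicted window so genes near its edges do not wrap around.
    start, stop = max(start, 0), min(stop, len(signal))
    return float(sum(signal[start:stop]))


# Toy signal with one value per 1 kbp bin.
signal = [float(i) for i in range(10)]
new_default = integrate_around_tss(signal, tss_bin=5)  # (-2, 3) kbp
old_default = integrate_around_tss(signal, tss_bin=5, upstream_bins=1, downstream_bins=2)  # (-1, 2) kbp
```

Under these assumptions, one bin equals 1 kbp, so changing the defaults from (1, 2) to (2, 3) widens the integrated region from 3 to 5 bins.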
diff --git a/Readme.md b/Readme.md
@@ -16,6 +16,9 @@ This repository contains the files and scripts required to reproduce the results
 ### `configurations`
 - Configuration files (.yaml) required to build different flavours of CLASTER.
 
+### `environment`
+- We provide a predefined environment configuration file to avoid compatibility issues between package versions when running the tutorial.
+
 ### `images`
 - Overview of CLASTER's architecture.
 
@@ -25,8 +28,10 @@ The folder contains the test set inputs for both data modalities, i.e. samples e
 
 ### `scripts`
 
-- [`0_Tutorial.ipynb`](https://github.com/RasmussenLab/CLASTER/blob/master/scripts/0_Tutorial.ipynb): The notebook provides a rapid overview of the most important steps in CLASTER's pipeline, including training and validating the network using the EIR framework. 
-- `I_Data_obtention.ipynb`: This notebook guides the user through the data obtention process, including:
+- **prom_CHiC_preprocessing**: Folder containing the scripts used to obtain promoter-capture HiC cooler files from the raw reads deposited in SRA files.
+
+- [`0_Tutorial.ipynb`](https://github.com/RasmussenLab/CLASTER/blob/master/scripts/0_Tutorial.ipynb): The notebook provides a rapid overview of the most important steps in CLASTER's pipeline, including training and validating the network using the EIR framework. See `I_Data_obtention.ipynb` for details on downloading publicly available data and converting it into an EIR-friendly format.
+- [`I_Data_obtention.ipynb`](https://github.com/RasmussenLab/CLASTER/blob/master/scripts/I_Data_obtention.ipynb): This notebook guides the user through the data obtention process, including:
     - Data download from publicly available repositories:
         - Inputs: Chromatin landscape (ATAC-seq, H3K4me3, H3K27ac and H3K27me3 in mESCs) and structure (Micro-C maps in mESCs)
         - Outputs: Nascent transcription profiles (EU-seq).
@@ -35,17 +40,30 @@ The folder contains the test set inputs for both data modalities, i.e. samples e
     - Data filtering and preprocessing:
         - Obtain numpy arrays for the inputs.
         - Obtain csv files for the targets.
-- `II_Run_CLASTER.ipynb`: This notebook creates the configuration files required to train and test CLASTER using the EIR framework.
-- `IIb_Run_HyenaDNA_and_Enformer.ipynb`: The notebook contains our adaptations of the code building
+- [`II_Run_CLASTER.ipynb`](https://github.com/RasmussenLab/CLASTER/blob/master/scripts/II_Run_CLASTER.ipynb): This notebook creates the configuration files required to train and test CLASTER using the EIR framework.
+- [`IIb_Run_HyenaDNA_and_Enformer.ipynb`](https://github.com/RasmussenLab/CLASTER/blob/master/scripts/IIb_Run_HyenaDNA_and_Enformer.ipynb): The notebook contains our adaptations of the code building
     - Hyena-DNA (https://github.com/HazyResearch/hyena-dna) in its public colab version.
     - Enformer (https://github.com/lucidrains/enformer-pytorch) in its python implementation. 
 These were used to benchmark CLASTER. It includes:
     - Extraction of sequence embeddings from both models' backbones when loading the pretrained weights.
     - The addition of a model head on top of the embeddings to match our regression outputs.
     - Code to fine-tune Hyena-DNA's backbone and the added head together.
-- `III_Data_analysis.ipynb`: The notebook contains the functions used to perform the data analysis and create the figures included in the manuscript.
-- `IV_Revisions.ipynb`: Code and analyses during the revisions.
+- [`III_Data_analysis.ipynb`](https://github.com/RasmussenLab/CLASTER/blob/master/scripts/III_Data_analysis.ipynb): The notebook contains the functions used to perform the data analysis and create the figures included in the manuscript.
+- [`IV_Revisions.ipynb`](https://github.com/RasmussenLab/CLASTER/blob/master/scripts/IV_Revisions.ipynb): Code and analyses added during the revisions. These include:
+    - Creation of EIR config files to define CLASTER model variants:
+        - Short context (20kbp).
+        - Different test split.
+        - No H3K27ac.
+        - Different loss functions.
+        - Different last layer activation functions.
+        - Adding promoter-capture Hi-C.
+    - Enhancer-centric perturbational analysis.
+    - Extended perturbations to unveil the learned regulatory logic.
+    - Extended performance metrics.
+    - Data distribution plots.
+    - Predicting RNA-seq and POLR2A ChIP-seq.
+        - Benchmarking _in silico_ enhancer silencing with CRISPR enhancer KO experiments on K562.
 
 ### `targets`
 
-The folder contains the target EU-seq profiles matching the input (test) samples.
+The folder contains target EU-seq profiles matching the input (test) samples.
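
The notebook diff that follows touches the scores used to rank enhancer-gene pairs for the Precision-Recall and ROC curves. Here is a minimal sketch of those three scores, assuming the integrated areas per gene before and after an _in silico_ enhancer silencing are already available. `enhancer_gene_scores` and its arguments are hypothetical names, and taking the absolute value in the ratio-to-max score is an assumption.

```python
def enhancer_gene_scores(distances_bp, baseline_areas, perturbed_areas):
    """Return three per-gene scores for one perturbed enhancer (one predicted window)."""
    # Absolute change in integrated signal caused by the perturbation.
    area_diffs = [abs(p - b) for b, p in zip(baseline_areas, perturbed_areas)]
    # Largest change in the window; guard against an empty or all-zero window.
    max_diff = max(area_diffs, default=0.0)
    return {
        # Distance baseline: closer genes get higher scores.
        "distance": [-d for d in distances_bp],
        # RNA-seq / POLR2A models: larger predicted change gets a higher score.
        "area_difference": area_diffs,
        # Ratio to max: change relative to the most affected gene in the window.
        "ratio_to_max": [d / max_diff if max_diff > 0 else 0.0 for d in area_diffs],
    }


# Two genes in one window: the nearer gene loses more signal after the perturbation.
scores = enhancer_gene_scores([1_000, 50_000], [10.0, 20.0], [4.0, 19.0])
```

Each score list can then be paired with the binary CRISPR labels to compute the curves with any standard metrics library.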
diff --git a/scripts/IV_Revisions.ipynb b/scripts/IV_Revisions.ipynb
@@ -8462,8 +8462,7 @@
     "\n",
     "for file,content in test_K562_POLR2A_enhancer_centric_yaml_contents.items():\n",
     "  with open(config_paths[18] / file, 'w') as f:\n",
-    "      f.write(content)\n",
-    "\n"
+    "      f.write(content)"
    ]
   },
   {
@@ -12054,7 +12053,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## <center> VI) Extending CLASTER to new cell types: K562 (human) <center>\n",
+    "## <center> VIII) Extending CLASTER to new cell types: K562 (human) <center>\n",
     "\n",
     "Reviewer 1 asked us to benchmark our _in silico_ perturbations with experimental data. Most experimental data on genome-wide enhancer KOs is obtained in K562 cells. We did not find nascent transcription data matching our protocol, and hence decided to predict two widespread transcriptional readouts: RNA-seq and POLR2A ChIP-seq.\n",
     "\n",
@@ -12696,12 +12695,13 @@
     ">   - Samples where either input or output crossed chromosome boundaries (enhancers at the ends of the chromosomes).\n",
     ">- Predict using models trained on K562 (human) data.\n",
     ">- Quantify POLR2A /RNA-seq changes:\n",
-    ">   - For POLR2A: Integrate between 1 kbp upstream and 2 kbp downstream of all genes in predicted window.\n",
+    ">   - For POLR2A: Integrate between 2 kbp upstream and 3 kbp downstream of all genes in predicted window. A (-1, 2) kbp window yielded similar results.\n",
     ">   - For RNA-seq: Integrate inside gene boundaries of all genes in predicted window.\n",
     ">- Downstream analyses:\n",
     ">   - Precision-Recall and ROC curves for the following models:\n",
     ">      - Gene-enhancer distance: $Score = - Distance$\n",
     ">      - RNA and POLR2A models: $Score = abs($ Area difference $)$\n",
+    ">      - Ratio-to-max models: area difference divided by the maximum area difference found in the predicted window.\n",
     ">   - Confusion matrices:\n",
     ">      - Primary target (most affected gene in a single prediction run): True / False\n",
     ">      - Closest gene: True / False\n",
@@ -13451,8 +13451,8 @@
     "    integration_type: str,\n",
     "    window_size: int = 200500,\n",
     "    resolution: int = 1000,\n",
-    "    upstream_bins: int = 1,\n",
-    "    downstream_bins: int = 2,\n",
+    "    upstream_bins: int = 2,\n",
+    "    downstream_bins: int = 3,\n",
     "    save_path: Path = None,\n",
     "    show_plot: bool = True\n",
     ") -> plt.Figure:\n",
@@ -14726,8 +14726,8 @@
     "    ax1.set_xlim(0,200)\n",
     "    fig.show()\n",
     "\n",
-    "    threshold_polr2a = 10\n",
-    "    merged_crispr_df = merged_crispr_df[(merged_crispr_df['baseline_area_polr2a'] > threshold_polr2a)]\n",
+    "    #threshold_polr2a = 10\n",
+    "    #merged_crispr_df = merged_crispr_df[(merged_crispr_df['baseline_area_polr2a'] > threshold_polr2a)]\n",
     "    \n",
     "    # 1. Plot correlation between methods\n",
     "    print(\"- Generating correlation scatter plot...\")\n",
@@ -14813,7 +14813,7 @@
    "source": [
     "**Plotting ground truth Enhancer-Gene pairs**\n",
     "\n",
-    "Here we will plot ground truth Enhancer-Gene pairs in K562 cells, obtained by CRISPR KO of enhancers and measuring the induced gene expression changes. This data was downloaded from the [Engreitz lab's github](https://github.com/EngreitzLab/CRISPR_comparison/tree/main/resources/crispr_data), referenced as a benchmarking dataset in [A. Gschwind et al.](https://doi.org/10.1101/2023.11.09.563812)."
+    "Here we will plot ground truth Enhancer-Gene pairs in K562 cells, obtained by CRISPR KO of enhancers and measuring the induced gene expression changes. This data was downloaded from the [Engreitz lab's github](https://github.com/EngreitzLab/CRISPR_comparison/tree/main/resources/crispr_data), referenced as a benchmarking dataset in [Gschwind et al.](https://doi.org/10.1101/2023.11.09.563812)."
    ]
   },
   {