Commit 36b42d0

🎨 Updates
- Readme updated - Revisions notebook cleaned, POLR2A integration domains set to default.
1 parent b194860 commit 36b42d0

2 files changed

+34
-16
lines changed

Readme.md

Lines changed: 25 additions & 7 deletions
@@ -16,6 +16,9 @@ This repository contains the files and scripts required to reproduce the results
1616
### `configurations`
1717
- Configuration files (.yaml) required to build different flavours of CLASTER.
1818

19+
### `environment`
20+
- We provide a predefined environment configuration file to avoid compatibility issues between package versions when running the tutorial.
21+
1922
### `images`
2023
- Overview of CLASTER's architecture.
2124

@@ -25,8 +28,10 @@ The folder contains the test set inputs for both data modalities, i.e. samples e
2528

2629
### `scripts`
2730

28-
- [`0_Tutorial.ipynb`](https://github.com/RasmussenLab/CLASTER/blob/master/scripts/0_Tutorial.ipynb): The notebook provides a rapid overview of the most important steps in CLASTER's pipeline, including training and validating the network using the EIR framework.
29-
- `I_Data_obtention.ipynb`: This notebook guides the user through the data obtention process, including:
31+
- **prom_CHiC_preprocessing**: Folder containing the scripts used to obtain promoter-capture HiC cooler files from the raw reads deposited in SRA files.
32+
33+
- [`0_Tutorial.ipynb`](https://github.com/RasmussenLab/CLASTER/blob/master/scripts/0_Tutorial.ipynb): The notebook provides a rapid overview of the most important steps in CLASTER's pipeline, including training and validating the network using the EIR framework. Please have a look at `I_Data_obtention.ipynb` to get more information on how to download publicly available data and convert it into an EIR-friendly format.
34+
- [`I_Data_obtention.ipynb`](https://github.com/RasmussenLab/CLASTER/blob/master/scripts/I_Data_obtention.ipynb): This notebook guides the user through the data obtention process, including:
3035
- Data download from publicly available repositories:
3136
- Inputs: Chromatin landscape (ATAC-seq, H3K4me3, H3K27ac and H3K27me3 in mESCs) and structure (Micro-C maps in mESCs)
3237
- Outputs: Nascent transcription profiles (EU-seq).
@@ -35,17 +40,30 @@ The folder contains the test set inputs for both data modalities, i.e. samples e
3540
- Data filtering and preprocessing:
3641
- Obtain numpy arrays for the inputs.
3742
- Obtain csv files for the targets.
38-
- `II_Run_CLASTER.ipynb`: This notebook creates the configuration files required to train and test CLASTER using the EIR framework.
39-
- `IIb_Run_HyenaDNA_and_Enformer.ipynb`: The notebook contains our adaptations of the code building
43+
- [`II_Run_CLASTER.ipynb`](https://github.com/RasmussenLab/CLASTER/blob/master/scripts/II_Run_CLASTER.ipynb): This notebook creates the configuration files required to train and test CLASTER using the EIR framework.
44+
- [`IIb_Run_HyenaDNA_and_Enformer.ipynb`](https://github.com/RasmussenLab/CLASTER/blob/master/scripts/IIb_Run_HyenaDNA_and_Enformer.ipynb): The notebook contains our adaptations of the code building
4045
- Hyena-DNA (https://github.com/HazyResearch/hyena-dna) in its public colab version.
4146
- Enformer (https://github.com/lucidrains/enformer-pytorch) in its python implementation.
4247
These were used to benchmark CLASTER. It includes:
4348
- Extraction of sequence embeddings from both models' backbones after loading the pretrained weights.
4449
- The addition of a model head on top of the embeddings to match our regression outputs.
4550
- Code to fine-tune Hyena-DNA's backbone and the added head together.
46-
- `III_Data_analysis.ipynb`: The notebook contains the functions used to perform the data analysis and create the figures included in the manuscript.
47-
- `IV_Revisions.ipynb`: Code and analyses during the revisions.
51+
- [`III_Data_analysis.ipynb`](https://github.com/RasmussenLab/CLASTER/blob/master/scripts/III_Data_analysis.ipynb): The notebook contains the functions used to perform the data analysis and create the figures included in the manuscript.
52+
- [`IV_Revisions.ipynb`](https://github.com/RasmussenLab/CLASTER/blob/master/scripts/IV_Revisions.ipynb): Code and analyses added during the revisions. These include:
53+
- Creation of EIR config files to define CLASTER model variants:
54+
- Short context (20kbp).
55+
- Different test split.
56+
- No H3K27ac.
57+
- Different loss functions.
58+
- Different last layer activation functions.
59+
- Adding promoter-capture Hi-C.
60+
- Enhancer-centric perturbational analysis.
61+
- Extended perturbations to unveil the learned regulatory logic.
62+
- Extended performance metrics.
63+
- Data distribution plots.
64+
- Predicting RNA-seq and POLR2A ChIP-seq.
65+
- Benchmarking _in silico_ enhancer silencing with CRISPR enhancer KO experiments on K562.
4866

4967
### `targets`
5068

51-
The folder contains the target EU-seq profiles matching the input (test) samples.
69+
The folder contains target EU-seq profiles matching the input (test) samples.

scripts/IV_Revisions.ipynb

Lines changed: 9 additions & 9 deletions
@@ -8462,8 +8462,7 @@
84628462
"\n",
84638463
"for file,content in test_K562_POLR2A_enhancer_centric_yaml_contents.items():\n",
84648464
" with open(config_paths[18] / file, 'w') as f:\n",
8465-
" f.write(content)\n",
8466-
"\n"
8465+
" f.write(content)"
84678466
]
84688467
},
84698468
{
@@ -12054,7 +12053,7 @@
1205412053
"cell_type": "markdown",
1205512054
"metadata": {},
1205612055
"source": [
12057-
"## <center> VI) Extending CLASTER to new cell types: K562 (human) <center>\n",
12056+
"## <center> VIII) Extending CLASTER to new cell types: K562 (human) <center>\n",
1205812057
"\n",
1205912058
"Reviewer 1 asked us to benchmark our _in silico_ perturbations with experimental data. Most experimental data on genome-wide enhancer KOs is obtained in K562 cells. We did not find nascent transcription data matching our protocol, and hence decided to predict two widespread transcriptional readouts: RNA-seq and POLR2A ChIP-seq.\n",
1206012059
"\n",
@@ -12696,12 +12695,13 @@
1269612695
"> - Samples where either input or output crossed chromosome boundaries (enhancers at the ends of the chromosomes).\n",
1269712696
">- Predict using models trained on K562 (human) data.\n",
1269812697
">- Quantify POLR2A /RNA-seq changes:\n",
12699-
"> - For POLR2A: Integrate between 1 kbp upstream and 2 kbp downstream of all genes in predicted window.\n",
12698+
"> - For POLR2A: Integrate between 2 kbp upstream and 3 kbp downstream of all genes in predicted window. (-1,2) kbp yielded similar results.\n",
1270012699
"> - For RNA-seq: Integrate inside gene boundaries of all genes in predicted window.\n",
1270112700
">- Downstream analyses:\n",
1270212701
"> - Precision-Recall and ROC curves for the following models:\n",
1270312702
"> - Gene-enhancer distance: $Score = - Distance$\n",
1270412703
"> - RNA and POLR2A models: $Score = abs($ Area difference $)$\n",
12704+
"> - Ratio to max models: area difference divided by max area difference found in predicted window.\n",
1270512705
"> - Confusion matrices:\n",
1270612706
"> - Primary target (most affected gene in a single prediction run): True / False\n",
1270712707
"> - Closest gene: True / False\n",
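The scoring rules listed in this hunk (distance score, absolute area difference, and the newly added ratio-to-max score) can be illustrated with a minimal sketch. The function names and the example numbers below are hypothetical and are not taken from the notebook; the sketch only assumes that each gene in a predicted window has one integrated area before and one after the in silico enhancer silencing.

```python
import numpy as np


def distance_score(distances_bp):
    """Gene-enhancer distance baseline: Score = -Distance (closer genes rank higher)."""
    return -np.asarray(distances_bp, dtype=float)


def area_difference_scores(baseline_areas, perturbed_areas):
    """RNA / POLR2A models: Score = abs(area difference) per gene."""
    baseline = np.asarray(baseline_areas, dtype=float)
    perturbed = np.asarray(perturbed_areas, dtype=float)
    return np.abs(perturbed - baseline)


def ratio_to_max(scores):
    """Ratio-to-max variant: divide by the largest area difference in the window."""
    scores = np.asarray(scores, dtype=float)
    max_score = scores.max()
    if max_score <= 0:
        # No gene changed in this window, so every ratio is zero.
        return np.zeros_like(scores)
    return scores / max_score


def primary_target_index(scores):
    """Index of the most affected gene in a single prediction run."""
    return int(np.argmax(scores))


# Hypothetical window with three genes.
scores = area_difference_scores([10.0, 4.0, 8.0], [7.0, 4.0, 9.0])
ratios = ratio_to_max(scores)
```

Scores of this kind can then be compared against the CRISPR ground-truth labels to build the precision-recall and ROC curves mentioned above.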
@@ -13451,8 +13451,8 @@
1345113451
" integration_type: str,\n",
1345213452
" window_size: int = 200500,\n",
1345313453
" resolution: int = 1000,\n",
13454-
" upstream_bins: int = 1,\n",
13455-
" downstream_bins: int = 2,\n",
13454+
" upstream_bins: int = 2,\n",
13455+
" downstream_bins: int = 3,\n",
1345613456
" save_path: Path = None,\n",
1345713457
" show_plot: bool = True\n",
1345813458
") -> plt.Figure:\n",
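With `resolution = 1000`, one bin corresponds to 1 kbp, so the new defaults `upstream_bins = 2` and `downstream_bins = 3` match the (-2, 3) kbp POLR2A integration window described in the markdown cell. The sketch below shows one way such a window could be integrated over a binned track. It is an assumption-laden illustration, not the notebook's implementation: `integrate_around_gene` and `gene_start_bin` are hypothetical names, the window is assumed to be measured from the gene start, and strand handling is omitted.

```python
import numpy as np


def integrate_around_gene(track, gene_start_bin, upstream_bins=2, downstream_bins=3):
    """Sum a binned signal from `upstream_bins` before the gene start
    to `downstream_bins` after it, clipped to the track boundaries."""
    track = np.asarray(track, dtype=float)
    start = max(gene_start_bin - upstream_bins, 0)
    stop = min(gene_start_bin + downstream_bins, track.shape[0])
    return float(track[start:stop].sum())


# Hypothetical 10-bin track; real tracks would hold predicted POLR2A signal.
toy_track = np.arange(10.0)
area_default = integrate_around_gene(toy_track, gene_start_bin=5)
area_previous = integrate_around_gene(
    toy_track, gene_start_bin=5, upstream_bins=1, downstream_bins=2
)
```

The second call uses the previous (-1, 2) kbp setting, which the commit notes as yielding similar results.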
@@ -14726,8 +14726,8 @@
1472614726
" ax1.set_xlim(0,200)\n",
1472714727
" fig.show()\n",
1472814728
"\n",
14729-
" threshold_polr2a = 10\n",
14730-
" merged_crispr_df = merged_crispr_df[(merged_crispr_df['baseline_area_polr2a'] > threshold_polr2a)]\n",
14729+
" #threshold_polr2a = 10\n",
14730+
" #merged_crispr_df = merged_crispr_df[(merged_crispr_df['baseline_area_polr2a'] > threshold_polr2a)]\n",
1473114731
" \n",
1473214732
" # 1. Plot correlation between methods\n",
1473314733
" print(\"- Generating correlation scatter plot...\")\n",
@@ -14813,7 +14813,7 @@
1481314813
"source": [
1481414814
"**Plotting ground truth Enhancer-Gene pairs**\n",
1481514815
"\n",
14816-
"Here we will plot ground truth Enhancer-Gene pairs in K562 cells, obtained by CRISPR KO of enhancers and measuring the induced gene expression changes. This data was downloaded from the [Engreitz lab's github](https://github.com/EngreitzLab/CRISPR_comparison/tree/main/resources/crispr_data), referenced as a benchmarking dataset in [A. Gschwind et al.](https://doi.org/10.1101/2023.11.09.563812)."
14816+
"Here we will plot ground truth Enhancer-Gene pairs in K562 cells, obtained by CRISPR KO of enhancers and measuring the induced gene expression changes. This data was downloaded from the [Engreitz lab's github](https://github.com/EngreitzLab/CRISPR_comparison/tree/main/resources/crispr_data), referenced as a benchmarking dataset in [Gschwind et al.](https://doi.org/10.1101/2023.11.09.563812)."
1481714817
]
1481814818
},
1481914819
{
