|
21 | 21 | "\n", |
22 | 22 | "Hi! \n", |
23 | 23 | "\n", |
24 | | - "Welcome to this small tutorial on how to build and run CLASTER using the EIR framework. Please clone this repository (CLASTER) on your computer to start. We will guide you through the main steps of the pipeline using a subset of our data, and finish the tutorial providing guidelines on how to extend the analyses to your own datasets.\n", |
| 24 | + "Welcome to this small tutorial on how to build and run CLASTER using the EIR framework. Please clone this repository (CLASTER) on your computer to start. We will guide you through the main steps required to train and predict using CLASTER, and also how to adapt the pipeline to work with own data.\n", |
25 | 25 | "\n", |
26 | 26 | "### About CLASTER\n", |
27 | 27 | "\n", |
|
58 | 58 | "\n", |
59 | 59 | ">💻 **Create an environment for the project:**\n", |
60 | 60 | ">\n", |
61 | | - ">The following steps will be performed from the terminal, and once the environment is set up we will run everything else from this notebook.\n", |
62 | | - ">First we need to create an environment for this project, where all the required dependencies will be installed. We provide an environment file ```../environment/claster-env.yml``` to ease reproducibility and avoid conflicts between versions of different packages.\n", |
63 | | - ">If you have anaconda, the environment can be created from the terminal by typing:\n", |
| 61 | + ">The following steps will be performed from the terminal, and once the environment is set up we will run everything else from this notebook. \n", |
| 62 | + ">We will first create an environment for this project, where we will install all the required dependencies. To ease the process and avoid conflicts between versions of different dependencies, we provide a working environment configuration file, ```../environment/claster-env.yml```.\n", |
| 63 | + ">If you have anaconda, the environment can be created using the yml file from the terminal by typing:\n", |
64 | 64 | ">\n", |
65 | 65 | ">```bash\n", |
66 | | - ">conda env create -f ../environment/claster-env.yml # Create environment from predefined yml file\n", |
67 | | - ">conda activate claster-env # Activate it" |
| 66 | + ">conda env create -f ../environment/claster-env.yml # Create environment\n", |
| 67 | + ">conda activate claster-env # Activate it\n", |
| 68 | + ">```" |
68 | 69 | ] |
69 | 70 | }, |
70 | 71 | { |
|
111 | 112 | "- Input arrays can be found at the folders ```inputs/landscape_arrays/test/``` and ```inputs/microC_rotated/test/```. \n", |
112 | 113 | "- The matching target profiles are given in a tabular format and can be found in ```targets/test_targets.csv```.\n", |
113 | 114 | "\n", |
114 | | - "> *Note: As a standard data augmentation procedure, samples were provided in their natural orientation (SampleID_forward.npy) and flipped. (SampleID_forward.npy)*" |
| 115 | + "> *Note: As a standard data augmentation procedure, samples were provided in their natural orientation (SampleID_forward.npy) and flipped. (SampleID_forward.npy)*\n", |
| 116 | + "\n", |
| 117 | + ">**How do I extend this to my dataset?**\n", |
| 118 | + ">\n", |
| 119 | + ">*Inputs:* \n", |
| 120 | + ">\n", |
| 121 | + ">We need to store all input samples in a folder, e.g. ```/inputs/```. Each sample will be a numpy array with the name {SAMPLE_ID}.npy, of shape (#tracks, sequence length). In our case, #tracks = 4 (ATAC, H3K4me3, H3K27ac, H3K27me3) and sequence length = 10001 (bins of 100bp).\n", |
| 122 | + ">\n", |
| 123 | + ">*Targets:*\n", |
| 124 | + ">\n", |
| 125 | + ">Targets are provided as a table, where:\n", |
| 126 | + ">- Columns are ID + name_of_output_1, name_of_ouput_2, etc. In our case we called them ID, -200_ctrl, -199_ctrl, etc.\n", |
| 127 | + ">- Rows correspond to the sample ID (without .npy) and the table is filled with target values. In our case these were 1kbp read averages for 401 output nodes." |
115 | 128 | ] |
116 | 129 | }, |
117 | 130 | { |
|
1125 | 1138 | "cell_type": "markdown", |
1126 | 1139 | "metadata": {}, |
1127 | 1140 | "source": [ |
1128 | | - "## 3. How do I adapt the pipeline to my data?\n", |
| 1141 | + "## 3. How to adapt it to work with your own data\n", |
| 1142 | + "\n", |
| 1143 | + "Input and target files were already given in an eir-friendly format for this tutorial. Now we will provide some guidelines on how to adapt this pipeline to train CLASTER with your own data. For more details, please have a look at the notebook ```I_Data_obtention.ipynb```. We also refer the reader to [eir.readthedocs.io](https://eir.readthedocs.io/en/latest/) for more information on how to extend the analyses to new data modalities.\n", |
1129 | 1144 | "\n", |
1130 | | - "Here we provide some guidelines on how to transform BigWig files into our desired, eir-friendly input and target data formats. More detailed information can be found in `I_Data_obtention.ipynb`. We refer the reader to [eir.readthedocs.io](https://eir.readthedocs.io/en/latest/) to find more extensive documentation on how to potentially extend this approach to new contexts and data modalities.\n", |
| 1145 | + "**You need:**\n", |
| 1146 | + "- BigWig files containing genomewide chromatin mark enrichments. These can usually be found in [NCBI's GEO](https://www.ncbi.nlm.nih.gov/geo/).\n", |
| 1147 | + "- Gene annotations file. Gene annotations were obtained for hg19 from [gencode](https://www.gencodegenes.org/human/release_19.html). We used hg19 because the input data from the Danko lab was mapped to hg19 originally.\n", |
1131 | 1148 | "\n", |
1132 | 1149 | "**Inputs:**\n", |
| 1150 | + "- Chromatin landscape inputs are provided as numpy arrays of shape (number of tracks, sequence length). \n", |
| 1151 | + "- In CLASTER, number of tracks is 4 (ATAC-seq, H3K4me3, H3K27ac, H3K27me3) and sequence length is 10001 (bins at 100bp resolution). \n", |
| 1152 | + "- Given a BigWig file with the enrichment of a chromatin mark, we can use the python package `pyBigWig` to extract a numpy array with the enrichment of the signal inside predefined genomic boundaries in a desired number of bins as follows:\n", |
| 1153 | + "\n", |
| 1154 | + "```python \n", |
| 1155 | + " bw = pyBigWig.open(str(data_path / bw_path), \"r\")\n", |
| 1156 | + " stats = bw.stats(chrom,start,end,type=\"mean\",nBins=n_input_bins)\n", |
| 1157 | + " bw.close()\n", |
| 1158 | + " stats = np.array([float(value) if value is not None else 0. for value in stats])\n", |
| 1159 | + " stats = np.clip(np.array(stats),0,None) # ReLU\n", |
| 1160 | + "```\n", |
| 1161 | + "- We centered our samples at the TSS of protein coding genes found in the gene annotations file, and used their Ensemble ID to name each input sample (e.g. `ENSMUSG00000000085.16_forward.npy`). All input samples were stored in the same folder, `inputs/landscape_arrays/`.\n", |
| 1162 | + "- We then just need to stack the 1D arrays for the different marks in the same region, and give the resulting array a name or ID. \n", |
1133 | 1163 | "\n", |
1134 | | - "- We need to store all input samples in a folder, e.g. ```/inputs/```. \n", |
1135 | | - "- Each input sample is a numpy array with the name {SAMPLE_ID}.npy, of shape ($n_{tracks}$, $seqlen$). In our case, $n_{tracks}=4$ (ATAC, H3K4me3, H3K27ac, H3K27me3) and sequence length = 10001 (bins of 100bp). \n", |
1136 | | - " - We can obtain these samples using the package ```pyBigWig```. In particular, given a BigWig file with our chromatin mark:\n", |
1137 | | - " ```python\n", |
1138 | | - " bw = pyBigWig.open(str(data_path / bw_path), \"r\")\n", |
1139 | | - " stats = bw.stats(chromosome,start_coordinate,end_coordinate,type=\"mean\",nBins=n_input_bins)\n", |
1140 | | - " stats = np.array([float(value) if value is not None else 0. for value in stats])\n", |
1141 | | - " ```\n", |
1142 | | - " This would provide us a row in the input array, and we'd need to join the different marks (See `I_Data_obtention.ipynb` for details). \n", |
| 1164 | + "**Targets:**\n", |
1143 | 1165 | "\n", |
1144 | | - "- CLASTER can be extended to different numbers of tracks, but that requires editing the corresponding configuration file used to build the model. In our case, we would modify in ```input_cnn.yaml``` the value of ```first_kernel_expansion_height``` to the new value of $n_{channels}$.\n", |
| 1166 | + "- Targets are provided as a table, e.g. a csv file. \n", |
| 1167 | + "- The header contains an index as the first column ('ID') and as many columns as output targets. In CLASTER, we named them `-200_ctrl`,`-199_ctrl`,...,`199_ctrl`,`200_ctrl`, i.e. we had 401 output nodes.\n", |
| 1168 | + "- Rows are then named after the sample ID (without the .npy prefix), and are filled with the target values matching our inputs. In CLASTER, we had EU-seq enrichment values binned at 1kbp resolution for the central 401 bins.\n", |
1145 | 1169 | "\n", |
1146 | | - "**Targets:**\n", |
| 1170 | + "- You can follow the same steps as above: \n", |
| 1171 | + " - Download the bigwig file for the target genomic track.\n", |
| 1172 | + " - Extract the signal inside our desired boundaries and resolution using `pyBigWig`.\n", |
| 1173 | + " - Add the array as a row in the targets csv file matching the ID of the corresponding input.\n", |
1147 | 1174 | "\n", |
1148 | | - "Targets are provided as table or dataframe, where:\n", |
1149 | | - "- Columns are ID + name_of_output_1, name_of_ouput_2, etc. In our case we called them ID, -200_ctrl, -199_ctrl, etc.\n", |
1150 | | - "- Rows correspond to the sample ID ({SAMPLE_ID} without .npy) and the table is filled with target values. In the manuscript these were 1kbp read averages for EU-seq provided in 401 output nodes.\n", |
1151 | | - "- Hence, we can follow the same principles as above. First, load the BigWig file with the output track we want to predict, extract the values as a numpy array within the boundaries of interest, and update the values in the row corresponding to our sample (See `I_Data_obtention.ipynb` for details.)" |
| 1175 | + "For more advanced analyses like _in silico_ perturbations of the inputs, we refer the reader to `IV_Revisions.ipynb` and `III_Data_analysis.ipynb` for an earlier version of the perturbations." |
1152 | 1176 | ] |
1153 | 1177 | }, |
1154 | 1178 | { |
|