The main objective of this project is to perform an ecological analysis of the cell types in NSCLC lung tumor tissues from scRNA-seq data, in both early and late tumor stages.
The chosen methodology is to annotate cell types with the help of a newly trained reference atlas. For this, we use the scvi framework. We base our work on the study by Salcher, Sturm, Horvath et al. 2022, where they gathered and standardized most of the data.
Additionally, we expanded the atlas with four more datasets, optimized the hyperparameters of our deep learning model, and trained it to also predict tumor cell types. The prediction was better than with the most used method InferCNV.
Subsequently, we obtained coabundance metrics. We chose the mutual inference (MI) metric, as it captures non-linear correlations in the data, and the methodology in the ARACNE-AP package ensures a smaller possibility of spurious correlations.
After that, we obtained various visualizations of the networks and extracted mesoscopic and global properties of the graphs.
The workflow for getting from the raw AnnData files to the coabundance graphs and their analysis can be seen in the figure below:
To see the order for running the workflow, got to the Folder Structure -> Notebooks and scripts section. There is an explanation of the notebooks in the order they sould be run
We have designed a Docker image that has all the necessary libraries. It is, however, very large because it includes all the R and Python packages, including those for ML. It is around 15GB without InferCNV and Ikarus. So if you just want to run a notebook separately, look at the versions of the packages in the Dokcefile.
To run the notebook, you should run the container with this notebook repository mounted as a volume. In the following command, $HOME/2021-SC-HCA-LATAM/CONTAINER is the path where your notebooks are, and the other path is where the large data files would be stored.
docker run --interactive --runtime=nvidia --gpus all --tty --name comp_onco --shm-size=200g\
--volume $HOME/2021-SC-HCA-LATAM/CONTAINER:/root/host_home --volume /datos:/root/datos --workdir /root/host_home/ netopaas/comp-onco:annots /bin/bashshm-size will alow you to run multiple workers in scvi.
You must install the apt package docker-nvidia for the GPU flags to work and, of course, have a working CUDA installation.
We publish some ports to use the Jupyter server.
After that, just run the command jl inside the container, and a Jupyter Lab server will be launched.
To acces it from a local computer you can do some port forwarding with something akin to this command:
ssh -p 5265 -N -f -L 5432:localhost:8888 sefirot.inmegen.gob.mxWhere 5432 is the localport where you will be running the server. 8888 is the port in the remote machine where juptyer server is being forwarded to by the container. And -p 5265 is the por tof the remote host. You have to have and ssh-agent with your user credentials for this to work. We recommend it as it frees you of the hassle of inputting a password everytime.
You can also skip the port forwarding and work in a more integrated environment, by using the remote explorer and dev containers feature of Vscode.
Just add a new host for the remote host via shh and open a new Vscode window inside that host. Vscode will automatically install all the necessary packages to be able to interact with the host. After that go to the Dev Containers tab in the Remote Explorer tab and choose the container that you ran as per the instructions above.
When opening a notebook, we recommend attaching to a running jupyter server, which is launched with the jl command in a shell in the container, like above.
This avoids losing the running kernel if you close the Vscode window you are working on.
A problem with Vscode is that it does not offer a way to shutdown notebooks in an external jupyter server, you'll have to do it manually with
a notebook to delete not recently used kernels in ./utils/cull-kernels.ipynb. It uses api calls.
The workflow is spread across various notebooks, where one can see the figures generated in the proccess and some short explanations. We plan to automatize the process with Nextflow, like they do in the repo of the scLUCA project.
But for now all the necessary steps are contained in the notebooks and scripts, though a little scrambled. The notebooks are explained in the order in which they should be run.
-
get_data.ipynb: Contains the code for exploring and downloading all possible datasets to extend the scLUCA atlas. It was important to have UMI counts and tumor stage annotation.
-
nb_filter: Filters the cells and genes by predefined quality control metrics.
-
nb_refAtlas: Contains the notebook
vae_raytunefor running and inspecting various experiments of hyperparameter exploration. The notebookscANVImodel_WuBatchhas the reasoning and training of the actual reference atlas, used everywhere else. -
nb_annotRefatlas: Annotates tissue from a new study by doing surgery and has integrated quality plots. Also annotates broad tumor cell types. Includes a notebook for annotation with label transfer via neighbors, but it had worse outcomes (
labelTransfer). There is also a notebook that transfers the newly created clusters to another dataset (extendPreds_{dataset}.ipynb). -
nb_subCluster: It includes
Tumor_subcluster.ipynbthat redefines the atlas with new unsupervised tumor cells. Some attempts to accelerate this withfaiss(GPU accel) are in the python scripts starting with faiss.HCL.ipynbis an attempt to do a hierarchical statisitcally siginifcant approach tu clustering. But we abandoned it because it depends on the order of addition of the dataset.
-
grouping_MI.ipynb: It contains the crucial parts of gathering all annotations, grouping them by samples and cell type, running the coabundance analysis by extracting MI.
-
nb_graphAnalysis:
- graph_param.ipynb Parametrized notebook for doing bipartite analysis of our networks. There isnt much to analyze because there arent that many varied coabundances between immune and tumoral.
- graph_layouts.ipynb Leiden, Newman, n-clique, interactive, and functionally enriched visualizations of the networks for different clusterings and times.
- pearson_compare.ipynb Adds pearson correlation information to the existing MI graph and analyzes pairs of vars that are not pearson correlated and their profile. Also for the negative ones.
- sbm_cluster.ipynb Does stochastic block modelling (thanks to the impressive package
graph-tools) of the network and outputs the most beautifual circos plot layouts of the graphs. Also very informative. - abund_clusters.ipynb Leiden clustering of the abundances matrix, with lots of informative graphics of them like Degree centrality plots, Graphs, Snakey plots, Boxplots and such
The dependecies of
graph-toolsare a hassle, so we recommend running it in their container. Instructions are in the nb. - graph_param.ipynb Extract more mesoscopic features of the graph, like redundancy and degree plots...
- circos.ipynb Circos plots with igraph, not neccesary with the help of
graph-toolsfor cells types. But for cell categories the inter categories coabundances plots are here. - cellphone.ipynb Cellphone circos plots and dot plots for late, early and their ecotypes
-
nb_DE_wilcox:
- _.ipynb Extracts marker genes of clusters from existing cell annotations with the Wilcox method. It also enriches for Hallmark gene ontologies. Its a bit bit convoluted and doesnt consider batch effects, but only because scanpy doenst consider abundance of cell types.
- modal_DE.py script to run the Wilcox marker gene extraction in modal. It generalizes well for all datasets, but one has to upload the files to the volume manually for now. Additionally it has some flags to run it locally. We prefer this to the notebooks.. that are only left there for the figures and explanations of older runs.
- test_pipeline.py are some basic tests for modal_DE.py
- aggregate_markers.py aggregate all dsets marker genes, and plot the highest. Also plot GSEA NES scores for cancer hallmarks. Also has a function to convert all marker genes objects to a count matrix composed of AUC of the genes, useful for CellphoneDB without statistical analysis (Method 1).
- DE_incomplete.ipynb is incomplete and attempts to do pseudo-bulk differential expression with MAST.
-
nb_DE_SCT: Extracts marker genes of clusters from existing cell annotations with the GLM method
SCTransform v2. It also corrects for batch effects per sample and enriches for Hallmark gene ontologies. This method has a parameter estimation method that corrects for lowly expressed genes and is much faster than thelvm_DEmethod, which takes advantage of our dimensional reduction VAEs with scANVI. Thetest_de.pyscript is to be run in the lambda function service Modal, as it requires a lot of GPU RAM and takes around 35 minutes for 3 samples. It uses thelvm_DEmethod mentioned above. -
nb_ikarus: Runs the ikarus prediction on every dataset from
ikaurs_fabric.py. A prediction that uses logistic regression and network projection to predict tumor cells. -
nb_infercnv: Runs InferCNV on every dataset from
infercnv_fabric.py. This infers from transcripts, places in the chromosomes where there should be copy number variations, hence cancer cells. -
nb_tumorUMAP: Notebook to check the tumor predictions. It has the DE part integrated.
Tumor_Annot.ipynbcontains explanations of the methods used. -
nb_annot: Annotates tissue from every study with Lung Atlas reference maps. Only healthy cells.
- outputARACNE: Contains all the files for the generation and output of the networks by ARACNE, including functional enrichment.
- metadata: Contains information about the studies used
dsets.csvand data about the number of cells per samplegroups_**. The filecell_mappings.jsoncentralizes shared dictionaries such as therapy labels, category colors, and cell-type groupings used across notebooks. - utils: Contains custom plotting and analysis functions.
- abundance.py: Functions for visualizing the abundance of cell types, subgrouping into ecotypes and plotting some ecotype networks. Is used a lot by
abund_clusters.ipynb. - change_cells.py: Contains the renaming of the cells defined in subclustering by functional and marker genes information Renames the netowrk, the groups and the count_matrix files
- postprocess_cluster_networks.py: Recreates the Pearson-vs-MI analysis from the notebook for every cluster, annotates ARACNe networks with Pearson stats, exports nonlinear edge tables, and can render the trio of scatter/heatmap plots used to inspect MI-only pairs.
- run_aracne_by_cluster.py: Prepares per-cluster count matrices and transcription factor lists, runs ARACNe-AP with the chosen bootstrap parameters, and writes the consolidated mutual-information networks ready for downstream annotation.
- circos_cluster_plots.py: Loads the annotated cluster networks, fits stochastic block models with graph-tool, and regenerates the circos-style blockmodel visualizations along with cached block states.
- plot_cell_pair.py: Plots scatter plots of cell pairs correlations for a specific cluster/stage, comparing linear vs log scales. Option
--no-plus1allows plotting raw counts. Output filename is auto-generated if not provided.
- abundance.py: Functions for visualizing the abundance of cell types, subgrouping into ecotypes and plotting some ecotype networks. Is used a lot by
The only data neccesary to run the workflow from scratch are:
/root/datos/maestria/netopaas/luca/data/atlas/extended.h5adWhich can be gotten from https://cellxgene.cziscience.com/collections/edb893ee-4066-4128-9aec-5eb2b03f8287/root/datos/maestria/netopaas/<ds_suffix><year>/The datasets downloaded in the noteboookget_data.ipynb. We tried lots but stayed with only Bishoff, Hu, Zuani and Deng.- Also the metadata in
metadata/dsets.csvdescribes all the datasets and one can check the inclusion criteria for the analysis in there.
-
/root/datos/maestria/netopaas/luca/data/atlas/ Has the LUCA atlas files from Salcher et. al. It has pretty extensive metadata. We have saved some subsets, the most import being 'extended_tumor_hvg.h5ad' that uses the extended atlas, but only has tumor samples and highly var genes.
-
/root/datos/maestria/netopaas/<ds_suffix>/ Folder where all the files necessary to download the dataset to an acceptable h5ad are.
-
/root/datos/maestria/netopaas/lung_scRNA/LUCA_model/ Contains the models trained by us, but also the ones from LUCA. In particular '/hvg_integrated_scvi_scanvi_tumor_model_b128_lay4_h1024_raydefault_epocs300-300/' is the one with the highest accuracy, trained in 'nb_ref_Atlas/scANVImodel.ipynb'
-
/root/datos/maestria/netopaas/luca_explore/surgeries/ Has most of the checkpoints in our workflow.
-
filtered_<ds_suffix>.h5ad Is the adata of the dataset after QC filtering.
-
query_<ds_suffix>.h5ad Is the adata of the dataset in the necessary fromat to be annotated by the model.
-
<dataset_suffix>/ Folder that contains the trained surgery model in 'model.pt'. Most of them also have an adata of the latent vars in
query_latent.h5ad. Sometimes a csv to convert from gen symbols to ensembl in<ds_suffix>_ensembl.csv -
<ds_suffix>_predicted.csv The cell annotations with just our Tumor LUCA model. It also has all the entries of the obs matrix, so stage and such.
-
Subcluster/
- query_latent__.h5ad The merged embedding of all the annotated datasets for subclustering. SUbset to tumor and respective time stages.
- atlas__umap.csv The umap coords calculated online in modal.
- atlas__uparams.json The umap coords params calculated online in modal. Also params for the leiden algo.
- atlas__leiden.csv The subclustering assignment of cells.
- mapping__leiden.json The broad cluster annotation by chekcing contingency with previous cluster names.
-
<ds_suffix>_predicted_leiden.csv The cell anotations with tumor subclustering.
-
Group_checks/ Folder where the checkpoints for concatenating the groups are saved.
-
-
<notebooks_folder>/nb_DE_wilcox We save the marker genes and heatmap plots and its corresponding data here.
- wilcoxon_DE/[tumor]pair.npy The DE of pairs of cells for every gene. The tumor refers means that it is subsetted to certain cells. Pretty big files
- wilcoxon_DE/.npy The DE of cells in the region against all other cells, which is also denominated in . This is what is used for marker genes, and subsequent GSEA.
- wilcoxon_DE/markergenes_.png The plot of the 20 highest DE genes per cell type in region.
- gseapy_gsea/heatmap_.[csv][png] Table for and heatmap plot of the enrichment of cancer hallmarks for every cell type in region.
We have kept some varied docker images because of compatiblity issues. They are orderd from newest to oldest.
-
netopaas/comp-onco:annots 14.6G The image below but with BioMaRt. Was tested to be able to annotate new datasets with scvi.
-
netopaas/comp-onco:sctransform 16G Added support for sctransform, our only candidate for batch effect correction a the expression level. Does not have biomaRt. Does not have pandas==1.5.3
-
netopaas/comp_onco:r4 14.1G Newer R version to be able to convert from and to SeuratObject and Anndata. Does not have biomart. Does not have pandas==1.5.3
-
netopaas/comp-onco:raytune 13.6G Some fixes to be able to run the hyperparameter optimization tool raytune. This one does not have cuda aware jax.
-
netopaas/scarches:raytune 19.9 G The same but with more R packages than comp-onco. This one has jax.
-
netopaas/scarches:droplet 85G This one has infercnv, ikarus and scFusion that take a lot of space. It wont be able to run raytune and sctransform and others.
-
tiagopeixoto/graph-tool:latest 3.18G For creating our beautiful circos graphs. The dependencies are hell so we have them in a separate container.
-
netopaas/faiss:cugraph-24-12 For doing faiss nearest neghbours, leiden and umap much faster and using Modal. It has been a nightmare to keep the dependecies correct in this one.
Due to the long training and annealing times, Jupyter Lab sometimes cannot connect. Use this to get a console to the kernel:
jupyter console --existing /root/.local/share/jupyter/runtime/kernel-9ff04919-e8c2-4ecf-92ce-c66b988720e5.jsonFor this you need to have pip install jupyter-console installed.
This could be easier in a newer version of Jupyter Lab. To locate the corresponding JSON file, you can use htop with the option to not display user branches and see the memory it is using.
It is important to install the Docker NVIDIA package to transfer your CUDA installation to the containers.
If you keep your repository inside the container via volumes, the user might change. We recommend configuring the SSH keys inside the container. These are the steps:
git config --global user.email ernesto.paas@ciencias.unam.mx
git config --global user.name "Ernesto Paas"We suggest saving a key pair that can be generated with the command ssh-keygen -t ed25519 -C "your_email@example.com" in the Docker volume (folder) that contains the repository, and then copying them to ~/.ssh/id_ed25519 etc., to have SSH authentication with GitHub. Remember to have the SSH agent activated eval "$(ssh-agent -s)" and the key added ssh-add ~/.ssh/id_ed25519. Ensure that the key is read-only.
