Bradley Buchner
This repository contains the scripts and notebooks for my final project in the course CS7170 (AI for Complex Systems Modeling), which centers on an exploratory analysis of biVI: a generative model introduced by Carilli et al. (2024) as a biophysically-grounded alternative to scVI (Lopez et al., 2018) for analyzing scRNA-seq data. To run the analysis yourself, go to 'bivi_investigation.ipynb' and click "Run in Colab" at the top of the file.
While traditional VAE models for single-cell data analysis like scVI treat nascent (
My project builds on the authors' analysis of biVI by addressing two questions:
- Can biVI distinguish between real and biologically compromised data better than scVI?
- How much biological information does biVI capture that scVI misses, and can it be quantified?
By answering these, I aim to demonstrate the value of biVI's biophysical grounding in more depth.
For the first question, I designed an experiment to evaluate each model's ability to reconstruct real and biologically compromised data. To create the compromised dataset I randomly shuffled mature (
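As a rough illustration of the idea, the sketch below shows one way a compromised dataset could be built with NumPy. It is not the notebook's code. It assumes that nascent and mature counts are stored as two arrays of shape (cells, genes) and that "shuffling" means permuting each gene's mature counts across cells, which keeps every gene's marginal distribution but breaks the pairing between nascent and mature counts within a cell. The function name `shuffle_mature_counts` is hypothetical.

```python
import numpy as np


def shuffle_mature_counts(nascent, mature, seed=0):
    """Return (nascent, shuffled_mature) as a 'compromised' dataset.

    Assumption: both inputs are (cells, genes) count arrays. Each gene's
    mature counts are permuted across cells independently of other genes,
    so per-gene marginals are preserved while the within-cell pairing of
    nascent and mature counts is destroyed.
    """
    nascent = np.asarray(nascent)
    mature = np.asarray(mature)
    if nascent.shape != mature.shape:
        raise ValueError("nascent and mature must have the same shape")

    rng = np.random.default_rng(seed)
    # Generator.permuted shuffles every column independently along axis 0.
    shuffled_mature = rng.permuted(mature, axis=0)
    return nascent.copy(), shuffled_mature


if __name__ == "__main__":
    # Toy data only, used to show the call; not real scRNA-seq counts.
    toy_rng = np.random.default_rng(1)
    toy_nascent = toy_rng.poisson(2.0, size=(6, 3))
    toy_mature = toy_rng.poisson(5.0, size=(6, 3))
    _, toy_shuffled = shuffle_mature_counts(toy_nascent, toy_mature)
    print(toy_shuffled)
```

Each model's reconstruction error on the original and on the shuffled arrays could then be compared. A different shuffling scheme (for example, across genes within a cell) would need a different permutation axis.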
The second question was inspired by the authors' use of biVI to reveal "hidden regulation" of certain genes. Their analysis built on the assumption that a gene whose expression remains the same from one cell to another can do so in one of two ways: passively, by changing nothing, or actively, by adjusting RNA production and degradation rates so that abundance does not change. Since the generative element of biVI consists of a dynamical model parameterized by burst frequency (
In my analysis, I sought to quantify this hidden regulatory mechanism for the full transcriptome to determine how much biological activity can be made visible by biVI. To do this, I first identified all instances of a gene NOT being differentially expressed between two cell types in the held-out dataset, checked whether its burst size (
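The counting step can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's implementation. It assumes that differential-expression calls are already available as boolean arrays of shape (genes, cell-type pairs), one for expression and one per inferred biophysical parameter, and it treats an instance as "hidden regulation" when expression is unchanged but at least one parameter is called as changed. The function name and the parameter key `relative_degradation_rate` are hypothetical, and the exact criterion used in the analysis may differ.

```python
import numpy as np


def hidden_regulation_fraction(expr_is_de, param_is_de):
    """Count non-DE instances that still show a parameter change.

    expr_is_de  : bool array (genes, cell-type pairs), True where the gene
                  is differentially expressed for that pair.
    param_is_de : dict mapping a parameter name to a bool array of the same
                  shape, True where that parameter differs for that pair.
    Returns (n_hidden, n_unchanged, fraction).
    """
    expr_is_de = np.asarray(expr_is_de, dtype=bool)
    unchanged = ~expr_is_de

    # Assumption: a change in any one parameter is enough to count.
    any_param_changed = np.zeros(expr_is_de.shape, dtype=bool)
    for mask in param_is_de.values():
        any_param_changed |= np.asarray(mask, dtype=bool)

    hidden = unchanged & any_param_changed
    n_unchanged = int(unchanged.sum())
    n_hidden = int(hidden.sum())
    fraction = n_hidden / n_unchanged if n_unchanged else float("nan")
    return n_hidden, n_unchanged, fraction


if __name__ == "__main__":
    # Toy masks for 2 genes x 2 cell-type pairs, for illustration only.
    expr = [[False, True], [False, False]]
    params = {
        "burst_size": [[True, True], [False, False]],
        "relative_degradation_rate": [[False, False], [False, True]],
    }
    print(hidden_regulation_fraction(expr, params))
```

In the toy example, three instances are not differentially expressed and two of them show a parameter change, so the reported fraction is 2/3.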
More broadly, my project highlights the importance of embedding biological laws into machine learning models and of moving from a purely data-driven paradigm to one that is more knowledge-driven. Part 1 has implications for biological foundation models, as a model's ability to generalize out-of-distribution is closely tied to how biologically faithful its representations are. Part 2, in turn, points to the importance of designing deep models that prioritize biological interpretability, which both enhances scientific understanding and drives future analysis. In short, biologically grounded models will be crucial for AI to advance biology.
Carilli, M., Gorin, G., Choi, Y., Chari, T., & Pachter, L. (2024). Biophysical modeling with variational autoencoders for bimodal, single-cell RNA sequencing data. Nature Methods, 21(8), 1466–1469. https://doi.org/10.1038/s41592-024-02365-9
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I., & Yosef, N. (2018). Deep generative modeling for single-cell transcriptomics. Nature Methods, 15(12), 1053–1058. https://doi.org/10.1038/s41592-018-0229-2
biVI GitHub repository: https://github.com/pachterlab/CGCCP_2023