Bradley-Buchner/cs7170_project

Biophysically-grounded AI with biVI

Bradley Buchner

This repository contains the scripts and notebooks for my final project in the course CS7170 (AI for Complex Systems Modeling), which centers on an exploratory analysis of biVI: a generative model introduced by Carilli et al. (2024) as a biophysically-grounded alternative to scVI (Lopez et al., 2018) for analyzing scRNA-seq data. To run the analysis yourself, go to 'bivi_investigation.ipynb' and click "Run in Colab" at the top of the file.

While traditional VAE models for single-cell data analysis like scVI treat nascent ($N$) and mature ($M$) RNA counts as independent measurements, biVI leverages the causal link between the two by learning parameters of a joint bivariate distribution representing RNA lifecycle kinetics. By conditioning the number of mature ($M$) RNA on the number of nascent ($N$) RNA in the same cell, biVI improves upon scVI and ensures the generative process is grounded in known cellular dynamics.

Objective

My project builds on the authors' analysis of biVI by addressing two questions:

  1. Can biVI distinguish between real and biologically compromised data better than scVI?
  2. How much biological information does biVI capture that scVI misses, and can it be quantified?

By answering these, I aim to demonstrate the value of biVI's biophysical grounding in more depth.

Findings

Part 1

For the first question, I designed an experiment to evaluate each model's ability to reconstruct real and biologically compromised data. To create the compromised dataset, I randomly shuffled mature ($M$) RNA counts in the dataset held out for testing while keeping nascent ($N$) counts fixed, effectively breaking the causal link between the two species within each cell. Using the real held-out dataset as a positive control, I passed concatenated RNA count matrices ($N+M$) into a trained biVI or scVI model, assessed the reconstruction error for each cell, and compared each cell's difference in reconstruction error between the real and compromised data. As expected, my analysis showed that this difference was significantly larger for biVI, which consistently struggled to reconstruct RNA counts for cells in the compromised dataset while maintaining good performance on the real dataset. This finding validates biVI's reliance on RNA lifecycle dynamics and suggests that, as a generative model, the RNA counts it generates are more likely to be biologically plausible than those generated by scVI.
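
The shuffling step and the per-cell comparison can be sketched roughly as follows. This is an illustrative sketch, not the notebook's code: the function names are hypothetical, the shuffle is assumed to permute whole mature-count profiles across cells, mean squared error stands in for whatever reconstruction error the notebook uses, and `reconstruct` is a placeholder for a trained biVI or scVI model's reconstruction call.

```python
import numpy as np

def shuffle_mature(nascent, mature, seed=0):
    """Return a concatenated [N | M] matrix whose mature block is shuffled.

    Assumption: the shuffle permutes cells (rows) of the mature matrix,
    so each cell's nascent profile is paired with another cell's mature
    profile. Nascent counts are left untouched.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(mature.shape[0])
    return np.concatenate([nascent, mature[perm]], axis=1)

def per_cell_error_gap(reconstruct, real, compromised):
    """Per-cell reconstruction error on compromised data minus real data.

    `reconstruct` is a hypothetical callable mapping a (cells x 2*genes)
    matrix to its reconstruction. MSE is an assumed error metric.
    """
    err_real = np.mean((reconstruct(real) - real) ** 2, axis=1)
    err_comp = np.mean((reconstruct(compromised) - compromised) ** 2, axis=1)
    return err_comp - err_real
```

Under this setup, a larger gap for one model means it is more sensitive to the broken pairing between $N$ and $M$; comparing the gap distributions of the two models corresponds to the comparison described above.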

Part 2

The second question was inspired by the authors' use of biVI to reveal "hidden regulation" of certain genes. Their analysis built on the assumption that genes whose expression remains the same from one cell to another do so in two ways: passively, by changing nothing; or actively, by changing RNA production and degradation rates to prevent a change in abundance. Since the generative element of biVI consists of a dynamical model parameterized by burst frequency ($k$), burst size ($b$), relative splicing rate ($\beta/k$), and degradation rate ($\gamma/k$), it allowed the authors to highlight special cases where genes that do NOT differentially express between cells actually compensate with a significant change in kinetic rate parameters.

In my analysis, I sought to quantify this hidden regulatory mechanism for the full transcriptome to determine how much biological activity can be made visible by biVI. To do this, I first identified all instances of a gene NOT being differentially expressed between two cell types in the held-out dataset, checked whether its burst size ($b$) and relative degradation rate ($\gamma/k$) differed substantially between types (assuming the other parameters are not cell-type dependent), and calculated the percentage of instances where this was true. Using the standard log-fold change threshold of 0.5 to define substantial differences, my analysis found that, on average, kinetic compensation was present in ~25% of these instances. While the authors showed that scVI and biVI have comparable performance when it comes to reconstructing scRNA-seq data, this finding quantitatively clarifies their difference and further justifies biVI's nuanced design.
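
A minimal sketch of the percentage calculation is shown below, for a single pair of cell types. It is not the notebook's code: the function name and inputs are hypothetical, the differential-expression calls are taken as a precomputed boolean mask, and compensation is assumed to require that both $b$ and $\gamma/k$ exceed the threshold, which is one reading of the description above.

```python
import numpy as np

def compensation_fraction(is_de, lfc_burst, lfc_degradation, threshold=0.5):
    """Fraction of non-DE genes whose kinetic parameters both shift.

    is_de           : boolean array, True where a gene is differentially
                      expressed between the two cell types
    lfc_burst       : log-fold change in burst size b per gene
    lfc_degradation : log-fold change in relative degradation rate per gene
    threshold       : absolute log-fold change counted as substantial
    """
    not_de = ~np.asarray(is_de, dtype=bool)
    if not not_de.any():
        return float("nan")  # no non-DE genes to evaluate
    changed = (np.abs(np.asarray(lfc_burst)) > threshold) & (
        np.abs(np.asarray(lfc_degradation)) > threshold
    )
    return float(np.mean(changed[not_de]))
```

Averaging this fraction over all cell-type pairs would give a transcriptome-wide figure comparable to the ~25% reported above, under the stated assumptions.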

Takeaways

More broadly, my project highlights the importance of embedding biological laws into machine learning models and of moving from a purely data-driven paradigm to one that is more knowledge-driven. Part 1 has implications for biological foundation models, as a model's ability to generalize out-of-distribution is closely tied to how biologically faithful its representations are. Part 2, in turn, points to the importance of designing deep models that prioritize biological interpretability, which both enhances scientific understanding and drives future analysis. In short, biologically-grounded models will be crucial for AI to advance biology.

Citations and Links

Carilli, M., Gorin, G., Choi, Y., Chari, T., & Pachter, L. (2024). Biophysical modeling with variational autoencoders for bimodal, single-cell RNA sequencing data. Nature Methods, 21(8), 1466–1469. https://doi.org/10.1038/s41592-024-02365-9

Lopez, R., Regier, J., Cole, M. B., Jordan, M. I., & Yosef, N. (2018). Deep generative modeling for single-cell transcriptomics. Nature Methods, 15(12), 1053–1058. https://doi.org/10.1038/s41592-018-0229-2

biVI GitHub repository: https://github.com/pachterlab/CGCCP_2023
