This repository contains the scripts and notebooks for SynergyGT, a model that serves as a computational tool for the Levine Lab's research on the genetics of aging in the roundworm C. elegans. Although individual genes associated with aging and longevity have been well studied, aging is a complex, emergent phenotype driven by nonlinear genetic interactions. Synergistic gene interactions offer a view into this complexity, and using deep learning to reveal them in the genetic interaction network can help explain how aging emerges. SynergyGT combines knowledge of gene-aging associations with a network of mechanistic gene-gene interactions to learn the features that distinguish known synergistic gene pairs from non-synergistic ones. With a design inspired by biology and modern LLMs, SynergyGT predicts synergistic interactions significantly better than naive baselines. In practice, the model can be used to estimate the likelihood of synergy between any pair of genes, which could lead to the discovery of novel synergistic interactions and a better understanding of the genetic landscape of aging.
To use SynergyGT, start with the two Jupyter notebooks in the /notebooks directory, both of which can be run in Google Colab:
- model_demo.ipynb walks you through the process of building and training a SynergyGT model and evaluating its performance.
- model_exploration.ipynb lets you interact with a trained SynergyGT model and test it on any gene or gene pair of interest.
Click the "Run in Colab" button at the top of these notebooks to run them yourself.
This section outlines the conceptual blueprint of the model, specifying the information it consumes, how that information is represented, what the model is trained to predict, and the assumptions under which it learns.
The model integrates three primary data sources:
1. Genetic interaction network (from WormBase)
A directed, heterogeneous interaction network representing known molecular and genetic relationships in C. elegans.
- 11,493 nodes (genes/proteins) and 90,364 edges
- 3 types of edges (interactions): genetic, physical, and regulatory
- Edges are directed to reflect causal relationships where applicable; non-causal interactions are represented by bidirectional edges
2. Gene-lifespan phenotype associations (from Gene Ontology)
Functional annotations linking genes to biological processes and molecular functions specifically associated with aging and longevity.
- Incorporates higher-level biological context through GO terms
- Provides a layer of functional information that complements the topology of the interaction network
3. Double mutant lifespan assays (from SynergyAge)
A curated collection of lifespan measurements for combinatorial genetic interventions in C. elegans.
- 1,458 double mutant experiments, 801 unique double mutants (i.e., gene perturbation pairs)
- Each experiment is categorized as resulting in an antagonistic, additive, or synergistic effect on lifespan
Together, these sources allow the model to leverage biological knowledge alongside experimental outcomes to uncover the hidden regulatory logic of C. elegans aging.
Each gene pair is represented as a localized subgraph from the global interaction network.
Pair subgraphs
- Node set: The union of the one-hop neighborhoods of both perturbed genes
- Edge set: All edges are retained from the global network, preserving directionality and interaction type
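The subgraph construction above can be sketched in a few lines. This is a minimal illustration, not the repository's implementation: the function name `pair_subgraph`, the edge-triple format, and the toy gene labels are hypothetical. It also assumes that a one-hop neighborhood includes both in- and out-neighbors, and that "all edges" means every global edge whose endpoints both lie in the node set.

```python
from collections import defaultdict

def pair_subgraph(edges, gene_a, gene_b):
    """Return (nodes, sub_edges) for the subgraph centered on a gene pair.

    edges: (source, target, interaction_type) triples from the global network.
    Assumption: a one-hop neighborhood includes in- and out-neighbors.
    """
    edges = list(edges)
    neighbors = defaultdict(set)
    for src, dst, _ in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    # Union of the two one-hop neighborhoods, including the genes themselves.
    nodes = {gene_a, gene_b} | neighbors[gene_a] | neighbors[gene_b]

    # Keep every global edge with both endpoints in the node set,
    # leaving direction and interaction type untouched.
    sub_edges = [(s, d, t) for s, d, t in edges if s in nodes and d in nodes]
    return nodes, sub_edges


# Toy network with made-up gene labels.
toy_edges = [
    ("g1", "g2", "regulatory"),
    ("g2", "g3", "physical"),
    ("g3", "g2", "physical"),
    ("g4", "g5", "genetic"),
    ("g5", "g6", "genetic"),
]
nodes, sub_edges = pair_subgraph(toy_edges, "g1", "g4")
```

Here `g3` and `g6` are two hops from the perturbed genes, so they and their edges are left out of the subgraph.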
Node-level features
Each node within a subgraph is annotated with biologically and topologically motivated attributes:
- In-degree and out-degree
- Proximity (hop distance) to each perturbed gene
- Proximity (hop distance) to the nearest aging-associated gene (zero for aging-associated genes)
- Perturbation status and perturbation type (knockdown, knockout, or overexpression)
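As a rough sketch of how the degree and proximity attributes could be computed, the snippet below uses breadth-first search over a subgraph's edge list. The helper names and the toy data are hypothetical. Two choices here are assumptions rather than details taken from the model: degrees are counted within the subgraph, and hop distance ignores edge direction. Perturbation status and type are left out because they come directly from the experiment record.

```python
from collections import defaultdict, deque

def hop_distances(edges, sources):
    """Multi-source BFS over the undirected view of the edges (an assumption)."""
    adj = defaultdict(set)
    for src, dst, _ in edges:
        adj[src].add(dst)
        adj[dst].add(src)
    dist = {s: 0 for s in sources}
    queue = deque(dist)
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist  # unreachable nodes are simply absent

def node_features(edges, gene_a, gene_b, aging_genes):
    in_deg, out_deg = defaultdict(int), defaultdict(int)
    nodes = set()
    for src, dst, _ in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
        nodes.update((src, dst))
    d_a = hop_distances(edges, [gene_a])
    d_b = hop_distances(edges, [gene_b])
    d_age = hop_distances(edges, [g for g in aging_genes if g in nodes])
    return {
        n: {
            "in_degree": in_deg[n],
            "out_degree": out_deg[n],
            "dist_a": d_a.get(n),        # None if unreachable
            "dist_b": d_b.get(n),
            "dist_aging": d_age.get(n),  # 0 for aging-associated genes
        }
        for n in nodes
    }


# Toy subgraph: a -> x -> b -> y, where y is aging-associated.
edges = [("a", "x", "regulatory"), ("x", "b", "physical"), ("b", "y", "genetic")]
features = node_features(edges, "a", "b", aging_genes={"y"})
```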
This representation encodes local network structure while also injecting relevant biological context for the prediction task.
The model maps pair-centered subgraphs to predicted interaction outcomes using a two-stage architecture: a subgraph encoder followed by a classification head.
Subgraph encoder
A graph transformer is used to encode each subgraph into a fixed-dimensional vector representation.
- Nodes are treated as tokens (analogous to how LLMs treat words as tokens), with graph structure incorporated directly into the attention mechanism.
- Each node attribute (e.g., degree, aging proximity, perturbation status) is embedded into a small fixed-dimensional vector; these embeddings are learned and summed to produce a single representation for each node.
- A synthetic [CLS] node added to each subgraph learns to aggregate information from all other nodes to form another small fixed-dimensional summary representation of the subgraph.
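The "embed each attribute, then sum" step can be illustrated without a deep learning framework. In the sketch below the table sizes, the embedding dimension, the attribute names, and the clipping of out-of-range values are all placeholders. In a trained model these tables would be learned parameters rather than fixed random values.

```python
import random

EMBED_DIM = 4  # hypothetical size

def make_table(num_values, rng):
    """One lookup table: a row of EMBED_DIM weights per discrete attribute value."""
    return [[rng.gauss(0.0, 0.02) for _ in range(EMBED_DIM)] for _ in range(num_values)]

rng = random.Random(0)
tables = {
    "in_degree": make_table(8, rng),
    "dist_aging": make_table(4, rng),
    "perturbation": make_table(4, rng),
}

def embed_node(attrs):
    """Look up one row per attribute and sum the rows into a single node vector."""
    vec = [0.0] * EMBED_DIM
    for name, value in attrs.items():
        table = tables[name]
        row = table[min(value, len(table) - 1)]  # clip large values to the last bucket
        vec = [v + r for v, r in zip(vec, row)]
    return vec

node_vec = embed_node({"in_degree": 3, "dist_aging": 1, "perturbation": 2})
```

Summing keeps the node representation at a fixed width no matter how many attributes are used, which is the same trick transformers use to combine token and position embeddings.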
The calculation of attention between nodes is modified to capture:
- Causality: Nodes are constrained to attend only to their descendants in the directed graph.
- Proximity: Learnable biases based on hop distance between nodes.
- Interaction semantics: Separate value projections are learned for each interaction type (genetic, physical, regulatory).
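The first two modifications can be sketched as a masked softmax with an additive distance bias. This is an illustration under stated assumptions, not the model's code: the function names are hypothetical, a node is assumed to be allowed to attend to itself (distance 0), hop distance is measured along directed paths, and the bias values are fixed here rather than learned. The per-interaction-type value projections are omitted for brevity.

```python
import math
from collections import deque

def directed_hops(nodes, edges):
    """hops[u][v] = shortest directed path length from u to its descendant v."""
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    hops = {}
    for start in nodes:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in out[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        hops[start] = dist
    return hops

def attention_weights(nodes, edges, raw_scores, hop_bias):
    """Softmax over descendants only, with a per-distance bias added to each score."""
    hops = directed_hops(nodes, edges)
    weights = {}
    for q in nodes:
        # Non-descendants never enter the softmax, which is equivalent to
        # masking their scores with -infinity.
        logits = {
            k: raw_scores[q][k] + hop_bias[min(d, len(hop_bias) - 1)]
            for k, d in hops[q].items()
        }
        top = max(logits.values())
        exps = {k: math.exp(v - top) for k, v in logits.items()}
        total = sum(exps.values())
        weights[q] = {k: e / total for k, e in exps.items()}
    return weights


nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c")]                     # a -> b -> c
raw = {q: {k: 0.0 for k in nodes} for q in nodes}    # uniform query-key scores
weights = attention_weights(nodes, edges, raw, hop_bias=[0.0, -1.0, -2.0])
```

With uniform raw scores, `a` spreads its attention over `a`, `b`, and `c` with weights that fall off with distance, while `c` has no descendants and attends only to itself.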
This design imposes biologically motivated inductive biases while allowing the model to learn how information relevant to aging should flow through local neighborhoods and be aggregated.
Classification head
A multilayer perceptron with one hidden layer takes the [CLS] node's subgraph representation as input and outputs a 3-dimensional probability vector corresponding to antagonistic, additive, and synergistic interaction likelihoods.
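For readers who want to see the shape of such a head, here is a minimal forward pass in plain Python. The layer sizes, the ReLU activation, and the random weights are assumptions for illustration only. The source specifies just one hidden layer and a 3-way probability output.

```python
import math
import random

def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(x, weight, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def classify(cls_vec, params):
    # Hidden layer with ReLU (the activation is an assumption).
    hidden = [max(0.0, h) for h in linear(cls_vec, params["w1"], params["b1"])]
    # Output order: antagonistic, additive, synergistic.
    return softmax(linear(hidden, params["w2"], params["b2"]))

rng = random.Random(0)
IN_DIM, HIDDEN_DIM, NUM_CLASSES = 4, 8, 3  # placeholder sizes
params = {
    "w1": [[rng.gauss(0, 0.1) for _ in range(IN_DIM)] for _ in range(HIDDEN_DIM)],
    "b1": [0.0] * HIDDEN_DIM,
    "w2": [[rng.gauss(0, 0.1) for _ in range(HIDDEN_DIM)] for _ in range(NUM_CLASSES)],
    "b2": [0.0] * NUM_CLASSES,
}
probs = classify([0.5, -0.2, 0.1, 0.9], params)
```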
The model is trained to minimize the Kullback–Leibler (KL) divergence between predicted and observed relative interaction-type frequencies for each gene pair. This formulation treats relative interaction-type frequencies as soft classification labels, and is an example of label distribution learning.
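As a concrete reference, the per-pair loss can be written as below. The direction KL(observed || predicted) is an assumption, though it is the usual choice when the observed frequencies act as soft labels. The function name and the `eps` guard are illustrative.

```python
import math

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions over the same classes.

    Terms with target probability 0 contribute nothing; eps guards against log(0).
    """
    return sum(
        t * math.log(t / max(p, eps))
        for t, p in zip(target, predicted)
        if t > 0
    )

observed = [0.5, 0.25, 0.25]    # antagonistic, additive, synergistic
uniform = [1 / 3, 1 / 3, 1 / 3]

loss_perfect = kl_divergence(observed, observed)
loss_uniform = kl_divergence(observed, uniform)
```

The loss is zero when the prediction matches the observed distribution exactly and grows as the two diverge, so a pair with mixed experimental outcomes is not forced into a single hard class.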
Note: The number of recorded experiments varies across unique double mutants, and with it the strength of evidence for each pair. To account for this, observed interaction-type counts were smoothed via Bayesian smoothing with a maximally ignorant prior: a pseudocount of 1 for each interaction type, i.e., all types are assumed equally likely a priori.
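The smoothing step amounts to add-one (Laplace) smoothing of the per-pair counts. A minimal sketch follows, with a hypothetical function name and made-up counts.

```python
def smooth_counts(counts, pseudocount=1.0):
    """Turn raw interaction-type counts into smoothed relative frequencies.

    counts: (antagonistic, additive, synergistic) experiment counts for one pair.
    Each type receives `pseudocount` extra observations before normalizing.
    """
    total = sum(counts) + pseudocount * len(counts)
    return [(c + pseudocount) / total for c in counts]

one_experiment = smooth_counts([0, 0, 1])    # a single synergistic result
four_experiments = smooth_counts([0, 1, 3])  # more evidence, sharper target
```

A pair with a single synergistic experiment gets the target (0.25, 0.25, 0.5) rather than (0, 0, 1), while a pair with counts (0, 1, 3) gets (1/7, 2/7, 4/7). Pairs backed by more experiments therefore end up with targets closer to their raw frequencies.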
For any pair of gene perturbations, the model outputs predicted relative frequencies for antagonistic, additive, and synergistic lifespan effects. These predictions can be interpreted as a probability distribution over expected genetic interaction effects, and can be used to prioritize candidate gene pairs for experimental validation.
This project was developed in the Levine Lab for Systems Bioengineering at Northeastern University. For questions regarding the model or its repository, please open an issue or reach out via email or LinkedIn.
