Status: Proposed
Date: 2026-02-11
Authors: ruv.io, RuVector Team
Deciders: Architecture Review Board
Target Crates: ruvector-gnn, ruvector-graph
| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | 2026-02-11 | ruv.io | Initial practical implementation proposal |
Protein structure prediction and interaction analysis are fundamental to drug discovery, variant effect prediction, and understanding disease mechanisms. Graph neural networks naturally represent protein structures at multiple scales, from atomic interactions to protein-protein interaction networks.
State-of-the-art approaches include:
- ESMFold: Meta's protein structure prediction using protein language models, achieving AlphaFold2-competitive accuracy without MSAs
- AlphaFold2 Evoformer: Iterative attention over MSAs and pairwise representations, O(N²) complexity
- ProteinMPNN: Message passing for inverse protein design, generates sequences matching target structures
- GearNet: Geometry-aware relational graph neural network for protein representation learning
RuVector's existing ruvector-gnn crate provides the foundational primitives for building protein graph models:
// Core layers available today
pub struct Linear { fn new(input_dim, output_dim), fn forward(&[f32]) -> Vec<f32> }
pub struct LayerNorm { fn new(dim, eps), fn forward(&[f32]) -> Vec<f32> }
pub struct MultiHeadAttention { fn new(embed_dim, num_heads), fn forward(query, keys, values) -> Vec<f32> }
pub struct GRUCell { fn new(input_dim, hidden_dim), fn forward(input, hidden) -> Vec<f32> }
pub struct RuvectorLayer { fn new(input_dim, hidden_dim, heads, dropout), fn forward(...) }
pub struct Tensor { fn new(Vec<f32>, Vec<usize>), fn matmul(), fn dot() }
pub struct Optimizer { fn new(OptimizerType), fn step(params, grads) }
// Loss functions
fn info_nce_loss(query, positive, negatives) -> f32
fn local_contrastive_loss(embeddings, labels) -> f32

We will build a ProteinGraphEngine that:
- Represents protein contact graphs using `ruvector-graph` for storage and query
- Implements residue-level message passing via `RuvectorLayer` for contact prediction
- Applies GNN-based approaches to protein-protein interaction (PPI) prediction
- Integrates with the genomic attention layers (ADR-001 through ADR-004) for variant effect analysis
What works today: GNN message passing layers, graph storage, HNSW indexing.
What needs building: SE(3) equivariant layers, protein-specific feature encoders, specialized architectures.
Goal: Predict residue-residue contacts from sequence, enabling structure prediction.
Graph representation:
G_contact = (V, E, X_v, X_e)
V = {r_1, r_2, ..., r_N} -- one node per residue
E = {(r_i, r_j) : predicted contact or known from structure}
X_v in R^{N x d_v} where d_v = 41:
- Amino acid type (20-dim one-hot)
- Secondary structure (3-dim: helix, strand, coil)
- Relative position i/N (1-dim)
- Conservation score (1-dim)
- MSA-derived features (16-dim)
X_e in R^{|E| x d_e} where d_e = 7:
- Sequence separation |i-j|/N (1-dim)
- Co-evolution score (1-dim)
- Distance encoding (5-dim RBF basis)
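The 5-dim RBF distance encoding can be made concrete. A minimal sketch, assuming Gaussian bins with illustrative centers at 2-18 Å and a shared width (the actual bin placement and width would be tuning choices, not fixed by this ADR):

```rust
/// Illustrative 5-dim radial basis function (RBF) encoding of a residue-pair
/// distance. Centers and width are assumptions for this sketch.
fn rbf_encode(distance: f32) -> [f32; 5] {
    let centers = [2.0_f32, 6.0, 10.0, 14.0, 18.0]; // Å, assumed bin centers
    let sigma = 2.0_f32; // shared Gaussian width, assumed
    let mut out = [0.0_f32; 5];
    for (k, c) in centers.iter().enumerate() {
        let d = (distance - c) / sigma;
        out[k] = (-0.5 * d * d).exp(); // peaks at 1.0 when distance == center
    }
    out
}

fn main() {
    // A 6 Å contact activates the second bin most strongly.
    println!("{:?}", rbf_encode(6.0));
}
```

Each feature lies in (0, 1], so the encoding slots directly into the d_e = 7 edge-feature vector alongside the scalar sequence-separation and co-evolution terms.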
ruvector-graph storage:
use ruvector_graph::{GraphDB, NodeBuilder, EdgeBuilder};
pub struct ProteinContactGraph {
db: GraphDB,
protein_id: String,
}
impl ProteinContactGraph {
pub fn from_sequence(sequence: &str, msa: Option<&MultipleAlignment>) -> Self {
let mut db = GraphDB::new();
let n = sequence.len();
// Add residue nodes
for (i, aa) in sequence.chars().enumerate() {
let features = encode_residue_features(aa, i, n, msa);
db.add_node(NodeBuilder::new()
.with_label("Residue")
.with_property("index", i)
.with_property("amino_acid", aa.to_string())
.with_property("features", features)
.build());
}
// Add predicted contact edges (from GNN or co-evolution)
let contact_probs = predict_contacts(&db, sequence);
for (i, j, prob) in contact_probs {
if prob > 0.5 { // Threshold
db.add_edge(EdgeBuilder::new()
.from(i).to(j)
.with_label("Contact")
.with_property("probability", prob)
.with_property("seq_sep", ((j - i) as f32 / n as f32))
.build());
}
}
Self { db, protein_id: format!("protein_{}", uuid::Uuid::new_v4()) }
}
}
fn encode_residue_features(aa: char, pos: usize, len: usize, msa: Option<&MultipleAlignment>) -> Vec<f32> {
let mut features = vec![0.0; 41];
// One-hot amino acid (20-dim)
let aa_idx = AA_TO_INDEX[&aa];
features[aa_idx] = 1.0;
// Normalized position
features[20] = pos as f32 / len as f32;
// Conservation (from MSA if available)
features[21] = msa.map(|m| m.conservation_at(pos)).unwrap_or(0.5);
// MSA-derived coevolution features (16-dim)
if let Some(m) = msa {
let coevo = m.coevolution_features(pos);
features[22..38].copy_from_slice(&coevo);
}
// Remaining features: secondary structure prediction, etc.
features
}

Task: Predict contact probability for all residue pairs.
Network architecture:
use ruvector_gnn::layer::{RuvectorLayer, Linear, LayerNorm, MultiHeadAttention};
use ruvector_gnn::optimizer::{Optimizer, OptimizerType};
pub struct ContactPredictor {
layers: Vec<RuvectorLayer>,
edge_predictor: Linear,
norm: LayerNorm,
hidden_dim: usize,
}
impl ContactPredictor {
pub fn new(input_dim: usize, hidden_dim: usize, num_layers: usize, num_heads: usize) -> Self {
let mut layers = Vec::new();
// First layer: input_dim -> hidden_dim
layers.push(RuvectorLayer::new(input_dim, hidden_dim, num_heads, 0.1));
// Hidden layers: hidden_dim -> hidden_dim
for _ in 1..num_layers {
layers.push(RuvectorLayer::new(hidden_dim, hidden_dim, num_heads, 0.1));
}
Self {
layers,
edge_predictor: Linear::new(hidden_dim * 2, 1), // Predict contact from pair
norm: LayerNorm::new(hidden_dim, 1e-5),
hidden_dim,
}
}
pub fn forward(
&self,
node_features: &[Vec<f32>],
edge_index: &[(usize, usize)],
edge_weights: &[f32],
) -> Vec<Vec<f32>> {
let mut h = node_features.to_vec();
// Message passing layers
for layer in &self.layers {
h = self.apply_layer(layer, &h, edge_index, edge_weights);
}
// Normalize final embeddings
h.iter().map(|emb| self.norm.forward(emb)).collect()
}
fn apply_layer(
&self,
layer: &RuvectorLayer,
node_features: &[Vec<f32>],
edge_index: &[(usize, usize)],
edge_weights: &[f32],
) -> Vec<Vec<f32>> {
let n = node_features.len();
let mut outputs = Vec::with_capacity(n);
for i in 0..n {
// Gather neighbors
let neighbors: Vec<_> = edge_index.iter()
.enumerate()
.filter(|(_, (src, _))| *src == i)
.map(|(idx, (_, dst))| (*dst, edge_weights[idx]))
.collect();
if neighbors.is_empty() {
outputs.push(node_features[i].clone());
continue;
}
let neighbor_features: Vec<_> = neighbors.iter()
.map(|(j, _)| node_features[*j].clone())
.collect();
let weights: Vec<f32> = neighbors.iter().map(|(_, w)| *w).collect();
// RuvectorLayer aggregates neighbors with attention
let h_i = layer.forward(&node_features[i], &neighbor_features, &weights);
outputs.push(h_i);
}
outputs
}
pub fn predict_contacts(&self, embeddings: &[Vec<f32>]) -> Vec<(usize, usize, f32)> {
let mut contacts = Vec::new();
let n = embeddings.len();
for i in 0..n {
for j in (i + 5)..n { // Only pairs with sequence separation >= 5 (skip local contacts)
// Concatenate pair embeddings
let mut pair_emb = embeddings[i].clone();
pair_emb.extend_from_slice(&embeddings[j]);
// Predict contact probability
let logit = self.edge_predictor.forward(&pair_emb)[0];
let prob = 1.0 / (1.0 + (-logit).exp()); // Sigmoid
if prob > 0.01 { // Prune near-zero predictions to keep the output sparse
contacts.push((i, j, prob));
}
}
}
contacts
}
}
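`RuvectorLayer::forward` is treated as a black box in `apply_layer` above. As a rough std-only illustration of attention-based neighbor aggregation, and not the actual `RuvectorLayer` internals: score each neighbor against the query node by dot product, modulate by the edge weight, softmax the scores, and take the weighted sum of neighbor features.

```rust
/// Simplified attention-style aggregation (illustrative stand-in only):
/// dot-product scores scaled by edge weights, softmax, weighted sum.
fn attention_aggregate(query: &[f32], neighbors: &[Vec<f32>], edge_weights: &[f32]) -> Vec<f32> {
    assert_eq!(neighbors.len(), edge_weights.len());
    if neighbors.is_empty() {
        return query.to_vec(); // no neighbors: pass the node through unchanged
    }
    // Score each neighbor by dot product with the query, modulated by edge weight.
    let scores: Vec<f32> = neighbors.iter().zip(edge_weights)
        .map(|(n, w)| w * query.iter().zip(n).map(|(a, b)| a * b).sum::<f32>())
        .collect();
    // Numerically stable softmax over the scores.
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let z: f32 = exps.iter().sum();
    // Attention-weighted sum of neighbor features.
    let mut out = vec![0.0; query.len()];
    for (n, e) in neighbors.iter().zip(&exps) {
        for d in 0..out.len() {
            out[d] += (e / z) * n[d];
        }
    }
    out
}

fn main() {
    let q = vec![1.0, 0.0];
    let neigh = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    // The output leans toward the neighbor aligned with the query.
    println!("{:?}", attention_aggregate(&q, &neigh, &[1.0, 1.0]));
}
```

The real layer adds learned projections, multiple heads, and a GRU update on top of this aggregation step.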
// Training loop
pub fn train_contact_predictor(
model: &mut ContactPredictor,
train_proteins: &[Protein],
num_epochs: usize,
) -> Result<()> {
let mut optimizer = Optimizer::new(OptimizerType::Adam { lr: 0.001, beta1: 0.9, beta2: 0.999 });
for epoch in 0..num_epochs {
let mut total_loss = 0.0;
for protein in train_proteins {
// Get node features, edges, ground truth contacts
let node_features = protein.residue_features();
let edge_index = protein.sequence_edges(); // Sequential + MSA-based
let edge_weights = vec![1.0; edge_index.len()];
// Forward pass
let embeddings = model.forward(&node_features, &edge_index, &edge_weights);
let predicted = model.predict_contacts(&embeddings);
// Compute loss (binary cross-entropy on contacts)
let ground_truth = protein.contact_map(); // From known structure
let loss = bce_loss(&predicted, &ground_truth);
// Backward pass (gradients computed manually or via autograd)
// ... gradient computation ...
// Optimizer step
// optimizer.step(&mut model.parameters(), &gradients);
total_loss += loss;
}
println!("Epoch {}: Loss = {:.4}", epoch, total_loss / train_proteins.len() as f32);
}
Ok(())
}

Goal: Predict whether two proteins interact based on sequence, structure, and network topology.
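The training loop above relies on a `bce_loss` helper that is not part of the existing API. A self-contained sketch, assuming predictions arrive as sparse `(i, j, prob)` triples and the ground-truth contact map is a set of contacting pairs (this scores only the predicted pairs; a full implementation would also sample negative pairs):

```rust
use std::collections::HashSet;

/// Binary cross-entropy over sparse contact predictions. Sketch only --
/// the real `bce_loss` signature is not specified in this ADR.
fn bce_loss(predicted: &[(usize, usize, f32)], ground_truth: &HashSet<(usize, usize)>) -> f32 {
    let eps = 1e-7_f32;
    let mut loss = 0.0;
    for &(i, j, p) in predicted {
        let p = p.clamp(eps, 1.0 - eps); // guard against ln(0)
        let y = if ground_truth.contains(&(i, j)) { 1.0 } else { 0.0 };
        loss += -(y * p.ln() + (1.0 - y) * (1.0 - p).ln());
    }
    loss / predicted.len().max(1) as f32
}

fn main() {
    let truth: HashSet<(usize, usize)> = [(0, 7)].into_iter().collect();
    // A confident correct prediction costs far less than a confident wrong one.
    let good = bce_loss(&[(0, 7, 0.95)], &truth);
    let bad = bce_loss(&[(0, 9, 0.95)], &truth);
    println!("good={:.4} bad={:.4}", good, bad);
}
```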
Graph representation:
G_PPI = (V_protein, E_interact, X_protein)
V_protein = {p_1, ..., p_K} -- K proteins in the interactome
X_protein in R^{K x d} -- Protein feature vectors (d=256)
Features per protein:
- ESM-2 sequence embedding (128-dim)
- Gene Ontology terms (64-dim binary)
- Subcellular localization (12-dim one-hot)
- Expression profile (16-dim from GTEx)
- Domain composition (36-dim Pfam fingerprint)
Implementation:
pub struct PPIPredictor {
encoder: RuvectorLayer, // Encode protein features
gnn_layers: Vec<RuvectorLayer>, // Message passing over PPI graph
link_predictor: Linear, // Predict interaction from pair embedding
}
impl PPIPredictor {
pub fn new(input_dim: usize, hidden_dim: usize, num_layers: usize) -> Self {
let encoder = RuvectorLayer::new(input_dim, hidden_dim, 8, 0.1);
let mut gnn_layers = Vec::new();
for _ in 0..num_layers {
gnn_layers.push(RuvectorLayer::new(hidden_dim, hidden_dim, 8, 0.1));
}
let link_predictor = Linear::new(hidden_dim * 3, 1); // Concat + Hadamard
Self { encoder, gnn_layers, link_predictor }
}
pub fn predict_interaction(&self, protein_i: &[f32], protein_j: &[f32], graph: &PPIGraph) -> f32 {
// Encode proteins
let h_i = self.encoder.forward(protein_i, &[], &[]);
let h_j = self.encoder.forward(protein_j, &[], &[]);
// Message passing (aggregate neighbor information)
let h_i_agg = self.aggregate_neighbors(&h_i, graph.neighbors_of(protein_i));
let h_j_agg = self.aggregate_neighbors(&h_j, graph.neighbors_of(protein_j));
// Link prediction: [h_i || h_j || h_i ⊙ h_j]
let mut pair_emb = h_i_agg.clone();
pair_emb.extend_from_slice(&h_j_agg);
let hadamard: Vec<f32> = h_i_agg.iter().zip(&h_j_agg).map(|(a, b)| a * b).collect();
pair_emb.extend_from_slice(&hadamard);
let logit = self.link_predictor.forward(&pair_emb)[0];
1.0 / (1.0 + (-logit).exp()) // Sigmoid
}
fn aggregate_neighbors(&self, embedding: &[f32], neighbors: &[Vec<f32>]) -> Vec<f32> {
if neighbors.is_empty() {
return embedding.to_vec();
}
let weights = vec![1.0; neighbors.len()];
let mut h = embedding.to_vec();
for layer in &self.gnn_layers {
h = layer.forward(&h, neighbors, &weights);
}
h
}
}

Goal: Connect variant effects to protein structure changes and interaction disruption.
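The pair representation used by `predict_interaction` above, [h_i || h_j || h_i ⊙ h_j], can be illustrated standalone. The weight vector below is a placeholder for the learned `Linear` parameters:

```rust
/// Build the link-prediction input [h_i || h_j || h_i ⊙ h_j].
fn pair_features(h_i: &[f32], h_j: &[f32]) -> Vec<f32> {
    let mut out = Vec::with_capacity(h_i.len() * 3);
    out.extend_from_slice(h_i);                          // h_i
    out.extend_from_slice(h_j);                          // h_j
    out.extend(h_i.iter().zip(h_j).map(|(a, b)| a * b)); // Hadamard product
    out
}

/// Logistic interaction score from a weight vector (placeholder for the
/// learned Linear layer in PPIPredictor).
fn link_score(features: &[f32], weights: &[f32]) -> f32 {
    let logit: f32 = features.iter().zip(weights).map(|(f, w)| f * w).sum();
    1.0 / (1.0 + (-logit).exp()) // sigmoid
}

fn main() {
    let (h_i, h_j) = (vec![0.5, -1.0], vec![0.5, 1.0]);
    let feats = pair_features(&h_i, &h_j);
    println!("{:?}", feats); // [0.5, -1.0, 0.5, 1.0, 0.25, -1.0]
    let w = vec![0.1; feats.len()]; // placeholder weights, learned in practice
    println!("score = {:.3}", link_score(&feats, &w));
}
```

The Hadamard term lets the predictor detect dimension-wise agreement between the two embeddings, which plain concatenation cannot express with a single linear layer.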
Pipeline:
pub struct VariantToProteinPipeline {
contact_model: ContactPredictor,
ppi_model: PPIPredictor,
}
impl VariantToProteinPipeline {
/// Predict how a missense variant affects protein structure
pub fn predict_structural_impact(&self, gene: &str, variant: &Variant) -> StructuralImpact {
// 1. Get protein sequence and apply variant
let wt_seq = get_protein_sequence(gene);
// Rust strings cannot be indexed by position; apply the substitution via a char buffer
let mut mt_chars: Vec<char> = wt_seq.chars().collect();
mt_chars[variant.position] = variant.alt_aa;
let mt_seq: String = mt_chars.into_iter().collect();
// 2. Predict contact maps for WT and mutant
let wt_graph = ProteinContactGraph::from_sequence(&wt_seq, None);
let mt_graph = ProteinContactGraph::from_sequence(&mt_seq, None);
let wt_contacts = self.contact_model.predict_contacts(&wt_graph.embeddings());
let mt_contacts = self.contact_model.predict_contacts(&mt_graph.embeddings());
// 3. Compare contact maps
let contact_change = compute_contact_difference(&wt_contacts, &mt_contacts);
StructuralImpact {
contact_disruption: contact_change,
predicted_pathogenicity: if contact_change > 0.3 { "Pathogenic" } else { "Benign" },
}
}
/// Predict how a variant affects protein-protein interactions
pub fn predict_interaction_impact(&self, gene: &str, variant: &Variant, interactors: &[String], ppi_graph: &PPIGraph) -> Vec<InteractionChange> {
let mut changes = Vec::new();
let wt_features = get_protein_features(gene);
let mut mt_features = wt_features.clone();
apply_variant_to_features(&mut mt_features, variant);
for interactor in interactors {
let partner_features = get_protein_features(interactor);
let wt_score = self.ppi_model.predict_interaction(&wt_features, &partner_features, ppi_graph);
let mt_score = self.ppi_model.predict_interaction(&mt_features, &partner_features, ppi_graph);
changes.push(InteractionChange {
partner: interactor.clone(),
wt_score,
mt_score,
delta: mt_score - wt_score,
});
}
changes
}
}

- GNN message passing: `RuvectorLayer` with multi-head attention and GRU updates
- Graph storage: `ruvector-graph::GraphDB` for protein graphs
- Training infrastructure: `Optimizer` with Adam, loss functions
- Linear transformations: `Linear` layers for projections
- Layer normalization: `LayerNorm` for stable training
- SE(3) equivariance: Coordinate-aware message passing requires extending `RuvectorLayer` to handle 3D positions. This needs a separate `EquivariantLayer` that maintains separate scalar (invariant) and vector (equivariant) channels.
- Protein feature encoders: MSA processing, co-evolution calculation, ESM-2 embedding extraction
- Contact map evaluation: Precision@L and precision@L/5 metrics for structure prediction
- PPI training data pipeline: Integration with STRING, BioGRID, and IntAct databases
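Of these gaps, contact-map evaluation is the easiest to pin down: precision@k sorts predictions by probability, keeps the top k pairs (k = L for precision@L, k = L/5 for precision@L/5, with L the sequence length), and reports the fraction found in the true contact map. A minimal std-only sketch:

```rust
use std::collections::HashSet;

/// Precision@k over sparse contact predictions: sort by predicted probability,
/// keep the top k pairs, report the fraction present in the true contact set.
fn precision_at_k(
    mut predicted: Vec<(usize, usize, f32)>,
    truth: &HashSet<(usize, usize)>,
    k: usize,
) -> f32 {
    // Sort descending by probability (probabilities are finite, so unwrap is safe).
    predicted.sort_by(|a, b| b.2.partial_cmp(&a.2).unwrap());
    let top = &predicted[..k.min(predicted.len())];
    if top.is_empty() {
        return 0.0;
    }
    let hits = top.iter().filter(|t| truth.contains(&(t.0, t.1))).count();
    hits as f32 / top.len() as f32
}

fn main() {
    let truth: HashSet<(usize, usize)> = [(0, 8), (1, 9)].into_iter().collect();
    let preds = vec![(0, 8, 0.9), (2, 10, 0.8), (1, 9, 0.7), (3, 11, 0.1)];
    // Top-2 by probability: (0,8) is a true contact, (2,10) is not.
    println!("P@2 = {}", precision_at_k(preds, &truth, 2)); // 0.5
}
```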
| Task | Target | Current Capability |
|---|---|---|
| Residue contact prediction (300 residues) | < 100 ms | ✅ Achievable with RuvectorLayer (8 layers) |
| PPI prediction (single pair) | < 10 ms | ✅ Achievable with RuvectorLayer (3 layers) |
| Variant structural impact | < 500 ms | ✅ Two forward passes + comparison |
| Batch PPI prediction (1000 pairs) | < 5 seconds | ✅ Parallelizable with batch inference |
| Method | Contact Precision@L | PPI AUROC | Handles Variants |
|---|---|---|---|
| AlphaFold2 | 0.90 | N/A | ❌ |
| ESMFold | 0.85 | N/A | ❌ |
| ProteinMPNN | N/A | N/A | ❌ (inverse design) |
| GearNet | 0.70 | 0.88 | ❌ |
| RuVector GNN | 0.65-0.75 (target) | 0.80-0.85 (target) | ✅ |
RuVector advantage: Native integration with variant calling pipeline (ADR-001-004), enabling real-time variant→structure→interaction effect prediction.
- Native variant integration: Directly connects genomic variants to protein-level effects
- Practical implementation: Uses the existing `ruvector-gnn` API without requiring new layers
- Interpretable: Contact maps and PPI scores are clinically actionable
- Scalable: Message passing scales to proteome-wide interaction networks
- No SE(3) equivariance yet: Current implementation doesn't guarantee rotation/translation invariance
- Lower accuracy than AlphaFold2: Contact prediction is 10-15% below SOTA structure predictors
- Requires training data: PPI and contact prediction need labeled protein structures and interaction databases
- MSA dependency: Contact prediction degrades without multiple sequence alignments
- PPI noise: Experimental interaction databases have 20-30% false positive rate
- Generalization: Models trained on human proteins may not transfer to pathogens
- Lin, Z. et al. (2023). "Evolutionary-scale prediction of atomic-level protein structure with a language model." Science, 379, 1123-1130. (ESMFold)
- Jumper, J. et al. (2021). "Highly accurate protein structure prediction with AlphaFold." Nature, 596, 583-589. (AlphaFold2 Evoformer)
- Dauparas, J. et al. (2022). "Robust deep learning-based protein sequence design using ProteinMPNN." Science, 378, 49-56. (ProteinMPNN)
- Zhang, Z. et al. (2023). "Protein Representation Learning by Geometric Structure Pretraining." ICLR 2023. (GearNet)
- Szklarczyk, D. et al. (2023). "The STRING database in 2023: protein-protein association networks and functional enrichment analyses." Nucleic Acids Research, 51(D1), D483-D489. (STRING PPI database)
- ADR-001: RuVector Core Architecture (HNSW index for protein similarity)
- ADR-003: Genomic Vector Index (variant embeddings feed into protein models)
- ADR-006: Temporal Epigenomic Engine (integrates with gene expression changes)