Automated variant network assembly and prediction framework

This repository implements:

An approach for deriving variant-specific networks from the INDRA Database, connecting genetic variants to their downstream biological and disease processes, potentially via molecular intermediaries. (indra_variants.generate module)
A web application for browsing and interacting with these networks, deployed at https://variants.indra.bio. (indra_variants.app module)
An integrative, transformer-based prediction framework that uses the variant-effect networks, protein language model embeddings, and other features to predict variant-effect mechanisms. (indra_variants.predict module)

Requirements

Python >= 3.9

Data availability:

Place the following files under data/:

training_feature_input.tsv - input features for training
genes_to_pmids.tsv - mapping of proteins to relevant publications
label_classified.tsv - mapping of 1085 biological process (bp)/disease labels to 30 categories
human_domains.tsv - domain info from UniProt uniprot_sprot.dat.gz (downloaded May 13, 2025)
clinvar_patho_subset.tsv.gz - fields extracted from the ClinVar clinvar_variant_summary.txt (.gz) file
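
Before running the pipeline it can help to confirm that the files listed above are in place. The following is a minimal sketch, not part of the repository: the helper name missing_data_files and the REQUIRED_DATA_FILES constant are hypothetical, and the file names are copied from the list above.

```python
from pathlib import Path

# File names copied from the "Data availability" list; this constant and
# the helper below are illustrative and do not exist in the repository.
REQUIRED_DATA_FILES = [
    "training_feature_input.tsv",
    "genes_to_pmids.tsv",
    "label_classified.tsv",
    "human_domains.tsv",
    "clinvar_patho_subset.tsv.gz",
]


def missing_data_files(data_dir="data"):
    """Return the required file names that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in REQUIRED_DATA_FILES if not (root / name).is_file()]
```

Calling missing_data_files() from the repository root returns an empty list when all five files are under data/.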

Usage

Generate protein embeddings

Use predict/seq_embedding/embedding.py to convert protein features into ESM-2 embeddings for downstream tasks.

input: data/training_feature_input.tsv
output: training_feature_esm2.tsv

Extract knowledge-graph paths

Use predict/gnn_pretraining/extract_triples.py to extract triples (subject, rel, object) from literature-derived causal paths.

input: training_feature_esm2.tsv
output: triples.csv; triples_unique.csv; variant_paths.tsv
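
To illustrate what turning a causal path into (subject, rel, object) triples can look like, here is a minimal sketch. It assumes a path is stored as an alternating list of node, relation, node, and so on; that layout, the function names, and the placeholder node names are assumptions for illustration and may differ from what extract_triples.py actually reads and writes.

```python
def path_to_triples(path):
    """Split [node, rel, node, rel, node, ...] into (subject, rel, object) triples.

    The alternating-list layout is an assumption made for this sketch.
    """
    if len(path) < 3 or len(path) % 2 == 0:
        raise ValueError("a path needs an odd number of items, at least 3")
    return [(path[i], path[i + 1], path[i + 2]) for i in range(0, len(path) - 2, 2)]


def unique_triples(triples):
    """Drop duplicate triples while keeping first-seen order."""
    return list(dict.fromkeys(triples))


# Placeholder paths: two variants that share a downstream edge.
paths = [
    ["variant_1", "activates", "protein_A", "inhibits", "process_X"],
    ["variant_2", "inhibits", "protein_A", "inhibits", "process_X"],
]
all_triples = [t for p in paths for t in path_to_triples(p)]
deduplicated = unique_triples(all_triples)
```

Here all_triples has four entries and deduplicated has three, because the edge from protein_A to process_X occurs in both paths. This mirrors the distinction between triples.csv and triples_unique.csv.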

Generate node embeddings using GNN

Use predict/gnn_pretraining/rgcn_pretrain.py to learn structural representations of nodes in the knowledge graph.

input: triples_unique.csv
output: node_embeds.pt (graph node embedding weights)
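
For readers unfamiliar with relational graph convolution, the sketch below shows one propagation step in plain Python: each node combines its own transformed vector with the per-relation mean of its incoming neighbours' transformed vectors, followed by a ReLU. This is a generic illustration of the R-GCN idea under simplifying assumptions (no basis decomposition, no training loop, lists instead of tensors); it is not the code in rgcn_pretrain.py, and all names here are hypothetical.

```python
from collections import defaultdict


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def rgcn_layer(h, triples, w_rel, w_self):
    """One relational graph convolution step.

    h: dict mapping node -> feature vector
    triples: (subject, rel, object) edges; messages flow subject -> object
    w_rel: dict mapping relation -> weight matrix
    w_self: weight matrix for the self-connection
    """
    # Group the incoming neighbours of every node by relation type.
    incoming = defaultdict(lambda: defaultdict(list))
    for subj, rel, obj in triples:
        incoming[obj][rel].append(subj)

    updated = {}
    for node, vec in h.items():
        total = matvec(w_self, vec)
        for rel, neighbours in incoming[node].items():
            for nb in neighbours:
                msg = matvec(w_rel[rel], h[nb])
                # Normalise by the number of neighbours under this relation.
                total = [t + m / len(neighbours) for t, m in zip(total, msg)]
        updated[node] = [max(0.0, t) for t in total]  # ReLU
    return updated
```

In practice this would be implemented with a tensor library and learned weights; the plain-Python version only makes the message-passing rule explicit.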

Split dataset

Use predict/train/dataset_split.py to split the dataset randomly for model training and evaluation.

input: data/label_classified.tsv; variant_paths.tsv
output: splits/(splits_index)
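
A reproducible random split can be sketched as follows. The 80/10/10 fractions, the fixed seed, the split names, and the function name are assumptions chosen for illustration; dataset_split.py may use different proportions or a different output layout.

```python
import random


def random_split(ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle the unique ids with a fixed seed and cut them into three parts.

    The fractions and seed are illustrative defaults, not values taken
    from the repository.
    """
    unique_ids = sorted(set(ids))  # sort first so the result depends only on the seed
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_train = int(len(unique_ids) * fractions[0])
    n_val = int(len(unique_ids) * fractions[1])
    return {
        "train": unique_ids[:n_train],
        "val": unique_ids[n_train:n_train + n_val],
        "test": unique_ids[n_train + n_val:],
    }
```

Splitting on unique variant identifiers rather than on rows keeps the same variant from appearing in more than one partition.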

Build training dataset

Use predict/train/build_train_dataset.py to assemble the complete training dataset by combining embeddings, paths, and label mappings.

input: training_feature_esm2.tsv; variant_paths.tsv; data/label_classified.tsv; node_embeds.pt
output: path_dataset_bag_full.pt; W_var.pt

Train model

Use predict/train/training.py.

input: path_dataset_bag_full.pt

Run the script with the -h argument to see additional arguments.

Predict

Use predict/train/prediction.py.

Run the script with the -h argument to see additional arguments.
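
The steps above form a chain in which each script consumes files produced earlier. The sketch below records the inputs and outputs exactly as listed in this README and checks that every input is either one of the provided data files or the output of a previous step. The STEPS and PROVIDED constants and the unmet_inputs helper are hypothetical and exist only for this illustration; files are matched by base name, so a data/ prefix is ignored. The prediction step is omitted because its inputs are not listed.

```python
from pathlib import Path

# Files the user places under data/ (from "Data availability").
PROVIDED = [
    "data/training_feature_input.tsv",
    "data/genes_to_pmids.tsv",
    "data/label_classified.tsv",
    "data/human_domains.tsv",
    "data/clinvar_patho_subset.tsv.gz",
]

# (script, inputs, outputs) in the order given under "Usage".
STEPS = [
    ("predict/seq_embedding/embedding.py",
     ["data/training_feature_input.tsv"], ["training_feature_esm2.tsv"]),
    ("predict/gnn_pretraining/extract_triples.py",
     ["training_feature_esm2.tsv"],
     ["triples.csv", "triples_unique.csv", "variant_paths.tsv"]),
    ("predict/gnn_pretraining/rgcn_pretrain.py",
     ["triples_unique.csv"], ["node_embeds.pt"]),
    ("predict/train/dataset_split.py",
     ["data/label_classified.tsv", "variant_paths.tsv"], ["splits/"]),
    ("predict/train/build_train_dataset.py",
     ["training_feature_esm2.tsv", "variant_paths.tsv",
      "data/label_classified.tsv", "node_embeds.pt"],
     ["path_dataset_bag_full.pt", "W_var.pt"]),
    ("predict/train/training.py", ["path_dataset_bag_full.pt"], []),
]


def unmet_inputs(steps=STEPS, provided=PROVIDED):
    """Return (script, input) pairs whose input is not yet available."""
    available = {Path(p).name for p in provided}
    problems = []
    for script, inputs, outputs in steps:
        for item in inputs:
            if Path(item).name not in available:
                problems.append((script, item))
        available.update(Path(o).name for o in outputs)
    return problems
```

With the steps in the documented order, unmet_inputs() returns an empty list; reordering the steps surfaces the first script whose input has not been generated yet.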
