collaborativebioinformatics/PRSAggretator

PRSAggregator

Workflow diagram

A modular framework for structural profiling, visualization, and aggregation of polygenic risk scores

Overview Diagram

[Figure: overview diagram]

Background

Polygenic Risk Scores (PRS) are widely used to estimate genetic susceptibility to complex diseases. For many common traits and diseases, multiple PRS have been developed by different studies using diverse cohorts, methodologies, and SNP selection strategies.

Although the PGS Catalog provides harmonized PRS data, researchers still lack practical tools to compare multiple PRS at the structural level and to understand how these scores relate to one another before downstream use.

Motivation

Aggregating multiple PRS has the potential to improve robustness and generalizability. However, PRS aggregation is challenging because:

  • Different PRS often use partially overlapping but non-identical SNP sets
  • Redundancy and complementarity between PRS are unclear
  • PRS selection is often arbitrary and poorly justified

Before aggregating PRS, it is essential to understand how they overlap and differ.

What This Project Does

This project provides a framework to summarize, visualize, and explore overlap among multiple PRS using harmonized data from the PGS Catalog.

Specifically, the project has three main parts:

  • PRS Profiling: A wrapper pipeline that summarizes SNP- and gene-level information across multiple PRS and visualizes their overlap with UpSet plots.
  • PRS Locus Viewer: An interactive tool for exploring SNPs and genes in their genomic context.
  • PRS Federated Representation: A representation learning approach for PRS across ancestries and traits from different studies.

1. PRS Profiling

Details are in /PRS_Structural_Profiling/.

Motivation

This module provides a reproducible pipeline to summarize and structurally compare multiple polygenic risk scores (PRS) prior to aggregation.

The goal of this track is to answer a fundamental question:

Before aggregating PRS, how similar are they—at the SNP and gene levels?

Methods

Input file format

PRS scoring files downloaded from the PGS Catalog [https://www.pgscatalog.org], in genome build GRCh37.

Each input file must include the following columns:

  • hm_chr — chromosome (numeric, no chr prefix)
  • hm_pos — base-pair position (GRCh37 / hg19)
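A quick pre-flight check of the required columns could look like the sketch below. It assumes the scoring files are tab-separated with metadata lines that start with `#`; that layout is an assumption here, not something this README states. The function names are hypothetical and are not part of the pipeline.

```python
import csv
import gzip
import io

REQUIRED_COLUMNS = ("hm_chr", "hm_pos")  # columns named in this README


def read_harmonized_rows(handle):
    """Yield one dict per variant from a tab-separated scoring file.

    Lines starting with '#' are treated as metadata and skipped
    (an assumption about the file layout, not taken from this README).
    """
    data_lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required column(s): {', '.join(missing)}")
    for row in reader:
        yield row


def open_scoring_file(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", newline="")
    return open(path, "r", newline="")


# In-memory toy file with invented content, so the sketch runs without downloads.
demo = io.StringIO("#genome_build=GRCh37\nrsID\thm_chr\thm_pos\nrs1\t1\t12345\n")
for row in read_harmonized_rows(demo):
    print(row["hm_chr"], row["hm_pos"])
```

For a real file, `read_harmonized_rows(open_scoring_file(path))` would apply the same check before the file is passed to the R script.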

Example command to run the pipeline

Rscript Rscript_PRSAggregator_Summarization.R \
  --files PGS000020_hmPOS_GRCh37.txt.gz,PGS000804_hmPOS_GRCh37.txt.gz,PGS001818_hmPOS_GRCh37.txt.gz \
  --out results \
  --flank 50000
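The `--flank 50000` value presumably corresponds to the "±50kb" columns in the results below: a SNP is assigned to a gene if it falls within the gene body extended by the flank on each side. The Python sketch below illustrates that idea only. The gene coordinates and names are invented, and the R script's actual logic may differ.

```python
from collections import namedtuple

Gene = namedtuple("Gene", "symbol chrom start end")


def genes_near(chrom, pos, genes, flank=0):
    """Return symbols of genes whose [start - flank, end + flank] interval contains pos."""
    return [
        g.symbol
        for g in genes
        if g.chrom == chrom and g.start - flank <= pos <= g.end + flank
    ]


# Hypothetical gene coordinates, for illustration only.
toy_genes = [
    Gene("GENE_A", "1", 100_000, 120_000),
    Gene("GENE_B", "1", 160_000, 180_000),
]

print(genes_near("1", 130_000, toy_genes))                # genic only: no match
print(genes_near("1", 130_000, toy_genes, flank=50_000))  # matches both toy genes
```

With a flank, one SNP can map to several genes, which is one plausible reason the ±50kb gene counts can exceed the SNP counts. The linear scan is fine for a toy example; real annotation sets would usually call for an interval index.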

Results

As an example, we applied the pipeline to three PGS Catalog scores for Type 2 Diabetes (T2D).

Structural summary of PRSs for Type 2 Diabetes (T2D)

PRS ID    | # SNPs | SNPs within genes | # Genes (genic) | SNPs within ±50kb | # Genes (±50kb)
PGS000020 | 7,502  | 3,541             | 2,735           | 5,227             | 7,624
PGS000804 | 578    | 342               | 366             | 496               | 1,183
PGS001818 | 30,745 | 14,137            | 5,084           | 21,379            | 12,805

UpSet plot (SNP overlap)

upset_snp

UpSet plot (Gene overlap)

upset_gene
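The counts behind an UpSet plot can be sketched as follows, assuming SNPs are matched on (chromosome, position) and that bars show exclusive intersections, that is, loci found in exactly one combination of scores. Both are assumptions; the function name and toy data are hypothetical.

```python
from collections import Counter


def upset_counts(score_loci):
    """Count loci by the exact combination of scores that contain them.

    score_loci maps a score ID to a set of (chromosome, position) keys.
    The result maps each combination (a sorted tuple of score IDs) to the
    number of loci found in exactly those scores and no others.
    """
    membership = {}
    for score_id, loci in score_loci.items():
        for locus in loci:
            membership.setdefault(locus, set()).add(score_id)
    return Counter(tuple(sorted(ids)) for ids in membership.values())


# Toy loci, not real PGS Catalog content.
toy = {
    "score_A": {("1", 100), ("1", 200), ("2", 300)},
    "score_B": {("1", 200), ("2", 300), ("3", 400)},
    "score_C": {("2", 300)},
}
for combo, n in sorted(upset_counts(toy).items()):
    print(combo, n)
```

The same function works for gene overlap if the sets hold gene symbols instead of position keys.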

2. PRS Locus Viewer

PRS Locus Viewer is an interactive Dash application designed for visualizing Polygenic Risk Score (PRS) variants in their genomic context. It allows researchers to inspect SNP effect sizes across multiple scoring files, map variants to nearby genes, and explore specific loci dynamically.

Key Features

  • Multi-Score Comparison: Overlay effect weights from multiple PRS files side-by-side.
  • Interactive Visualization: Clickable SNP tracks with heatmap-style coloring based on effect size.
  • Gene Mapping: Automatically identifies and visualizes genes within a configurable window (e.g., ±25kb) of reported SNPs.
  • Locus Zoom: Search by rsID or Gene Symbol to zoom into specific genomic regions.

Preview video: PRS_viewer_preview.mp4

3. Federated Representation Learning for Polygenic Risk Scores

Motivation

Folder: https://github.com/collaborativebioinformatics/PRSAggretator/tree/main/Federated%20Representation

This module provides a representation learning framework for polygenic risk scores (abbreviated PGS here, following the PGS Catalog) that enables systematic analysis of genetic architecture across ancestries and studies, without requiring individual-level genotype data.

The framework:

  • Integrates heterogeneous PGS scoring files across ancestries and studies
  • Harmonizes variants at the locus (genomic position) level
  • Enriches variants with biological annotations (genes, gene regions, mutation types)

It learns interpretable embeddings at two levels:

  • PGS-level embeddings → cohort / ancestry representations
  • Variant-level embeddings → locus representations

These embeddings are interpreted through the lens of:

  • Ancestry
  • Disease / trait
  • Cross-ancestry sharing of variants

The entire pipeline naturally aligns with a federated learning perspective, where each PGS acts as a client and shared vs ancestry-specific signals emerge from the learned representations.


Methods and Results

1. Data Harmonization & Feature Construction

  • Parsed PGS Catalog scoring files across multiple ancestries
  • Unified heterogeneous formats (different genome builds, weight types)
  • Defined a canonical locus identifier using harmonized coordinates: locus_id = hm_chr : hm_pos
  • Normalized effect sizes into a common numeric space (weight_scaled)
  • Annotated each locus with genomic information using Ensembl VEP
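The locus key and weight normalization steps can be sketched as below. The key format follows the `hm_chr : hm_pos` definition above. The scaling rule (dividing by the largest absolute weight within each score) is an assumption chosen for illustration, since this README does not say how `weight_scaled` is computed.

```python
def make_locus_id(hm_chr, hm_pos):
    """Build the canonical key described above: '<hm_chr>:<hm_pos>'."""
    return f"{hm_chr}:{int(hm_pos)}"


def scale_weights(weights):
    """Divide each weight by the largest absolute weight in the same score.

    Max-abs scaling is an assumption chosen for illustration; it keeps the
    sign of each effect and maps values into [-1, 1].
    """
    largest = max((abs(w) for w in weights), default=0.0)
    if largest == 0.0:
        return [0.0 for _ in weights]
    return [w / largest for w in weights]


# Toy rows: (hm_chr, hm_pos, effect_weight); values are invented.
rows = [("1", "12345", 0.2), ("2", "67890", -0.4), ("3", "555", 0.1)]
scaled = scale_weights([w for _, _, w in rows])
profile = {make_locus_id(c, p): s for (c, p, _), s in zip(rows, scaled)}
print(profile)
```

Scaling within each score keeps scores with different weight types on a comparable range. It does not make effect sizes directly comparable across weight types.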

2. Autoencoder I — PGS-Level Representation Learning

Objective:
Learn compact embeddings that capture how each PGS distributes genetic risk across loci.

Input:

  • Sparse matrix: PGS × loci
  • Values = normalized effect sizes

Architecture (PGS Autoencoder):

  • Encoder:
    • Dense projection (sparse → latent)
    • Nonlinear activation
  • Decoder:
    • Reconstructs original PGS risk profile
  • Loss:
    • Masked reconstruction loss (emphasizes non-zero effects)
  • Evaluation:
    • Train/test split
    • Cosine similarity in input space
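The loss and evaluation metric named above can be sketched in plain Python. The weighting scheme (full weight on non-zero targets, a smaller `zero_weight` on zeros) is one possible reading of "masked reconstruction loss" and is an assumption, as is the `zero_weight` parameter itself. A real implementation would more likely use a tensor library.

```python
import math


def masked_mse(target, reconstruction, zero_weight=0.1):
    """Weighted mean squared error that emphasizes non-zero targets.

    Non-zero entries get weight 1.0 and zero entries get `zero_weight`
    (a hypothetical hyperparameter; the project's actual masking scheme
    is not specified in this README).
    """
    weights = [1.0 if t != 0 else zero_weight for t in target]
    total = sum(w * (t - r) ** 2 for w, t, r in zip(weights, target, reconstruction))
    norm = sum(weights)
    return total / norm if norm else 0.0


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


profile = [0.5, 0.0, -1.0, 0.0]        # toy scaled weights for one PGS
reconstructed = [0.4, 0.1, -0.9, 0.0]  # toy decoder output
print(masked_mse(profile, reconstructed))
print(cosine_similarity(profile, reconstructed))
```

Down-weighting zeros matters because a PGS × loci matrix is mostly empty. With an unweighted loss, a model that predicts zero everywhere would already score well.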

Output:

  • One embedding per PGS representing:
    • ancestry-specific genetic architecture
    • disease-related signal
    • similarity to other cohorts

PGS-Level Embeddings

  • PCA / UMAP plots
    PGS embeddings cluster by ancestry even for the same disease, highlighting population-specific genetic architectures.
    [Figure: PCA / UMAP of PGS-level embeddings]
  • Cosine distance heatmaps
    PGSs derived from similar ancestries show higher similarity in genetic risk profiles.
    [Figure: cosine distance heatmap of PGS-level embeddings]

3. Autoencoder II — Variant-Level Representation Learning

Key Idea:
Transpose the problem and treat each variant as a vector across PGSs.

Variant = [effect in PGS₁, effect in PGS₂, …]

Objective:
Learn embeddings that capture how loci behave across ancestries and studies.

Architecture (Variant Autoencoder):

  • Input:
    • Variant × PGS effect vectors
  • Encoder:
    • Low-dimensional latent space (compact locus representation)
  • Decoder:
    • Reconstructs variant effect profile
  • Filtering:
    • Optional minimum PGS support (focus on shared variants)
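The transposition and the optional support filter can be sketched as follows. The function name is hypothetical. Encoding a missing locus as 0.0 is an assumption, and it conflates "not in this PGS" with "zero effect".

```python
def variant_vectors(pgs_profiles, min_support=1):
    """Turn per-PGS {locus_id: weight} maps into per-variant effect vectors.

    Returns (pgs_ids, vectors), where vectors maps each locus_id to a list
    of weights ordered like pgs_ids, with 0.0 where a PGS lacks the locus.
    Loci present in fewer than `min_support` scores are dropped.
    """
    pgs_ids = sorted(pgs_profiles)
    all_loci = set().union(*(p.keys() for p in pgs_profiles.values()))
    vectors = {}
    for locus in sorted(all_loci):
        support = sum(locus in pgs_profiles[pid] for pid in pgs_ids)
        if support >= min_support:
            vectors[locus] = [pgs_profiles[pid].get(locus, 0.0) for pid in pgs_ids]
    return pgs_ids, vectors


# Toy profiles with invented weights.
toy = {
    "PGS_A": {"1:100": 0.5, "2:300": -1.0},
    "PGS_B": {"1:100": 0.4, "3:400": 0.2},
    "PGS_C": {"1:100": 0.6, "2:300": -0.8},
}
ids, vecs = variant_vectors(toy, min_support=2)
print(ids)
for locus, vec in vecs.items():
    print(locus, vec)
```

With `min_support=2`, the locus that appears in only one toy score is dropped, so the autoencoder would see only variants shared by at least two scores.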

Output:

  • One embedding per variant encoding:
    • consistency vs heterogeneity across ancestries
    • shared vs ancestry-specific behavior

Variant-Level Embeddings

  • Variant PCA / UMAP
    [Figure: PCA / UMAP projections of variant-level embeddings (three screenshots)]
  • Sharedness stratification (shared_2, shared_3, …)
    Widely shared variants show structured dispersion, suggesting consistent but nuanced cross-population effects.
    [Figure: variant embeddings stratified by sharedness]

Federated Learning Perspective

  • Each PGS = client
  • Each trait can have a different model (AE)
  • Each variant = model parameter
  • Shared variants behave like global parameters
  • Ancestry-specific variants act as client-private signals

This framework enables privacy-preserving, interpretable analysis of genetic architecture across populations without pooling individual-level data.
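The client analogy can be made concrete with a small sketch that splits loci into shared ("global") and client-private sets using only per-score locus lists, with no individual-level data. The function name and the `min_clients` threshold are hypothetical choices for illustration, not part of the project code.

```python
from collections import defaultdict


def split_shared_private(client_loci, min_clients=2):
    """Partition loci into shared ("global") and client-private groups.

    client_loci maps a client (PGS) ID to a set of locus IDs. A locus is
    treated as shared when at least `min_clients` clients contain it;
    every other locus is listed under each client that contains it.
    """
    owners = defaultdict(set)
    for client, loci in client_loci.items():
        for locus in loci:
            owners[locus].add(client)
    shared = {locus for locus, ids in owners.items() if len(ids) >= min_clients}
    private = defaultdict(set)
    for locus, ids in owners.items():
        if locus not in shared:
            for client in ids:
                private[client].add(locus)
    return shared, dict(private)


# Toy clients with invented locus IDs.
toy_clients = {
    "PGS_A": {"1:100", "2:300"},
    "PGS_B": {"1:100", "3:400"},
    "PGS_C": {"1:100", "2:300", "4:500"},
}
shared, private = split_shared_private(toy_clients)
print(sorted(shared))
print({client: sorted(loci) for client, loci in sorted(private.items())})
```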


Future directions

Future work should scale the framework beyond a small set of T2D scores to broader disease areas and a larger, more diverse set of ancestries and traits, enabling more robust conclusions about cross-phenotype and cross-population sharing.

Methodologically, harmonization can be strengthened by incorporating stricter allele alignment and quality controls, LD-aware locus grouping when appropriate, and expanded functional annotations (e.g., regulatory elements, eQTL links, pathway context) to improve interpretability of variant embeddings.

A key next step is to connect “structural similarity” to “predictive behavior” by benchmarking whether embedding proximity and overlap metrics correlate with external performance measures, calibration, and generalizability, including evaluation of ensemble and aggregation strategies informed by these representations.

From a federated learning perspective, the client-like abstraction can be extended to institution- or biobank-level deployments where sharing is restricted, using privacy-preserving summaries and standardized outputs to enable cross-site comparison without centralizing sensitive data.

The locus viewer and profiling pipeline can evolve into a more automated, end-to-end toolkit with reproducible configuration, phenotype-centric organization, and a web-enabled interface that reduces manual file handling while preserving local execution options for restricted environments.

Initial Workflow

[Figure: initial workflow]

Contributors

Name | Email | ORCID | Institution
Ashok K. Sharma | ashoks773@gmail.com | https://orcid.org/0000-0002-2264-7628 | Cedars-Sinai Medical Center, LA
Dmitriy Ivkov | Divkov@umich.edu | https://orcid.org/0009-0008-4536-3274 | University of Michigan, Michigan
Jasmine Baker | jasmine.baker@bcm.edu | https://orcid.org/0000-0001-7545-6086 | Baylor College of Medicine, Houston
Mengying Hu | meh251@pitt.edu | https://orcid.org/0000-0003-4827-3051 | University of Pittsburgh, Pittsburgh
Qianqian Liang | qil57@pitt.edu | https://orcid.org/0000-0002-1737-5031 | Population Health Sciences, Geisinger, Danville, PA
Shivank Sadasivan | ssadasiv@andrew.cmu.edu | https://orcid.org/0009-0004-4699-2129 | Carnegie Mellon University, Pittsburgh

About

Polygenic Risk Aggregation in common diseases and phenotypes
