PRSAggregator

A modular framework for structural profiling, visualization, and aggregation of polygenic risk scores

Overview Diagram

Background

Polygenic Risk Scores (PRS) are widely used to estimate genetic susceptibility to complex diseases. For many common traits and diseases, multiple PRS have been developed by different studies using diverse cohorts, methodologies, and SNP selection strategies.

Although the PGS Catalog provides harmonized PRS data, researchers still lack practical tools to compare multiple PRS at the structural level and to understand how these scores relate to one another before downstream use.

Motivation

Aggregating multiple PRS has the potential to improve robustness and generalizability. However, PRS aggregation is challenging because:

Different PRS often use partially overlapping but non-identical SNP sets
Redundancy and complementarity between PRS are unclear
PRS selection is often arbitrary and poorly justified

Before aggregating PRS, it is essential to understand how they overlap and differ.

What This Project Does

This project provides a framework to summarize, visualize, and explore overlap among multiple PRS using harmonized data from the PGS Catalog.

Specifically, the project contains three main parts:

PRS Profiling: A wrapper pipeline that summarizes SNP- and gene-level information across multiple PRS and visualizes their overlap with UpSet plots.
PRS Locus Viewer: An interactive tool for exploring SNPs and genes in their genomic context.
PRS Federated Representation: A representation learning approach for PRS across ancestries and traits from different studies.

1. PRS Profiling

Detailed information is located under /PRS_Structural_Profiling/

Motivation

This module provides a reproducible pipeline to summarize and structurally compare multiple polygenic risk scores (PRS) prior to aggregation.

The goal of this track is to answer a fundamental question:

Before aggregating PRS, how similar are they—at the SNP and gene levels?

Methods

Input file format

PRS scoring files downloaded from the PGS Catalog (https://www.pgscatalog.org), harmonized to genome build GRCh37.

Each input file must include the following columns:

hm_chr — chromosome (numeric, no chr prefix)
hm_pos — base-pair position (GRCh37 / hg19)
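Before running the pipeline, it can help to confirm that a downloaded file actually carries these columns. The following is a minimal Python sketch, not part of the R pipeline: the function names are hypothetical, and it assumes the usual PGS Catalog layout of `#`-prefixed metadata lines followed by a tab-separated header, with empty `hm_chr` / `hm_pos` values for variants that could not be harmonized.

```python
import csv
import gzip
import io

REQUIRED_COLUMNS = ("hm_chr", "hm_pos")  # the two columns listed above


def read_scoring_rows(handle):
    """Yield one dict per variant from an open scoring file (text mode).

    Lines starting with '#' are treated as metadata and skipped; the first
    remaining line is taken as the tab-separated column header.
    """
    data_lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    for row in reader:
        # Assumption: rows with empty coordinates were not harmonized; skip them.
        if row["hm_chr"] and row["hm_pos"]:
            yield row


def read_scoring_file(path):
    """Open a gzipped scoring file and return its usable rows as a list."""
    with gzip.open(path, "rt") as handle:
        return list(read_scoring_rows(handle))
```

For example, `read_scoring_file("PGS000020_hmPOS_GRCh37.txt.gz")` would return the rows with usable coordinates, or raise `ValueError` if a required column is absent.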

Example command to run the pipeline

Rscript Rscript_PRSAggregator_Summarization.R \
  --files PGS000020_hmPOS_GRCh37.txt.gz,PGS000804_hmPOS_GRCh37.txt.gz,PGS001818_hmPOS_GRCh37.txt.gz \
  --out results \
  --flank 50000
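The `--flank` value presumably widens each gene interval by that many base pairs on both sides before SNPs are assigned to genes. The Python sketch below illustrates that idea only; it is not taken from the R script, and the function name, tuple layouts, and gene names are hypothetical.

```python
def snps_near_genes(snps, genes, flank=50_000):
    """Map each SNP to the genes whose flank-extended interval contains it.

    snps  : list of (chrom, pos) tuples
    genes : list of (gene_name, chrom, start, end) tuples with start <= end
    flank : base pairs added on both sides of every gene (0 = genic only)

    Returns {(chrom, pos): set of gene names}; SNPs with no hit are omitted.
    """
    hits = {}
    for chrom, pos in snps:
        for name, gene_chrom, start, end in genes:
            if chrom == gene_chrom and start - flank <= pos <= end + flank:
                hits.setdefault((chrom, pos), set()).add(name)
    return hits


# Toy coordinates, not real annotation.
toy_genes = [("GENE_A", "1", 100_000, 200_000), ("GENE_B", "1", 260_000, 300_000)]
toy_snps = [("1", 150_000), ("1", 230_000), ("2", 150_000)]

genic = snps_near_genes(toy_snps, toy_genes, flank=0)
flanked = snps_near_genes(toy_snps, toy_genes, flank=50_000)
```

This brute-force loop is quadratic in the number of SNPs and genes; a real pipeline would more likely use an interval-overlap routine, but the inclusion rule is the same.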

Results

As an example, we profiled three PGS Catalog scores for type 2 diabetes (T2D).

Structural summary of PRSs for Type 2 Diabetes (T2D)

PRS ID	# SNPs	SNPs within genes	# Genes (genic)	SNPs within ±50kb	# Genes (±50kb)
PGS000020	7,502	3,541	2,735	5,227	7,624
PGS000804	578	342	366	496	1,183
PGS001818	30,745	14,137	5,084	21,379	12,805

UpSet plot (SNP overlap)

UpSet plot (Gene overlap)
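The counts behind an UpSet plot can be sketched in a few lines of Python. This is an illustration rather than the pipeline's code: it assumes the distinct-intersection convention (each element is counted once, under the exact combination of scores that contain it), and the score names and locus identifiers are made up.

```python
from collections import Counter


def exclusive_intersections(sets_by_name):
    """Count elements by the exact combination of sets they belong to.

    sets_by_name : {set_name: set of SNP or gene identifiers}
    Returns a Counter keyed by frozenset of set names.
    """
    membership = {}
    for name, items in sets_by_name.items():
        for item in items:
            membership.setdefault(item, set()).add(name)
    return Counter(frozenset(names) for names in membership.values())


# Toy SNP sets keyed by hypothetical score names.
toy_sets = {
    "score_A": {"1:100", "1:200", "2:50"},
    "score_B": {"1:200", "2:50", "3:10"},
    "score_C": {"2:50"},
}
counts = exclusive_intersections(toy_sets)
```

Each key of `counts` corresponds to one column of the UpSet matrix, and its value is the height of the bar above it.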

2. PRS Locus Viewer

PRS Locus Viewer is an interactive Dash application designed for visualizing Polygenic Risk Score (PRS) variants in their genomic context. It allows researchers to inspect SNP effect sizes across multiple scoring files, map variants to nearby genes, and explore specific loci dynamically.

Key Features

Multi-Score Comparison: Overlay effect weights from multiple PRS files side-by-side.
Interactive Visualization: Clickable SNP tracks with heatmap-style coloring based on effect size.
Gene Mapping: Automatically identifies and visualizes genes within a configurable window (e.g., ±25kb) of reported SNPs.
Locus Zoom: Search by rsID or Gene Symbol to zoom into specific genomic regions.

PRS_viewer_preview.mp4

3. Federated Representation Learning for Polygenic Risk Scores

Motivation

Folder: https://github.com/collaborativebioinformatics/PRSAggretator/tree/main/Federated%20Representation

This module provides a representation learning framework for polygenic risk scores (referred to here as PGS, following the PGS Catalog) that enables systematic analysis of genetic architecture across ancestries and studies, without requiring individual-level genotype data.

The framework:

Integrates heterogeneous PGS scoring files across ancestries and studies
Harmonizes variants at the locus (genomic position) level
Enriches variants with biological annotations (genes, gene regions, mutation types)

It learns interpretable embeddings at two levels:

PGS-level embeddings → cohort / ancestry representations
Variant-level embeddings → locus representations

These embeddings are interpreted through the lens of:

Ancestry
Disease / trait
Cross-ancestry sharing of variants

The entire pipeline naturally aligns with a federated learning perspective, where each PGS acts as a client and shared vs ancestry-specific signals emerge from the learned representations.

Methods and Results

1. Data Harmonization & Feature Construction

Parsed PGS Catalog scoring files across multiple ancestries
Unified heterogeneous formats (different genome builds, weight types)
Defined a canonical locus identifier using harmonized coordinates: locus_id = hm_chr : hm_pos
Normalized effect sizes into a common numeric space (weight_scaled)
Annotated each locus with genomic information (genes, gene regions, mutation types) using Ensembl VEP
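The locus identifier and weight normalization steps can be sketched as follows. The `locus_id` format follows the definition above; the scaling rule (dividing by the largest absolute weight within each score) is an assumption chosen for illustration, since the exact normalization behind `weight_scaled` is not specified here. The function names are hypothetical, and `effect_weight` is assumed to be the weight column of the scoring file.

```python
def add_locus_ids(rows):
    """Return copies of the variant rows with locus_id = 'hm_chr:hm_pos'."""
    return [dict(row, locus_id=f"{row['hm_chr']}:{row['hm_pos']}") for row in rows]


def scale_weights(rows, weight_key="effect_weight"):
    """Return copies of the rows with a weight_scaled value in [-1, 1].

    Assumed scheme: divide every weight by the largest absolute weight
    found in the same score, so scores on different scales become comparable.
    """
    weights = [float(row[weight_key]) for row in rows]
    max_abs = max((abs(w) for w in weights), default=0.0)
    return [
        dict(row, weight_scaled=(w / max_abs if max_abs else 0.0))
        for row, w in zip(rows, weights)
    ]


# Toy rows, not real PGS Catalog data.
toy_rows = [
    {"hm_chr": "1", "hm_pos": "100", "effect_weight": "0.5"},
    {"hm_chr": "2", "hm_pos": "50", "effect_weight": "-2.0"},
]
prepared = scale_weights(add_locus_ids(toy_rows))
```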

2. Autoencoder I — PGS-Level Representation Learning

Objective:
Learn compact embeddings that capture how each PGS distributes genetic risk across loci.

Input:

Sparse matrix: PGS × loci
Values = normalized effect sizes
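One way to picture this input is a dictionary-of-dictionaries sparse matrix with one row per PGS and one column per locus. The sketch below is illustrative only; the function names and the row layout (dicts with `locus_id` and `weight_scaled` keys) are assumptions, and the actual implementation may use a dedicated sparse-matrix library instead.

```python
def build_sparse_matrix(scores):
    """Build a PGS x loci sparse matrix from per-score variant rows.

    scores : {pgs_id: list of dicts with 'locus_id' and 'weight_scaled'}
    Returns (matrix, loci) where loci is the sorted column order and
    matrix is {pgs_id: {column_index: weight}}; absent entries mean 0.
    """
    loci = sorted({row["locus_id"] for rows in scores.values() for row in rows})
    column = {locus: j for j, locus in enumerate(loci)}
    matrix = {
        pgs_id: {column[row["locus_id"]]: row["weight_scaled"] for row in rows}
        for pgs_id, rows in scores.items()
    }
    return matrix, loci


def dense_row(sparse_row, n_columns):
    """Expand one sparse row into a plain list, filling gaps with 0.0."""
    return [sparse_row.get(j, 0.0) for j in range(n_columns)]


# Toy scores with hypothetical identifiers.
toy_scores = {
    "score_A": [
        {"locus_id": "1:100", "weight_scaled": 0.25},
        {"locus_id": "2:50", "weight_scaled": -1.0},
    ],
    "score_B": [{"locus_id": "2:50", "weight_scaled": 0.5}],
}
matrix, loci = build_sparse_matrix(toy_scores)
```

Because most loci appear in only a few scores, storing just the non-zero entries keeps memory use proportional to the number of variants rather than to scores times loci.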

Architecture (PGS Autoencoder):

Encoder:
- Dense projection (sparse → latent)
- Nonlinear activation
Decoder:
- Reconstructs original PGS risk profile
Loss:
- Masked reconstruction loss (emphasizes non-zero effects)
Evaluation:
- Train/test split
- Cosine similarity in input space
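The loss and the evaluation metric can be illustrated without any deep learning framework. The sketch below shows one plausible form of a masked reconstruction loss, in which zero entries are down-weighted so that non-zero effects dominate, together with cosine similarity between an input and its reconstruction. The `zero_weight` parameter and both function names are assumptions, not the project's actual definitions.

```python
import math


def masked_mse(target, recon, zero_weight=0.1):
    """Weighted mean squared error that emphasizes non-zero targets.

    Entries where target != 0 get weight 1.0; zero entries get zero_weight
    (0.0 ignores them entirely). The sum is normalized by the total weight.
    """
    total = 0.0
    weight_sum = 0.0
    for t, r in zip(target, recon):
        w = 1.0 if t != 0 else zero_weight
        total += w * (t - r) ** 2
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy risk profile and reconstruction.
profile = [1.0, 0.0, -0.5]
reconstruction = [0.5, 0.2, -0.5]
loss = masked_mse(profile, reconstruction, zero_weight=0.0)
similarity = cosine_similarity(profile, reconstruction)
```

With `zero_weight=0.0` the spurious 0.2 at the zero-effect locus is ignored; raising it penalizes reconstructions that invent effects where the score has none.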

Output:

One embedding per PGS representing:
- ancestry-specific genetic architecture
- disease-related signal
- similarity to other cohorts

PGS-Level Embeddings

PCA / UMAP plots
PGS embeddings cluster by ancestry even for the same disease, highlighting population-specific genetic architectures.

Cosine distance heatmaps
PGSs derived from similar ancestries show higher similarity in genetic risk profiles.

3. Autoencoder II — Variant-Level Representation Learning

Key Idea:
Transpose the problem and treat each variant as a vector across PGSs.

Variant = [effect in PGS₁, effect in PGS₂, …]

Objective:
Learn embeddings that capture how loci behave across ancestries and studies.

Architecture (Variant Autoencoder):

Input:
- Variant × PGS effect vectors
Encoder:
- Low-dimensional latent space (compact locus representation)
Decoder:
- Reconstructs variant effect profile
Filtering:
- Optional minimum PGS support (focus on shared variants)
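The transposition and the optional support filter can be sketched as follows. This is an illustration under stated assumptions: "support" is taken to mean the number of PGSs in which a locus appears, missing effects are filled with 0.0, and the function name and input layout are hypothetical.

```python
def variant_vectors(weights_by_pgs, min_support=1):
    """Turn per-PGS weights into one effect vector per variant.

    weights_by_pgs : {pgs_id: {locus_id: weight}}
    min_support    : keep only loci present in at least this many PGSs

    Returns (pgs_ids, vectors) where pgs_ids fixes the vector order and
    vectors is {locus_id: [weight in each PGS, 0.0 where absent]}.
    """
    pgs_ids = sorted(weights_by_pgs)
    loci = sorted({locus for w in weights_by_pgs.values() for locus in w})
    vectors = {}
    for locus in loci:
        support = sum(locus in weights_by_pgs[pgs_id] for pgs_id in pgs_ids)
        if support >= min_support:
            vectors[locus] = [weights_by_pgs[p].get(locus, 0.0) for p in pgs_ids]
    return pgs_ids, vectors


# Toy weights keyed by hypothetical score names.
toy_weights = {
    "score_A": {"1:100": 0.2, "2:50": -1.0},
    "score_B": {"2:50": 0.5},
}
order, all_variants = variant_vectors(toy_weights)
_, shared_variants = variant_vectors(toy_weights, min_support=2)
```

Setting `min_support=2` keeps only loci that at least two scores agree are relevant, which is the subset where cross-ancestry comparisons are most meaningful.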

Output:

One embedding per variant encoding:
- consistency vs heterogeneity across ancestries
- shared vs ancestry-specific behavior

Variant-Level Embeddings

Variant PCA / UMAP

Sharedness stratification (shared_2, shared_3, …)
Widely shared variants show structured dispersion, suggesting consistent but nuanced cross-population effects.

Federated Learning Perspective

Each PGS = client
Each trait can have a different model (AE)
Each variant = model parameter
Shared variants behave like global parameters
Ancestry-specific variants act as client-private signals

This framework enables privacy-preserving, interpretable analysis of genetic architecture across populations without pooling individual-level data.

Future directions

Future work should scale the framework beyond a small set of T2D scores to broader disease areas and a larger, more diverse set of ancestries and traits, enabling more robust conclusions about cross-phenotype and cross-population sharing.

Methodologically, harmonization can be strengthened by incorporating stricter allele alignment and quality controls, LD-aware locus grouping when appropriate, and expanded functional annotations (e.g., regulatory elements, eQTL links, pathway context) to improve the interpretability of variant embeddings.

A key next step is to connect “structural similarity” to “predictive behavior” by benchmarking whether embedding proximity and overlap metrics correlate with external performance measures, calibration, and generalizability, including evaluation of ensemble and aggregation strategies informed by these representations.

From a federated learning perspective, the client-like abstraction can be extended to institution- or biobank-level deployments where sharing is restricted, using privacy-preserving summaries and standardized outputs to enable cross-site comparison without centralizing sensitive data.

Finally, the locus viewer and profiling pipeline can evolve into a more automated, end-to-end toolkit with reproducible configuration, phenotype-centric organization, and a web-enabled interface that reduces manual file handling while preserving local execution options for restricted environments.

Initial Workflow

Contributors

Name	Email	ORCID	Institution
Ashok K. Sharma	ashoks773@gmail.com	https://orcid.org/0000-0002-2264-7628	Cedars-Sinai Medical Center, LA
Dmitriy Ivkov	Divkov@umich.edu	https://orcid.org/0009-0008-4536-3274	University of Michigan, Michigan
Jasmine Baker	jasmine.baker@bcm.edu	https://orcid.org/0000-0001-7545-6086	Baylor College of Medicine, Houston
Mengying Hu	meh251@pitt.edu	https://orcid.org/0000-0003-4827-3051	University of Pittsburgh, Pittsburgh
Qianqian Liang	qil57@pitt.edu	https://orcid.org/0000-0002-1737-5031	Population Health Sciences, Geisinger, Danville, PA
Shivank Sadasivan	ssadasiv@andrew.cmu.edu	https://orcid.org/0009-0004-4699-2129	Carnegie Mellon University, Pittsburgh
