A modular framework for structural profiling, visualization, and aggregation of polygenic risk scores
Polygenic Risk Scores (PRS) are widely used to estimate genetic susceptibility to complex diseases. For many common traits and diseases, multiple PRS have been developed by different studies using diverse cohorts, methodologies, and SNP selection strategies.
Although the PGS Catalog provides harmonized PRS data, researchers still lack practical tools to compare multiple PRS at the structural level and to understand how these scores relate to one another before downstream use.
Aggregating multiple PRS has the potential to improve robustness and generalizability. However, PRS aggregation is challenging because:
- Different PRS often use partially overlapping but non-identical SNP sets
- Redundancy and complementarity between PRS are unclear
- PRS selection is often arbitrary and poorly justified
Before aggregating PRS, it is essential to understand how they overlap and differ.
This project provides a framework to summarize, visualize, and explore overlap among multiple PRS using harmonized data from the PGS Catalog.
Specifically, the project contains three main parts:
- PRS Profiling: A wrapper pipeline that summarizes SNP- and gene-level information across multiple PRS and visualizes their overlap with UpSet plots.
- PRS Locus Viewer: An interactive tool for exploring SNPs and genes in their genomic context.
- PRS Federated Representation: A representation learning approach for PRS across ancestries and traits from different studies.
Detailed information is located under /PRS_Structural_Profiling/
This module provides a reproducible pipeline to summarize and structurally compare multiple polygenic risk scores (PRS) prior to aggregation.
The goal of this track is to answer a fundamental question:
Before aggregating PRS, how similar are they—at the SNP and gene levels?
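One way to make this question concrete is to treat each PRS as a set of loci and compare the sets. The sketch below is illustrative only and is not taken from the pipeline's R script: the function names and the toy locus sets are assumptions. It computes pairwise Jaccard overlap and the exact-membership counts that an UpSet plot displays as bars.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_jaccard(score_sets):
    """Return {(id_a, id_b): jaccard} for every unordered pair of scores."""
    return {
        (x, y): jaccard(score_sets[x], score_sets[y])
        for x, y in combinations(sorted(score_sets), 2)
    }


def membership_counts(score_sets):
    """Count loci per exact membership pattern (the bars of an UpSet plot)."""
    pattern = {}
    for score_id, loci in score_sets.items():
        for locus in loci:
            pattern.setdefault(locus, set()).add(score_id)
    return Counter(frozenset(ids) for ids in pattern.values())


# Toy locus sets (made-up positions, not real PGS content).
toy = {
    "PRS_A": {"1:100", "1:200", "2:300"},
    "PRS_B": {"1:200", "2:300", "3:400"},
    "PRS_C": {"2:300"},
}
overlap = pairwise_jaccard(toy)
patterns = membership_counts(toy)
```

The same two functions apply unchanged at the gene level if the sets hold gene symbols instead of locus keys.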
Input: PRS scoring files downloaded from the PGS Catalog [https://www.pgscatalog.org], genome build GRCh37.
Each input file must include the following columns:
- `hm_chr`: chromosome (numeric, no `chr` prefix)
- `hm_pos`: base-pair position (GRCh37 / hg19)
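Before running the pipeline, it can help to confirm that a file actually carries these columns. The following is a minimal sketch, assuming the scoring files are tab-separated with `#`-prefixed header comment lines. The helper names `read_scoring_rows` and `load_scoring_file` are hypothetical and not part of the pipeline, and the inline example content is made up.

```python
import csv
import gzip
import io

REQUIRED_COLUMNS = ("hm_chr", "hm_pos")


def read_scoring_rows(handle):
    """Parse a tab-separated scoring file, skipping '#' header comment lines."""
    data_lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    rows = []
    for row in reader:
        # Skip variants without harmonized coordinates (empty fields).
        if not row["hm_chr"] or not row["hm_pos"]:
            continue
        row["hm_pos"] = int(row["hm_pos"])
        rows.append(row)
    return rows


def load_scoring_file(path):
    """Open a plain or gzip-compressed scoring file and parse it."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return read_scoring_rows(handle)


# Made-up file content for illustration; real files have more columns.
example = io.StringIO(
    "#genome_build=GRCh37\n"
    "rsID\teffect_weight\thm_chr\thm_pos\n"
    "rs1\t0.12\t1\t1000\n"
    "rs2\t-0.05\t\t\n"
)
rows = read_scoring_rows(example)
```

Dropping rows with empty coordinates is a choice made for this sketch; whether the R script does the same is not stated here.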
Rscript Rscript_PRSAggregator_Summarization.R \
--files PGS000020_hmPOS_GRCh37.txt.gz,PGS000804_hmPOS_GRCh37.txt.gz,PGS001818_hmPOS_GRCh37.txt.gz \
--out results \
--flank 50000
As an example, we applied the pipeline to three Type 2 Diabetes scores from the PGS Catalog:
| PRS ID | # SNPs | SNPs within genes | # Genes (genic) | SNPs within ±50kb | # Genes (±50kb) |
|---|---|---|---|---|---|
| PGS000020 | 7,502 | 3,541 | 2,735 | 5,227 | 7,624 |
| PGS000804 | 578 | 342 | 366 | 496 | 1,183 |
| PGS001818 | 30,745 | 14,137 | 5,084 | 21,379 | 12,805 |
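The genic and ±50kb columns can be read as the result of an interval lookup with and without a flank. The sketch below is a simplified illustration under assumed data structures (a list of gene dicts and a list of `(chrom, pos)` tuples), not the logic of the R script. It uses a linear scan for clarity, where a real pipeline would more likely use an interval index.

```python
def genes_near(chrom, pos, genes, flank=0):
    """Return symbols of genes whose [start - flank, end + flank] contains pos."""
    return [
        g["symbol"]
        for g in genes
        if g["chrom"] == chrom and g["start"] - flank <= pos <= g["end"] + flank
    ]


def summarize(snps, genes, flank):
    """Count SNPs hitting at least one gene, and the distinct genes hit."""
    snps_hit, genes_hit = 0, set()
    for chrom, pos in snps:
        hits = genes_near(chrom, pos, genes, flank)
        if hits:
            snps_hit += 1
            genes_hit.update(hits)
    return {"snps": snps_hit, "genes": len(genes_hit)}


# Hypothetical gene models and SNP positions, for illustration only.
toy_genes = [
    {"symbol": "GENE1", "chrom": "1", "start": 100_000, "end": 120_000},
    {"symbol": "GENE2", "chrom": "1", "start": 160_000, "end": 180_000},
]
toy_snps = [("1", 110_000), ("1", 140_000), ("2", 110_000)]

genic = summarize(toy_snps, toy_genes, flank=0)
flanked = summarize(toy_snps, toy_genes, flank=50_000)
```

With a flank, one SNP can fall inside the windows of several genes, which is consistent with the table rows where the ±50kb gene count exceeds the ±50kb SNP count.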
PRS Locus Viewer is an interactive Dash application designed for visualizing Polygenic Risk Score (PRS) variants in their genomic context. It allows researchers to inspect SNP effect sizes across multiple scoring files, map variants to nearby genes, and explore specific loci dynamically.
- Multi-Score Comparison: Overlay effect weights from multiple PRS files side-by-side.
- Interactive Visualization: Clickable SNP tracks with heatmap-style coloring based on effect size.
- Gene Mapping: Automatically identifies and visualizes genes within a configurable window (e.g., ±25kb) of reported SNPs.
- Locus Zoom: Search by rsID or Gene Symbol to zoom into specific genomic regions.
Preview video: PRS_viewer_preview.mp4
Folder: https://github.com/collaborativebioinformatics/PRSAggretator/tree/main/Federated%20Representation
A representation learning framework for Polygenic Risk Scores (PGS) that enables systematic analysis of genetic architecture across ancestries and studies, without requiring individual-level genotype data.
The framework:
- Integrates heterogeneous PGS scoring files across ancestries and studies
- Harmonizes variants at the locus (genomic position) level
- Enriches variants with biological annotations (genes, gene regions, mutation types)
It learns interpretable embeddings at two levels:
- PGS-level embeddings → cohort / ancestry representations
- Variant-level embeddings → locus representations
These embeddings are interpreted through the lens of:
- Ancestry
- Disease / trait
- Cross-ancestry sharing of variants
The entire pipeline naturally aligns with a federated learning perspective, where each PGS acts as a client and shared vs ancestry-specific signals emerge from the learned representations.
- Parsed PGS Catalog scoring files across multiple ancestries
- Unified heterogeneous formats (different genome builds, weight types)
- Defined a canonical locus identifier using harmonized coordinates: locus_id = hm_chr : hm_pos
- Normalized effect sizes into a common numeric space (weight_scaled)
- Annotated each locus with genomic information using Ensembl VEP
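The locus key and the scaling step can be sketched as follows. The `locus_id` format follows the definition above, but the normalization scheme is an assumption: max-abs scaling per score is used here only as a stand-in, since the exact method behind `weight_scaled` is not specified. The row layout and function names are also hypothetical.

```python
def locus_id(hm_chr, hm_pos):
    """Canonical locus key built from harmonized coordinates."""
    return f"{hm_chr}:{hm_pos}"


def scale_weights(weights):
    """Max-abs scaling: divide by the largest absolute weight (assumed scheme)."""
    peak = max((abs(w) for w in weights), default=0.0)
    if peak == 0.0:
        return [0.0 for _ in weights]
    return [w / peak for w in weights]


def normalize_score(rows):
    """Map one score's rows to {locus_id: weight_scaled}."""
    scaled = scale_weights([float(r["effect_weight"]) for r in rows])
    return {locus_id(r["hm_chr"], r["hm_pos"]): s for r, s in zip(rows, scaled)}


# Made-up rows for illustration.
toy_rows = [
    {"hm_chr": "1", "hm_pos": 1000, "effect_weight": "0.5"},
    {"hm_chr": "2", "hm_pos": 2000, "effect_weight": "-0.25"},
]
profile = normalize_score(toy_rows)
```

Max-abs scaling keeps the sign of each effect and bounds values to [-1, 1]. Other choices, such as z-scoring, would change the geometry of any downstream embedding.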
Objective:
Learn compact embeddings that capture how each PGS distributes genetic risk across loci.
Input:
- Sparse matrix: PGS × loci
- Values = normalized effect sizes
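As an illustration of this input, the sketch below assembles per-score `{locus: weight}` profiles into a PGS × loci matrix. It builds a dense list of lists for readability. The function name and toy profiles are assumptions, and at realistic scale a sparse format (for example from `scipy.sparse`) would likely be preferable.

```python
def build_matrix(profiles):
    """Return (pgs_ids, loci, rows); rows[i][j] is the weight of locus j in PGS i."""
    pgs_ids = sorted(profiles)
    loci = sorted({locus for p in profiles.values() for locus in p})
    # Loci absent from a score get 0.0, which is what makes the matrix sparse.
    rows = [[profiles[pid].get(locus, 0.0) for locus in loci] for pid in pgs_ids]
    return pgs_ids, loci, rows


# Made-up normalized profiles for two hypothetical scores.
toy_profiles = {
    "PGS_A": {"1:1000": 1.0, "2:2000": -0.5},
    "PGS_B": {"2:2000": 0.8},
}
pgs_ids, loci, matrix = build_matrix(toy_profiles)
```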
Architecture (PGS Autoencoder):
- Encoder:
- Dense projection (sparse → latent)
- Nonlinear activation
- Decoder:
- Reconstructs original PGS risk profile
- Loss:
- Masked reconstruction loss (emphasizes non-zero effects)
- Evaluation:
- Train/test split
- Cosine similarity in input space
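The loss and the evaluation metric can be sketched without a deep learning framework. This is one possible reading of "masked reconstruction loss", not the project's implementation: non-zero targets get full weight and zeros are down-weighted by a hypothetical `zero_weight` parameter. The cosine similarity compares an original profile with its reconstruction in input space.

```python
import math


def masked_mse(target, recon, zero_weight=0.1):
    """Weighted squared error: non-zero targets count fully, zeros are down-weighted."""
    total, weight_sum = 0.0, 0.0
    for t, r in zip(target, recon):
        w = 1.0 if t != 0.0 else zero_weight
        total += w * (t - r) ** 2
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Made-up PGS profile and a hypothetical reconstruction of it.
target = [1.0, 0.0, -0.5]
recon = [0.8, 0.2, -0.5]
loss = masked_mse(target, recon)
similarity = cosine_similarity(target, recon)
```

Setting `zero_weight=0.0` ignores zero entries entirely, while `zero_weight=1.0` recovers the ordinary mean squared error.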
Output:
- One embedding per PGS representing:
- ancestry-specific genetic architecture
- disease-related signal
- similarity to other cohorts
- PCA / UMAP plots
PGS embeddings cluster by ancestry even for the same disease, highlighting population-specific genetic architectures.
- Cosine distance heatmaps
PGSs derived from similar ancestries show higher similarity in genetic risk profiles.
Key Idea:
Transpose the problem and treat each variant as a vector across PGSs.
Variant = [effect in PGS₁, effect in PGS₂, …]
Objective:
Learn embeddings that capture how loci behave across ancestries and studies.
Architecture (Variant Autoencoder):
- Input:
- Variant × PGS effect vectors
- Encoder:
- Low-dimensional latent space (compact locus representation)
- Decoder:
- Reconstructs variant effect profile
- Filtering:
- Optional minimum PGS support (focus on shared variants)
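The transposition and the optional support filter might look like the sketch below. It assumes per-score `{locus: weight}` profiles as input and counts "support" as the number of PGSs with a non-zero effect at a locus. Both the function name and that definition of support are assumptions.

```python
def variant_vectors(profiles, min_support=1):
    """Transpose {pgs: {locus: weight}} into {locus: [weight per PGS]}.

    Loci with a non-zero effect in fewer than min_support scores are dropped.
    """
    pgs_ids = sorted(profiles)
    loci = {locus for p in profiles.values() for locus in p}
    vectors = {}
    for locus in sorted(loci):
        vec = [profiles[pid].get(locus, 0.0) for pid in pgs_ids]
        support = sum(1 for v in vec if v != 0.0)
        if support >= min_support:
            vectors[locus] = vec
    return pgs_ids, vectors


# Made-up profiles for three hypothetical scores.
toy_profiles = {
    "PGS_A": {"1:1000": 1.0, "2:2000": -0.5},
    "PGS_B": {"2:2000": 0.8},
    "PGS_C": {"2:2000": 0.3, "3:3000": 0.9},
}
ids, shared_only = variant_vectors(toy_profiles, min_support=2)
_, all_variants = variant_vectors(toy_profiles)
```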
Output:
- One embedding per variant encoding:
- consistency vs heterogeneity across ancestries
- shared vs ancestry-specific behavior
- Variant PCA / UMAP
- Sharedness stratification (shared_2, shared_3, …)
Widely shared variants show structured dispersion, suggesting consistent but nuanced cross-population effects.
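A sharedness label can be derived directly from the variant effect vectors. The sketch below assumes that `shared_k` means "non-zero effect in exactly k PGSs", which is an interpretation of the labels above rather than a documented definition. The function name and toy vectors are hypothetical.

```python
from collections import Counter


def sharedness_labels(vectors):
    """Label each locus 'shared_k', k being the number of PGSs with a non-zero effect."""
    return {
        locus: f"shared_{sum(1 for v in vec if v != 0.0)}"
        for locus, vec in vectors.items()
    }


# Made-up variant effect vectors across three hypothetical scores.
toy_vectors = {
    "1:1000": [1.0, 0.0, 0.0],
    "2:2000": [-0.5, 0.8, 0.3],
    "3:3000": [0.0, 0.2, 0.9],
}
labels = sharedness_labels(toy_vectors)
strata = Counter(labels.values())
```

These labels can then be used to color points in a variant PCA or UMAP plot.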
- Each PGS = client
- Each trait can have a different model (AE)
- Each variant = model parameter
- Shared variants behave like global parameters
- Ancestry-specific variants act as client-private signals
This framework enables privacy-preserving, interpretable analysis of genetic architecture across populations without pooling individual-level data.
Future work should scale the framework beyond a small set of T2D scores to broader disease areas and a larger, more diverse set of ancestries and traits, enabling more robust conclusions about cross-phenotype and cross-population sharing. Methodologically, harmonization can be strengthened by incorporating stricter allele alignment and quality controls, LD-aware locus grouping when appropriate, and expanded functional annotations (e.g., regulatory elements, eQTL links, pathway context) to improve the interpretability of variant embeddings. A key next step is to connect "structural similarity" to "predictive behavior" by benchmarking whether embedding proximity and overlap metrics correlate with external performance measures, calibration, and generalizability, including evaluation of ensemble and aggregation strategies informed by these representations. From a federated learning perspective, the client-like abstraction can be extended to institution-level or biobank-level deployments where sharing is restricted, using privacy-preserving summaries and standardized outputs to enable cross-site comparison without centralizing sensitive data. The locus viewer and profiling pipeline can evolve into a more automated, end-to-end toolkit with reproducible configuration, phenotype-centric organization, and a web-enabled interface that reduces manual file handling while preserving local execution options for restricted environments.
| Name | Email | ORCID | Institution |
|---|---|---|---|
| Ashok K. Sharma | ashoks773@gmail.com | https://orcid.org/0000-0002-2264-7628 | Cedars-Sinai Medical Center, LA |
| Dmitriy Ivkov | Divkov@umich.edu | https://orcid.org/0009-0008-4536-3274 | University of Michigan, Michigan |
| Jasmine Baker | jasmine.baker@bcm.edu | https://orcid.org/0000-0001-7545-6086 | Baylor College of Medicine, Houston |
| Mengying Hu | meh251@pitt.edu | https://orcid.org/0000-0003-4827-3051 | University of Pittsburgh, Pittsburgh |
| Qianqian Liang | qil57@pitt.edu | https://orcid.org/0000-0002-1737-5031 | Population Health Sciences, Geisinger, Danville, PA |
| Shivank Sadasivan | ssadasiv@andrew.cmu.edu | https://orcid.org/0009-0004-4699-2129 | Carnegie Mellon University, Pittsburgh |
