Skip to content

nhgritctran/PheTK

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1,164 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PheTK - The Phenotype Toolkit

The official repository of PheTK, a fast python library for Phenome Wide Association Studies (PheWAS) utilizing both phecode 1.2 and phecodeX 1.0.

Reference: Tam C Tran, David J Schlueter, Chenjie Zeng, Huan Mo, Robert J Carroll, Joshua C Denny, PheWAS analysis on large-scale biobank data with PheTK, Bioinformatics, Volume 41, Issue 1, January 2025, btae719, https://doi.org/10.1093/bioinformatics/btae719

Contact: PheTK@mail.nih.gov

🆕 WHAT'S NEW IN v0.2

Major updates in this release:

  • Cox regression support - Added survival analysis capabilities alongside logistic regression
  • dsub integration - Built-in support for distributed computing on Google Cloud Platform
  • Forest plot visualization - New main visualization option alongside Manhattan plots
  • PEP-compliant naming - Changed to lowercase package/module names (affects import syntax)
  • Expanded CLI support - Added command-line interfaces for cohort and phecode modules
  • Simplified CLI commands - Added entry points for easier CLI usage (e.g., phetk phewas instead of python3 -m phetk.phewas)
  • Enhanced user experience - Various improvements for clarity and usability

📋 View full changelog

Version 0.1.47 is the last stable version of version 0.1. Users can still continue to use this version, and the previous README file can be found here


QUICK LINKS


1. INSTALLATION

Using pip

The latest version (v0.2+) of PheTK can be installed using the pip install command in the terminal (note that the lowercase package name "phetk" starts from version 0.2+):

pip install phetk --upgrade

Users can also specify a version, e.g., for the last stable version of version 0.1 (note use "PheTK" instead of "phetk" for version 0.1):

pip install PheTK==0.1.47

To check current installed version:

pip show phetk | grep Version

Using Docker

Please refer to https://hub.docker.com/r/phetk/phetk/tags for the latest docker images.

docker pull phetk/phetk:latest

2. 1-MINUTE PHEWAS DEMO

User can run the quick 1-minute PheWAS demo with the following command in a terminal:

phetk demo

Or in Jupyter Notebook:

from phetk import demo

demo.run()

The example files (example_cohort.tsv, example_phecode_counts.tsv, and example_phewas_results.tsv) generated in this Demo should be in users' current working directory. New-to-PheWAS users could explore these files to get a sense of what data are used or generated in PheWAS with PheTK.

3. DESCRIPTIONS

PheTK is a fast python library for Phenome Wide Association Studies (PheWAS) utilizing both phecode 1.2 and phecodeX 1.0.

PheWAS workflow and PheTK modules Standard PheWAS workflow. Green italicized texts are PheTK module names. Black components are supported while gray ones are not supported by PheTK currently.

All of Us: the All of Us Research Program (https://allofus.nih.gov/)

4. USAGE

For detailed usage examples and documentation for each module, please refer to the individual module documentation:

  • Cohort module - Generate genetic cohorts and add covariates
  • Phecode module - Map ICD codes to phecodes and generate counts
  • PheWAS module - Run PheWAS analysis with logistic or Cox regression
  • Plot module - Generate Manhattan plots and other visualizations

5. SYSTEM REQUIREMENTS

PheTK was developed for efficient processing of large data while being resource-friendly. It was tested on different platforms from laptops to different cloud environments.

General Requirements

PheTK's resource requirements vary by usage context. The information in this section is tailored towards cloud computing platforms where large biobanks are often hosted.

  • All PheTK functions run on standard machines, except by_genotype() in the Cohort module which requires a Spark cluster (dataproc VM)
  • Both logistic regression and Cox regression scale with CPU counts for faster processing. See figure S2 below from PheTK publication for more information. In our experience, 4 CPU machines are the most cost-efficient, especially for large-scale analyses.
  • For an end-to-end pipeline, the system requirements should be based on the most demanding steps. For example, for the All of Us data v8, a VM with 16CPU 104GB RAM and 2 dataproc workers at default settings should work; if users only need to run PheWAS analysis, it can be run at a much lower configuration as shown in figure S2.

PheTK Performance Benchmarks Figure S2: Logistic regression performance benchmarks from PheTK publication showing scalability with different CPU configurations and cohort sizes.

PheWAS Module - Logistic Regression

  • Minimal resources required - Can run efficiently on lightweight configurations
  • Minimum tested configuration: GCP X-highcpu-4 (4 vCPUs, 8GB RAM, X=GCP machine type, e.g., c2d) or equivalent
  • Uses multithreading for parallel processing with lower memory overhead

PheWAS Module - Cox Regression

  • Slightly higher resources required - Uses multiprocessing which demands more memory
  • Minimum tested configuration: GCP X-standard-4 (4 vCPUs, 16GB RAM, X=GCP machine type, e.g., c2d) or equivalent
  • The additional memory accommodates the multiprocessing overhead for survival analysis

Phecode Module (ICD Code Mapping)

  • Memory requirements scale with cohort size - Large cohorts require higher memory configurations
  • Recommended: For All of Us database v8 with over 500k participants, phecode mapping could be done with a 16 vCPU 104GB RAM machine.

About

PheTK Official Repository

Resources

License

Stars

Watchers

Forks

Contributors

Languages