PheWAS Module

Run phenome-wide association studies to test associations between a variable of interest and multiple phenotypes.

Running PheWAS

Runs a logistic or Cox regression for each phecode, testing its association with the independent variable of interest while adjusting for covariates.

Key Parameters

  • phecode_version: Phecode version to use, "1.2" or "X" (str, required)
  • phecode_count_file_path: Path to phecode counts file (str, required)
  • cohort_file_path: Path to cohort file with covariates (str, required)
  • covariate_cols: List of covariate column names (list[str], required)
  • independent_variable_of_interest: Name of primary variable column (str, required)
  • sex_at_birth_col: Name of sex column with 0/1 values (str, required)
  • male_as_one: True if male=1, False if male=0 (bool, default: True)
  • icd_version: ICD version "US", "WHO", or "custom" (str, default: "US")
  • phecode_map_file_path: Path to custom phecode mapping file (str, optional)
  • phecode_to_process: Specific phecodes to analyze (list[str] or str, optional)
  • min_cases: Minimum cases required to test phecode (int, default: 50)
  • min_phecode_count: Minimum count to qualify as case (int, default: 2)
  • use_exclusion: Whether to use phecode exclusion ranges (bool, default: False)
  • method: "logit" for logistic or "cox" for Cox regression (str, default: "logit")
  • batch_size: Number of phecodes per processing batch (int, optional, default: 1 for logit, 10 for cox)
  • fall_back_to_serial: Fall back to serial processing if parallel fails (bool, default: False)
  • output_file_path: Output file path (str, optional)
  • verbose: Print progress for each phecode (bool, default: False)
  • suppress_warnings: Suppress convergence warnings (bool, default: True)
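
To make the interaction between min_phecode_count and min_cases concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the (person_id, phecode, count) row layout and the helper name select_testable_phecodes are assumptions made for illustration, and the comparisons are assumed to be inclusive (>=).

```python
from collections import defaultdict


def select_testable_phecodes(count_rows, min_phecode_count=2, min_cases=50):
    """Return {phecode: set of case ids} for phecodes with enough cases.

    count_rows is an iterable of (person_id, phecode, count) tuples.
    This layout is an assumption, not the actual phecode counts file format.
    """
    cases = defaultdict(set)
    for person_id, phecode, count in count_rows:
        # A participant qualifies as a case once the count reaches the threshold.
        if count >= min_phecode_count:
            cases[phecode].add(person_id)
    # Only phecodes with at least min_cases cases would be tested.
    return {p: ids for p, ids in cases.items() if len(ids) >= min_cases}


# Made-up rows, for illustration only.
rows = [
    (1, "250.2", 3),
    (2, "250.2", 1),  # below min_phecode_count, so not a case
    (3, "250.2", 2),
    (1, "401.1", 5),
]
testable = select_testable_phecodes(rows, min_phecode_count=2, min_cases=2)
print(sorted(testable))  # ['250.2']
```

With the thresholds lowered to 2 for the toy data, "250.2" has two cases and is kept, while "401.1" has one case and is skipped.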

Notebook Example

from phetk.phewas import PheWAS

# Run PheWAS analysis
phewas = PheWAS(
    phecode_version="X",
    phecode_count_file_path="phecode_counts.tsv",
    cohort_file_path="cohort.tsv",
    covariate_cols=["age", "sex", "pc1", "pc2", "pc3"],
    independent_variable_of_interest="genotype",
    sex_at_birth_col="sex",
    min_cases=50,
    min_phecode_count=2,
    output_file_path="phewas_results.tsv"
)
phewas.run()

CLI Example

phetk phewas \
  --phecode_version "X" \
  --cohort_file_path "cohort.tsv" \
  --phecode_count_file_path "phecode_counts.tsv" \
  --sex_at_birth_col "sex" \
  --covariate_cols age sex pc1 pc2 pc3 \
  --independent_variable_of_interest "genotype" \
  --min_cases 50 \
  --min_phecode_count 2 \
  --output_file_path "phewas_results.tsv"

Cox Regression Parameters

Additional parameters for Cox proportional hazards regression:

  • cox_start_date_col: Column with start dates (str, optional). Participants whose phenotype was already present before this date are excluded from the cases of that phecode.
  • cox_control_observed_time_col: Column with censoring time for controls (str, required for Cox)
  • cox_phecode_observed_time_col: Column with time to event for cases (str, required for Cox)
  • cox_stratification_col: Column for stratification (str, optional)
  • cox_fallback_step_size: Step size for convergence issues (float, optional, default: 0.1)
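
The two observed-time columns above play different roles for cases and controls. The sketch below shows one way the (duration, event) pair used by a Cox model could be assembled from them. The helper build_survival_record and the dictionary rows are hypothetical, the column names are borrowed from the example that follows, and this is not the library's own code.

```python
def build_survival_record(row, is_case,
                          control_time_col="follow_up_time",
                          phecode_time_col="time_to_event"):
    """Return (duration, event) for one participant.

    Cases contribute their time to event with event=1.
    Controls contribute their censoring time with event=0.
    """
    if is_case:
        return float(row[phecode_time_col]), 1
    return float(row[control_time_col]), 0


# Made-up cohort rows, for illustration only.
cohort = [
    {"person_id": 1, "follow_up_time": 10.0, "time_to_event": 4.5},
    {"person_id": 2, "follow_up_time": 8.0, "time_to_event": None},
]
case_ids = {1}
records = [build_survival_record(r, r["person_id"] in case_ids) for r in cohort]
print(records)  # [(4.5, 1), (8.0, 0)]
```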

Notebook Example

phewas = PheWAS(
    phecode_version="X",
    phecode_count_file_path="phecode_counts.tsv",
    cohort_file_path="cohort.tsv",
    covariate_cols=["age", "sex", "pc1"],
    independent_variable_of_interest="exposure",
    sex_at_birth_col="sex",
    method="cox",
    cox_start_date_col="start_date",
    cox_control_observed_time_col="follow_up_time",
    cox_phecode_observed_time_col="time_to_event"
)
phewas.run()

CLI Example

phetk phewas \
  --phecode_version "X" \
  --cohort_file_path "cohort.tsv" \
  --phecode_count_file_path "phecode_counts.tsv" \
  --sex_at_birth_col "sex" \
  --covariate_cols age sex pc1 \
  --independent_variable_of_interest "exposure" \
  --min_cases 50 \
  --min_phecode_count 2 \
  --method "cox" \
  --cox_start_date_col "start_date" \
  --cox_control_observed_time_col "follow_up_time" \
  --cox_phecode_observed_time_col "time_to_event" \
  --output_file_path "cox_results.tsv"

Running PheWAS with dsub

Execute PheWAS analysis using Google Cloud dsub for distributed computing on cloud infrastructure.

NOTE: For Cox regression, use a standard or highmem machine. For logistic regression, any machine type works. For example, machine_type="c2d-highmem-4" for Cox regression and machine_type="c2d-highcpu-4" for logistic regression.

See dsub-considerations.md for detailed setup, parameter guidance, and useful utilities.

Key Parameters

  • docker_image: Docker image containing PheWAS dependencies (str, required)
  • job_script_name: Name of bash script to execute (str, default: "phewas_script.sh")
  • job_name: Custom name for dsub job (str, optional)
  • input_dict: Mapping of input variables to cloud storage paths (dict, optional)
  • output_dict: Mapping of output variables to cloud storage paths (dict, optional)
  • env_dict: Environment variables to set in job (dict, optional)
  • machine_type: Google Cloud machine type (str, default: "c2d-highcpu-4")
  • boot_disk_size: Size of boot disk in GB (int, default: 50)
  • disk_size: Size of additional disk in GB (int, default: 256)
  • region: Google Cloud region for execution (str, default: "us-central1")
  • provider: Cloud provider backend (str, default: "google-batch")
  • preemptible: Whether to use preemptible instances (bool, default: False)
  • use_private_address: Whether to use private IP addresses (bool, default: True)

Notebook Example

from phetk.phewas import PheWAS

# Create PheWAS instance
phewas = PheWAS(
    phecode_version="X",
    phecode_count_file_path="gs://your-bucket/phecode_counts.tsv",
    cohort_file_path="gs://your-bucket/cohort.tsv",
    covariate_cols=["age", "sex", "pc1", "pc2", "pc3"],
    independent_variable_of_interest="genotype",
    sex_at_birth_col="sex",
    min_cases=50,
    min_phecode_count=2,
    method="logit",
    output_file_path="gs://your-bucket/phewas_results.tsv"
)

# Run with dsub
phewas.run_dsub(
    docker_image="phetk/phetk:latest",
    job_name="my-phewas-job",
    machine_type="c2d-standard-4",
    region="us-central1",
    preemptible=True
)

Advanced Options

Process Specific Phecodes

# Single phecode
phecode_to_process="185"

# Multiple phecodes
phecode_to_process=["185", "250.2", "401.1"]

Phecode Exclusion (1.2 only)

use_exclusion=True  # Apply phecode exclusion ranges

Custom Phecode Mapping

icd_version="custom"
phecode_map_file_path="path/to/custom_mapping.tsv"

Parallel Processing

batch_size=10  # Process 10 phecodes per batch
fall_back_to_serial=True  # Use serial if parallel fails
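
As a rough illustration of what batching and a serial fallback mean, the sketch below splits a phecode list into batches and retries serially if the parallel run raises. The helpers make_batches and run_batches are hypothetical, a thread pool stands in for whatever parallel backend the library actually uses, and the worker is a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(items, batch_size):
    """Split items into consecutive lists of at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def run_batches(batches, worker, fall_back_to_serial=False):
    """Run worker on each batch in parallel, optionally retrying serially."""
    try:
        with ThreadPoolExecutor() as pool:
            return list(pool.map(worker, batches))
    except Exception:
        if not fall_back_to_serial:
            raise
        # Serial fallback: process one batch at a time in this thread.
        return [worker(batch) for batch in batches]


phecodes = ["185", "250.2", "401.1", "495", "530.1"]
batches = make_batches(phecodes, 2)
print(batches)  # [['185', '250.2'], ['401.1', '495'], ['530.1']]

# Placeholder worker: counts the phecodes in a batch instead of fitting models.
print(run_batches(batches, len, fall_back_to_serial=True))  # [2, 2, 1]
```

Larger batches mean fewer task hand-offs but coarser load balancing, which may be why the defaults differ between logit and cox.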

Get Phecode Data

Retrieve cohort data for a specific phecode after running PheWAS:

# Get data for phecode "185"
phecode_data = phewas.get_phecode_data("185")

Returns a dataframe with the original cohort data plus an is_phecode_case column indicating case/control status.

Output Format

Results file contains:

  • phecode: Phecode tested
  • phecode_string: Phecode description
  • beta/hazard_ratio: Effect estimate (beta for logistic regression, hazard_ratio for Cox regression)
  • SE: Standard error
  • p_value: P-value from regression
  • n_cases: Number of cases
  • n_controls: Number of controls
  • converged: Whether regression converged
  • phecode_sex: Sex restriction if applicable
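
A common next step with these columns is a multiple-testing correction. The sketch below applies a Bonferroni threshold to the p_value column using only the standard library. It assumes a tab-separated file and that converged is written as the text True or False, neither of which is confirmed here, and excluding non-converged rows from the test count is one possible choice rather than the library's rule. The sample rows are made up.

```python
import csv
import io


def bonferroni_hits(tsv_text, alpha=0.05):
    """Return (threshold, phecodes passing it) among converged tests."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    # Assumption: the converged flag is serialized as "True" / "False".
    tested = [r for r in rows if r["converged"] == "True"]
    threshold = alpha / len(tested) if tested else 0.0
    hits = [r["phecode"] for r in tested if float(r["p_value"]) < threshold]
    return threshold, hits


# Made-up results, for illustration only.
sample = (
    "phecode\tp_value\tconverged\n"
    "185\t1e-6\tTrue\n"
    "250.2\t0.03\tTrue\n"
    "401.1\t1e-9\tFalse\n"
)
threshold, hits = bonferroni_hits(sample)
print(threshold, hits)  # 0.025 ['185']
```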

Important Notes

  • Sex column must contain 0/1 values only
  • Include sex in covariate_cols if using as covariate
  • Non-converged results are kept and flagged in converged column
  • Phecodes that do not meet the minimum case/control requirements (see min_cases) are not tested, which reduces spurious associations from small samples
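
Because the sex column must contain only 0/1 values, it can help to check the cohort file before starting a run. The sketch below is a standalone pre-check, not part of the library: the helper find_invalid_sex_values is hypothetical, the file is assumed to be tab-separated, and the sample rows are made up.

```python
import csv
import io


def find_invalid_sex_values(tsv_text, sex_col="sex"):
    """Return the sorted distinct values in sex_col that are not 0 or 1."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return sorted({row[sex_col] for row in reader if row[sex_col] not in {"0", "1"}})


# Made-up cohort rows: one coded as a letter, one missing.
sample = "person_id\tsex\tage\n1\t0\t54\n2\t1\t61\n3\tF\t47\n4\t\t39\n"
print(find_invalid_sex_values(sample))  # ['', 'F']
```

An empty list means every row is already coded as 0 or 1. Anything else needs recoding or removal before the cohort file is passed in.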