All of Us Long Read Phase 1 Workflows
This repository contains all of the reproducible WDL workflows used in Phase 1 of the All of Us Long Read (AoU-LR) project. These workflows cover read and assembly processing, small and structural variant calling, joint calling, phasing and imputation, short-read genotyping, and quality control, and are provided for transparency and reuse.
Please note: this code's organization is in flux.
Dockstore Workflows (Organized by Functionality)
1. Read and Assembly Processing
| Workflow | Location | Description |
| --- | --- | --- |
| HiFiBamToFastQ | Dockstore | Converts PacBio HiFi BAM files to FASTQ format for downstream analysis. |
| MergeFastqs | Dockstore | Merges multiple FASTQ files into a single set per sample. |
| PBAssembleWithHifiasm | Dockstore | Assembles PacBio HiFi reads using Hifiasm. |
| MapAssemblyContigs | Dockstore | Aligns assembly contigs to a reference to validate assemblies or identify SVs. |
| EvaluateAssemblyHap | Dockstore | Assesses haplotype-resolved assemblies against reference sequences. |
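As a rough illustration of what merging FASTQ files per sample can involve, the Python sketch below concatenates gzipped FASTQ files byte for byte, which works because the gzip format permits several compressed members in one file. This is not taken from the MergeFastqs workflow; the function names `merge_gzipped_fastqs` and `count_fastq_records` are hypothetical, and the record count assumes the common four-lines-per-record FASTQ layout.

```python
import gzip
import shutil
from pathlib import Path
from typing import Iterable


def merge_gzipped_fastqs(inputs: Iterable[Path], output: Path) -> None:
    """Concatenate gzipped FASTQ files into one gzipped file.

    Hypothetical helper for illustration. The inputs are copied as raw
    bytes; no decompression is needed because a gzip file may contain
    multiple compressed members.
    """
    with open(output, "wb") as out_handle:
        for path in inputs:
            with open(path, "rb") as in_handle:
                shutil.copyfileobj(in_handle, out_handle)


def count_fastq_records(path: Path) -> int:
    """Count records in a gzipped FASTQ, assuming four lines per record."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4
```

Comparing `count_fastq_records` on the merged file against the sum over the inputs is a cheap sanity check that no reads were lost.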
2. Small Variant Calling & Summaries
| Workflow | Location | Description |
| --- | --- | --- |
| 1074.T2T.SmallVariantsBasicMetrics | Dockstore | Computes basic metrics for small variants on the T2T reference. |
| PBCCSWholeGenome | Dockstore | Calls small variants genome-wide from PacBio CCS reads. |
| SummarizeDVPSmallVariants | Dockstore | Summarizes small variants called by the DeepVariant+PEPPER pipeline. |
| SummarizePAVSmallVariants | Dockstore | Summarizes small variants called by PAV. |
3. Structural Variant (SV) Discovery & Integration
| Workflow | Location | Description |
| --- | --- | --- |
| PAV | Dockstore | Calls variants from phased assemblies with PAV, including large insertions and deletions. |
| PAV2SVs | Dockstore | Converts PAV results to standard SV calls. |
| LRMergeSVVCFs | Dockstore | Merges multiple SV callsets into one. |
| TruvariCollapse | Dockstore | Collapses duplicate/equivalent SVs into consensus calls. |
| TruvariIntersample | Dockstore | Compares SVs between samples. |
| TruvariIntrasample | Dockstore | Compares SVs within a single sample. |
| SummarizePAVSVs | Dockstore | Summarizes PAV structural variant calls. |
| SummarizeSnifflesSVs | Dockstore | Summarizes SVs called by Sniffles. |
| GraphEvaluation | Dockstore | Builds and evaluates SV overlap graphs. |
4. Joint Calling & Cohort Integration
| Workflow | Location | Description |
| --- | --- | --- |
| JointCalling | Dockstore | Jointly genotypes SVs across a cohort. |
| LRJointCallGVCFs | Dockstore | Jointly genotypes GVCFs into a cohort-wide VCF. |
| MergeVCFs | Dockstore | Merges multiple VCFs into one. |
| MergePhasedVCF | Dockstore | Merges phased VCFs into one. |
| MergeRegenotypedIntersampleVcf | Dockstore | Merges per-sample re-genotyped VCFs into a cohort VCF. |
| MergeSVsSNPs | Dockstore | Combines SVs with SNVs/indels into one file. |
| OverlapGraph | Dockstore | Builds an overlap graph across callsets. |
| OverlapStats | Dockstore | Computes overlap statistics between callsets. |
5. Long-Read Phasing and Imputation
| Workflow | Location | Description |
| --- | --- | --- |
| PhysicalPhasing | Dockstore | Physically phases SNVs/indels and SVs in a single sample with HiPhase. |
| ChromosomePhasedPanelCreationFromHiPhase | Dockstore | Per chromosome, performs statistical phasing and imputation of SNVs/indels and SVs in a cohort with SHAPEIT4, removes colliding variants, and creates a pangenome bubble-graph reference panel. |
| ConcatAndEvaluate | Dockstore | Concatenates per-chromosome pangenome bubble-graph reference panels and runs leave-out and Vcfdist evaluations. |
6. Short-Read Genotyping, Phasing, and Imputation
| Workflow | Location | Description |
| --- | --- | --- |
| KAGEPanelWithPreprocessing | Dockstore | Per chromosome, creates a k-mer index and count model for KAGE genotyping from a reference panel. |
| KAGECasePerChromosomeFlexscattered | Dockstore | Genotypes a single sample against a reference panel with KAGE. |
| GLIMPSEBatchedCasePerChromosomeSingleBatch | Dockstore | Performs phasing and imputation of a batch of genotyped samples against a reference panel with GLIMPSE. |
| HierarchicallyMergeVcfs | Dockstore | Hierarchically merges cohort VCFs using either bcftools or ivcfmerge. |
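To make the idea of a hierarchical merge concrete, the Python sketch below groups inputs into batches, merges each batch, and repeats on the results until a single output remains. It is an illustration of the general pattern only, not the HierarchicallyMergeVcfs implementation: the names `hierarchical_merge`, `batch_size`, and `merge_fn` are hypothetical, and `merge_fn` stands in for whatever would invoke bcftools or ivcfmerge on one batch.

```python
from typing import Callable, List, Sequence


def hierarchical_merge(
    items: Sequence[str],
    batch_size: int,
    merge_fn: Callable[[List[str]], str],
) -> str:
    """Merge items level by level, at most batch_size inputs per merge.

    Hypothetical helper for illustration only; merge_fn is a placeholder
    for the tool that merges one batch of VCFs.
    """
    if not items:
        raise ValueError("nothing to merge")
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")

    level = list(items)
    while len(level) > 1:
        next_level = []
        for start in range(0, len(level), batch_size):
            batch = level[start:start + batch_size]
            # A leftover single item is carried forward without merging.
            next_level.append(merge_fn(batch) if len(batch) > 1 else batch[0])
        level = next_level
    return level[0]


def describe_merge(batch: List[str]) -> str:
    """Toy merge function that records which inputs were combined."""
    return "(" + "+".join(batch) + ")"


merged = hierarchical_merge(
    ["a.vcf", "b.vcf", "c.vcf", "d.vcf", "e.vcf"], 2, describe_merge
)
print(merged)  # (((a.vcf+b.vcf)+(c.vcf+d.vcf))+e.vcf)
```

Each merge in this pattern handles at most `batch_size` inputs, and the merges within one level are independent of each other, so they could be run in parallel.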
7. Quality Control & Fingerprinting
| Workflow | Location | Description |
| --- | --- | --- |
| CollectSingleSampleSVvcfMetrics | Dockstore | Computes SV metrics per sample. |
| LongReadsContaminationEstimation | Dockstore | Estimates contamination in long-read data. |
| BuildTempLocalFpStore | Dockstore | Builds a temporary fingerprint store for identity checks. |
| VerifyFingerprintCCSSample | Dockstore | Verifies CCS sample identity by fingerprinting. |
| SexCheck | Dockstore | Compares reported sex with genetically inferred sex. |
| MainVcfQc | Dockstore | Runs quality control checks on final VCFs. |
This repository also contains Jupyter notebooks for data analysis and visualization, organized by platform:
Terra Notebooks (notebooks/terra/)
These notebooks are designed to run on the Terra cloud platform and focus on data processing, analysis, and quality control:
Data Import and Processing
| Notebook | Link | Description |
| --- | --- | --- |
| main_init_subset_vds.ipynb | GitHub | Initialize and subset Variant Dataset (VDS) for analysis |

| Notebook | Link | Description |
| --- | --- | --- |
| kvg_examine_assemblies.ipynb | GitHub | Examine and analyze genome assemblies |
| kvg_study_read_length_dists.ipynb | GitHub | Study read length distributions from sequencing data |

| Notebook | Link | Description |
| --- | --- | --- |
| kvg_examine_small_variants.ipynb | GitHub | Analyze small variants (SNPs, indels) |
| kvg_examine_structural_variants.ipynb | GitHub | Examine structural variants (SVs) |
| kvg_sv_callset_inventory.ipynb | GitHub | Inventory and catalog structural variant callsets |
| kvg_describe_hail_matrix_tables.ipynb | GitHub | Describe Hail matrix tables for genomic data |
Population Genetics and Statistics
| Notebook | Link | Description |
| --- | --- | --- |
| kvg_pca.ipynb | GitHub | Principal component analysis (PCA) for population structure |
| kvg_pca_hgdp_tgp.ipynb | GitHub | PCA incorporating HGDP and TGP reference populations |
| kvg_recompute_relatedness.ipynb | GitHub | Recompute relatedness estimates between samples |
| kvg_compute_sfs_grch38.ipynb | GitHub | Compute the site frequency spectrum (SFS) on the GRCh38 reference |
| kvg_firth_logistic_regression.ipynb | GitHub | Firth logistic regression analysis |
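For readers unfamiliar with the site frequency spectrum listed above: it tallies how many variant sites carry each alternate allele count in the sample. The Python sketch below shows the basic computation for biallelic sites. It is an assumption-laden illustration and is not taken from `kvg_compute_sfs_grch38.ipynb`; the function name `site_frequency_spectrum` and its list-of-counts input are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def site_frequency_spectrum(
    alt_counts: Iterable[int],
    n_chromosomes: int,
    folded: bool = False,
) -> List[int]:
    """Return sfs, where sfs[k] is the number of sites with allele count k.

    alt_counts holds one alternate allele count per biallelic site and
    n_chromosomes is the number of sampled chromosomes (2 x diploid samples).
    With folded=True, the minor allele count is used instead.
    """
    tally = Counter()
    for ac in alt_counts:
        if not 0 <= ac <= n_chromosomes:
            raise ValueError(f"allele count {ac} outside 0..{n_chromosomes}")
        if folded:
            ac = min(ac, n_chromosomes - ac)
        tally[ac] += 1

    size = n_chromosomes // 2 + 1 if folded else n_chromosomes + 1
    return [tally[k] for k in range(size)]


# Six sites observed in three diploid samples (six chromosomes).
example_counts = [1, 1, 2, 5, 1, 3]
unfolded = site_frequency_spectrum(example_counts, 6)
folded = site_frequency_spectrum(example_counts, 6, folded=True)
print(unfolded)  # [0, 3, 1, 1, 0, 1, 0]
print(folded)    # [0, 4, 1, 1]
```

The folded form is the usual choice when the ancestral allele is unknown, since it does not depend on which allele is labeled as the alternate.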
Phasing and Panel Analysis
| Notebook | Link | Description |
| --- | --- | --- |
| hangsu_hiphase_results.ipynb | GitHub | Analysis of HiPhase phasing results |

| Notebook | Link | Description |
| --- | --- | --- |
| ym_callset_QC_py.ipynb | GitHub | Python-based callset quality control |
| ym_callset_QC_R.ipynb | GitHub | R-based callset quality control |
Manuscript Figures and Tables
| Notebook | Link | Description |
| --- | --- | --- |
| main_figure_01_pca.ipynb | GitHub | Generate PCA figure for main manuscript |
| main_table_02_sv_summary.ipynb | GitHub | Generate structural variant summary table |
| main_table_02_variant_inventory.ipynb | GitHub | Generate variant inventory table |
Researcher Workbench Notebooks (notebooks/rw/)
These notebooks are designed to run in the All of Us Researcher Workbench and focus on manuscript figures, tables, and specialized analyses:
| Notebook | Link | Description |
| --- | --- | --- |
| main_figure_01_length_distributions.ipynb | GitHub | Generate read and contig length distribution figures |
| main_figure_01_map.ipynb | GitHub | Generate map figure for manuscript |
| main_figure_01_omop.ipynb | GitHub | Generate OMOP-related figure |
| main_figure_01_pca.ipynb | GitHub | Generate PCA figure for main manuscript |
| supp_figure_01_assembly.ipynb | GitHub | Generate supplementary assembly figure |

| Notebook | Link | Description |
| --- | --- | --- |
| main_table_01_dataset_summary.ipynb | GitHub | Generate dataset summary table |
| main_table_02_short_read_svs.ipynb | GitHub | Generate short-read structural variant table |
| main_table_02_variant_inventory.ipynb | GitHub | Generate variant inventory table |

| Notebook | Link | Description |
| --- | --- | --- |
| init_subset_vds.ipynb | GitHub | Initialize and subset Variant Dataset |
| JW_CYP2D6.ipynb | GitHub | CYP2D6 gene analysis |
| JW_repeat_expansion_figures.ipynb | GitHub | Repeat expansion analysis and figures |
| kvg_firth_logistic_regression.ipynb | GitHub | Firth logistic regression analysis |
| kvg_pmi_skip_participants.ipynb | GitHub | PMI participant filtering analysis |
| LR_SV_disease_associations.ipynb | GitHub | SV-disease association analysis in 1,027 All of Us Phase 1 samples |