1- # TOMM40_WGS: Genotyping TOMM40 '523 Poly-T Polymorphisms from Whole-Genome Sequencing
1+ # TOMM40_WGS: Genotyping TOMM40'523 Poly-T Polymorphisms from Whole-Genome Sequencing
22
3- TOMM40_WGS is a Snakemake pipeline that enables automated genotyping of the TOMM40 '523 poly-T repeat from WGS data. It integrates STR genotyping tools, k-mer analysis, and machine learning to provide accurate repeat length and genotype predictions.
3+ ` TOMM40_WGS ` is a Snakemake pipeline that enables automated genotyping of the TOMM40'523 poly-T repeat from WGS data. It integrates STR genotyping tools, k-mer analysis, and machine learning to provide accurate repeat length and genotype predictions.
44
55---
66
77## Background
88
9- The TOMM40 '523 poly-T repeat polymorphism (rs10524523) is a variable-length poly-T repeat located in intron 6 of the TOMM40 gene on chromosome 19. This polymorphism has been associated with age of onset of Alzheimer's disease (AD) and cognitive decline.
9+ The TOMM40'523 poly-T repeat polymorphism (` rs10524523 ` ) is a variable-length poly-T repeat located in intron 6 of the * TOMM40* gene on chromosome 19. This polymorphism has been associated with age of onset of Alzheimer's disease (AD) and cognitive decline.
1010
1111The repeat length is typically categorized into three classes:
12- - ** Short (S)** : ≤19 T residues
13- - ** Long (L)** : 20-29 T residues
14- - ** Very Long (VL)** : ≥30 T residues
12+
13+ * ** Short (S)** : ≤19 T residues
14+ * ** Long (L)** : 20-29 T residues
15+ * ** Very Long (VL)** : ≥30 T residues
16+
17+ ---
1518
1619## 🚀 Quick Start
1720
21+ ### Software Dependencies
22+
23+ - ** Conda** : install via [ Miniforge] ( https://conda-forge.org/download/ ) (recommended)
24+
25+ ### Create a conda environment
26+ ``` bash
27+ conda config --set channel_priority strict
28+ conda create -y -n TOMM40_WGS_env -c conda-forge -c bioconda \
29+ snakemake==9.3.0 samtools==1.21 pandas==2.2.3
30+ conda activate TOMM40_WGS_env
31+ ```
32+
1833### Clone the repository and install dependencies
1934``` bash
2035git clone https://github.com/RushAlz/TOMM40_WGS.git
2136cd TOMM40_WGS
2237chmod u+x TOMM40_WGS
2338
24- # Pre-install dependencies (make sure to have snakemake and conda installed first)
25- ./TOMM40_WGS --conda-create-envs-only
39+ # Get help
40+ ./TOMM40_WGS --help
2641```
2742
2843### (soon) Alternatively: Install with bioconda
2944``` bash
30- conda install -c bioconda tomm40_wgs
45+ conda install -c conda-forge -c bioconda tomm40_wgs
3146```
3247
3348Use the ` TOMM40_WGS ` launcher script to run the pipeline with minimal setup.
3449
35- ### Software Dependencies
36-
37- - ** R** : 4.0.0 or higher
38- - ** Conda**
50+ ---
3951
4052### Basic usage (recommended)
4153
4254``` bash
4355# Basic usage with a single BAM file (aligned to GRCh38 by default)
44- TOMM40_WGS --input_wgs sample.bam --ref_fasta GRCh38.fa --cores 8
56+ TOMM40_WGS --input_wgs sample.bam --ref_fasta GRCh38.fa --cores 2
4557
4658# Basic usage with multiple BAM files using glob pattern (quotes are required)
47- TOMM40_WGS --input_wgs " *.bam" --ref_fasta GRCh38.fa --cores 8
59+ TOMM40_WGS --input_wgs " *.bam" --ref_fasta GRCh38.fa --cores 2
4860
4961# Basic usage with CRAM files
50- TOMM40_WGS --input_wgs " *.cram" --ref_fasta GRCh38.fa --genome_build GRCh38 --cores 8
62+ TOMM40_WGS --input_wgs " *.cram" --ref_fasta GRCh38.fa --genome_build GRCh38 --cores 2
5163
5264# Using a sample table (TSV file with sample_id and path columns)
53- TOMM40_WGS --input_table samples.tsv --ref_fasta GRCh38.fa --cores 8
65+ TOMM40_WGS --input_table samples.tsv --ref_fasta GRCh38.fa --cores 2
5466
5567# With a custom configuration file
56- TOMM40_WGS --configfile custom_config.yaml --cores 8
68+ TOMM40_WGS --configfile custom_config.yaml --cores 2
69+ ```
5770
58- # To pre-create environments
59- TOMM40_WGS --conda-create-envs-only
71+ This script wraps Snakemake, applies default values when needed, and accepts any extra Snakemake flags (e.g., ` --dryrun ` ).
6072
61- # Get help
62- TOMM40_WGS --help
73+ ---
74+
75+ ### End-to-end testing with real data (data from 1000 Genomes Project)
76+
77+ [ Reference for the data] ( https://doi.org/10.1016/j.cell.2022.08.004 )
78+
79+ > Install Miniforge3
80+
81+ ``` bash
82+ # Skip if conda is already installed
83+ wget -O Miniforge3.sh " https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$( uname) -$( uname -m) .sh"
84+ bash Miniforge3.sh -b -p " ${HOME} /conda"
85+ source " ${HOME} /conda/etc/profile.d/conda.sh"
86+ ```
87+
88+ > Create TOMM40_WGS_env
89+
90+ ``` bash
91+ # Skip if already created
92+ conda config --set channel_priority strict
93+ conda create -y -n TOMM40_WGS_env -c conda-forge -c bioconda \
94+ snakemake==9.3.0 samtools==1.21 pandas==2.2.3
95+ conda activate TOMM40_WGS_env
6396```
6497
65- This script wraps Snakemake, applies default values when needed, and accepts any extra Snakemake flags (e.g., ` --dryrun ` , ` --printshellcmds ` , ` --timestamp ` ).
98+ > Clone the repository and create a testing directory
6699
67- ### Testing with real data (data from 1000 Genomes Project)
68100``` bash
101+ git clone https://github.com/RushAlz/TOMM40_WGS.git
102+ chmod u+x TOMM40_WGS/TOMM40_WGS
103+
69104# Make a temporary working directory
70105mkdir -p tomm40_wgs_test
71106cd tomm40_wgs_test
107+ ```
108+
109+ > Test TOMM40_WGS with GRCh38 CRAMs
72110
111+ ``` bash
73112# Download the reference genome
74113REF_FASTA=" https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa"
75114REF_FASTA_LOCAL=$PWD /" GRCh38_full_analysis_set_plus_decoy_hla.fa"
76115wget -O ${REF_FASTA_LOCAL} ${REF_FASTA}
77116wget -O ${REF_FASTA_LOCAL} .fai ${REF_FASTA} .fai
78117
79118# Subset the CRAM file to the TOMM40 region
80- CRAMFILE_1=" https://ftp.sra.ebi.ac.uk/vol1/run/ERR324/ERR3240157/HG00157 .final.cram"
81- CRAM_1_LOCAL=$PWD /" HG00157.final .cram"
119+ CRAMFILE_1=" https://ftp.sra.ebi.ac.uk/vol1/run/ERR324/ERR3240114/HG00096 .final.cram"
120+ CRAM_1_LOCAL=$PWD /" HG00096.GRCh38 .cram"
82121samtools view -h -T GRCh38_full_analysis_set_plus_decoy_hla.fa -C ${CRAMFILE_1} " chr19:44399792-45399826" > ${CRAM_1_LOCAL}
83122samtools index ${CRAM_1_LOCAL}
84123
85124# Download a second CRAM file for testing
86- CRAMFILE_2=" https://ftp.sra.ebi.ac.uk/vol1/run/ERR324/ERR3240160/HG00171 .final.cram"
87- CRAM_2_LOCAL=$PWD /" HG00171.final .cram"
125+ CRAMFILE_2=" https://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239334/NA12878 .final.cram"
126+ CRAM_2_LOCAL=$PWD /" NA12878.GRCh38 .cram"
88127samtools view -h -T GRCh38_full_analysis_set_plus_decoy_hla.fa -C ${CRAMFILE_2} " chr19:44399792-45399826" > ${CRAM_2_LOCAL}
89128samtools index ${CRAM_2_LOCAL}
90129
91130# Run the pipeline specifying the input files as a glob (quotes are required)
92- TOMM40_WGS --input_wgs " *.cram" \
131+ ../TOMM40_WGS/TOMM40_WGS --input_wgs " *.cram" \
132+ --ref_fasta GRCh38_full_analysis_set_plus_decoy_hla.fa \
133+ --output_dir results_GRCh38 \
134+ --cores 2
135+
136+ # Alternatively, run the pipeline specifying the input files as a TSV file
137+ echo -e " sample_id\tpath\nHG00096\t${CRAM_1_LOCAL} \nNA12878\t${CRAM_2_LOCAL} " > samples.tsv
138+ ../TOMM40_WGS/TOMM40_WGS --input_table samples.tsv \
93139--ref_fasta GRCh38_full_analysis_set_plus_decoy_hla.fa \
94- --cores 8
140+ --output_dir results_GRCh38 \
141+ --cores 2
95142
96143# Check results
97- cat results/predictions/tomm40_predictions_summary.tsv
144+ cat results_GRCh38/predictions/tomm40_predictions_summary.tsv | column -t
145+ ```
146+
147+ > Test TOMM40_WGS with b37 BAMs
148+
149+ ``` bash
150+ # Download the reference genome
151+ REF_FASTA=" https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz"
152+ REF_FAI=" https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz.fai"
153+ wget ${REF_FASTA}
154+ wget ${REF_FAI}
155+ REF_FASTA_LOCAL=$PWD /" hs37d5.fa.gz"
156+
157+ # Subset the BAM file to the TOMM40 region
158+ BAMFILE_1=" https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/high_coverage_alignment/HG00096.wgs.ILLUMINA.bwa.GBR.high_cov_pcr_free.20140203.bam"
159+ BAM_1_LOCAL=$PWD /" HG00096.hg37.bam"
160+ samtools view -h -b ${BAMFILE_1} " 19:44903049-45903083" > ${BAM_1_LOCAL} .tmp
161+ samtools addreplacerg -r ' @RG\tID:HG00096\tSM:HG00096\tLB:lib1' -o ${BAM_1_LOCAL} ${BAM_1_LOCAL} .tmp
162+ samtools index ${BAM_1_LOCAL}
163+
164+ # Download a second CRAM file for testing
165+ BAMFILE_2=" https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA12878/high_coverage_alignment/NA12878.mapped.ILLUMINA.bwa.CEU.high_coverage_pcr_free.20130906.bam"
166+ BAM_2_LOCAL=$PWD /" NA12878.hg37.bam"
167+ samtools view -h -b ${BAMFILE_2} " 19:44903049-45903083" > ${BAM_2_LOCAL} .tmp
168+ samtools addreplacerg -r ' @RG\tID:NA12878\tSM:NA12878\tLB:lib1' -o ${BAM_2_LOCAL} ${BAM_2_LOCAL} .tmp
169+
170+ samtools index ${BAM_2_LOCAL}
171+
172+ # Run the pipeline specifying the input files as a glob (quotes are required)
173+ ../TOMM40_WGS/TOMM40_WGS --input_wgs " *.bam" \
174+ --ref_fasta hs37d5.fa.gz \
175+ --genome_build b37 \
176+ --output_dir results_b37 \
177+ --cores 2
98178
99- # Test inputs as TSV file
100- echo -e " sample_id\tpath\nHG00157\t${CRAM_1_LOCAL} \nHG00171\t${CRAM_2_LOCAL} " > samples.tsv
101- # Run the pipeline specifying the input files as a TSV file
102- TOMM40_WGS --input_table samples.tsv \
103- --ref_fasta GRCh38_full_analysis_set_plus_decoy_hla.fa \
104- --cores 8
105179
106180# Check results
107- cat results /predictions/tomm40_predictions_summary.tsv
181+ cat results_b37 /predictions/tomm40_predictions_summary.tsv | column -t
108182```
109183
110184---
@@ -130,7 +204,7 @@ The pipeline performs the following steps for each input sample:
130204Input data can be specified in two ways:
131205
1322061 . ** Using command line parameters** :
133- - ` --input_wgs ` : A single file, a glob pattern (must be quoted, e.g., ` "*.cram" ` ), or a space-separated list of files
207+ - ` --input_wgs ` : A single file, a glob pattern (must be quoted, e.g., ` "*.cram" ` )
134208 - ` --input_table ` : Path to a TSV file with ` sample_id ` and ` path ` columns
135209
1362102 . ** Using configuration file** :
@@ -147,9 +221,8 @@ sample1 /path/to/sample1.bam
147221sample2 /path/to/sample2.bam
148222```
149223
150- ** Important** currently only GRCh38 is supported. Future configuration options:
151-
152224- ` genome_build ` : Specify the genome build according to the reference genome used for alignment. The default is ` GRCh38 ` .
225+
153226Supported builds include:
154227
155228 - GRCh38
@@ -158,13 +231,15 @@ Supported builds include:
158231 - b37
159232
160233``` yaml
161- # TOMM40`523 region mapping to genome builds
162- " GRCh38 " : " chr19:44399792-45399826 "
163- " GRCh37 " : " chr19:44399780-45399837 "
164- " hg19 " : " chr19:44399780-45399837 "
165- " b37 " : " 19:44399780-45399837 "
234+ # TOMM40`523 (rs10524523) coordinates mapping to genome builds
235+ " GRCh38 " : " chr19:44899792-44899826 "
236+ " GRCh37 " : " chr19:45403049-45403083 "
237+ " hg19 " : " chr19:45403049-45403083 "
238+ " b37 " : " 19:45403049-45403083 "
166239` ` `
167240
241+ **Important** currently only GRCh38 has been extensively tested.
242+
168243---
169244
170245## Outputs
@@ -187,6 +262,6 @@ Important licensing information is available [here](docs/license_instructions.md
187262---
188263
189264# # Citation
190- If you use this pipeline , please cite :
265+ If you use TOMM40_WGS , please cite :
191266
192267> Vialle RA et al. (2025). *Genotyping TOMM40'523 Poly-T Polymorphisms Using Whole-Genome Sequencing*. [medRxiv 2025.04.23.25326276; doi: https://doi.org/10.1101/2025.04.23.25326276](https://doi.org/10.1101/2025.04.23.25326276)
0 commit comments