Skip to content

Commit 1204f1a

Browse files
committed
Update README
1 parent 2714c76 commit 1204f1a

File tree

2 files changed

+123
-48
lines changed

2 files changed

+123
-48
lines changed

README.md

Lines changed: 122 additions & 47 deletions
Original file line numberDiff line numberDiff line change
@@ -1,110 +1,184 @@
1-
# TOMM40_WGS: Genotyping TOMM40 '523 Poly-T Polymorphisms from Whole-Genome Sequencing
1+
# TOMM40_WGS: Genotyping TOMM40'523 Poly-T Polymorphisms from Whole-Genome Sequencing
22

3-
TOMM40_WGS is a Snakemake pipeline that enables automated genotyping of the TOMM40 '523 poly-T repeat from WGS data. It integrates STR genotyping tools, k-mer analysis, and machine learning to provide accurate repeat length and genotype predictions.
3+
`TOMM40_WGS` is a Snakemake pipeline that enables automated genotyping of the TOMM40'523 poly-T repeat from WGS data. It integrates STR genotyping tools, k-mer analysis, and machine learning to provide accurate repeat length and genotype predictions.
44

55
---
66

77
## Background
88

9-
The TOMM40 '523 poly-T repeat polymorphism (rs10524523) is a variable-length poly-T repeat located in intron 6 of the TOMM40 gene on chromosome 19. This polymorphism has been associated with age of onset of Alzheimer's disease (AD) and cognitive decline.
9+
The TOMM40'523 poly-T repeat polymorphism (`rs10524523`) is a variable-length poly-T repeat located in intron 6 of the *TOMM40* gene on chromosome 19. This polymorphism has been associated with age of onset of Alzheimer's disease (AD) and cognitive decline.
1010

1111
The repeat length is typically categorized into three classes:
12-
- **Short (S)**: ≤19 T residues
13-
- **Long (L)**: 20-29 T residues
14-
- **Very Long (VL)**: ≥30 T residues
12+
13+
* **Short (S)**: ≤19 T residues
14+
* **Long (L)**: 20-29 T residues
15+
* **Very Long (VL)**: ≥30 T residues
16+
17+
---
1518

1619
## 🚀 Quick Start
1720

21+
### Software Dependencies
22+
23+
- **Conda**: install via [Miniforge](https://conda-forge.org/download/) (recommended)
24+
25+
### Create a conda environment
26+
```bash
27+
conda config --set channel_priority strict
28+
conda create -y -n TOMM40_WGS_env -c conda-forge -c bioconda \
29+
snakemake==9.3.0 samtools==1.21 pandas==2.2.3
30+
conda activate TOMM40_WGS_env
31+
```
32+
1833
### Clone the repository and install dependencies
1934
```bash
2035
git clone https://github.com/RushAlz/TOMM40_WGS.git
2136
cd TOMM40_WGS
2237
chmod u+x TOMM40_WGS
2338

24-
# Pre-install dependencies (make sure to have snakemake and conda installed first)
25-
./TOMM40_WGS --conda-create-envs-only
39+
# Get help
40+
./TOMM40_WGS --help
2641
```
2742

2843
### (soon) Alternatively: Install with bioconda
2944
```bash
30-
conda install -c bioconda tomm40_wgs
45+
conda install -c conda-forge -c bioconda tomm40_wgs
3146
```
3247

3348
Use the `TOMM40_WGS` launcher script to run the pipeline with minimal setup.
3449

35-
### Software Dependencies
36-
37-
- **R**: 4.0.0 or higher
38-
- **Conda**
50+
---
3951

4052
### Basic usage (recommended)
4153

4254
```bash
4355
# Basic usage with a single BAM file (aligned to GRCh38 by default)
44-
TOMM40_WGS --input_wgs sample.bam --ref_fasta GRCh38.fa --cores 8
56+
TOMM40_WGS --input_wgs sample.bam --ref_fasta GRCh38.fa --cores 2
4557

4658
# Basic usage with multiple BAM files using glob pattern (quotes are required)
47-
TOMM40_WGS --input_wgs "*.bam" --ref_fasta GRCh38.fa --cores 8
59+
TOMM40_WGS --input_wgs "*.bam" --ref_fasta GRCh38.fa --cores 2
4860

4961
# Basic usage with CRAM files
50-
TOMM40_WGS --input_wgs "*.cram" --ref_fasta GRCh38.fa --genome_build GRCh38 --cores 8
62+
TOMM40_WGS --input_wgs "*.cram" --ref_fasta GRCh38.fa --genome_build GRCh38 --cores 2
5163

5264
# Using a sample table (TSV file with sample_id and path columns)
53-
TOMM40_WGS --input_table samples.tsv --ref_fasta GRCh38.fa --cores 8
65+
TOMM40_WGS --input_table samples.tsv --ref_fasta GRCh38.fa --cores 2
5466

5567
# With a custom configuration file
56-
TOMM40_WGS --configfile custom_config.yaml --cores 8
68+
TOMM40_WGS --configfile custom_config.yaml --cores 2
69+
```
5770

58-
# To pre-create environments
59-
TOMM40_WGS --conda-create-envs-only
71+
This script wraps Snakemake, applies default values when needed, and accepts any extra Snakemake flags (e.g., `--dryrun`).
6072

61-
# Get help
62-
TOMM40_WGS --help
73+
---
74+
75+
### End-to-end testing with real data (data from 1000 Genomes Project)
76+
77+
[Reference for the data](https://doi.org/10.1016/j.cell.2022.08.004)
78+
79+
> Install Miniforge3
80+
81+
```bash
82+
# Skip if conda is already installed
83+
wget -O Miniforge3.sh "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
84+
bash Miniforge3.sh -b -p "${HOME}/conda"
85+
source "${HOME}/conda/etc/profile.d/conda.sh"
86+
```
87+
88+
> Create TOMM40_WGS_env
89+
90+
```bash
91+
# Skip if already created
92+
conda config --set channel_priority strict
93+
conda create -y -n TOMM40_WGS_env -c conda-forge -c bioconda \
94+
snakemake==9.3.0 samtools==1.21 pandas==2.2.3
95+
conda activate TOMM40_WGS_env
6396
```
6497

65-
This script wraps Snakemake, applies default values when needed, and accepts any extra Snakemake flags (e.g., `--dryrun`, `--printshellcmds`, `--timestamp`).
98+
> Clone the repository and create a testing directory
6699
67-
### Testing with real data (data from 1000 Genomes Project)
68100
```bash
101+
git clone https://github.com/RushAlz/TOMM40_WGS.git
102+
chmod u+x TOMM40_WGS/TOMM40_WGS
103+
69104
# Make a temporary working directory
70105
mkdir -p tomm40_wgs_test
71106
cd tomm40_wgs_test
107+
```
108+
109+
> Test TOMM40_WGS with GRCh38 CRAMs
72110
111+
```bash
73112
# Download the reference genome
74113
REF_FASTA="https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa"
75114
REF_FASTA_LOCAL=$PWD/"GRCh38_full_analysis_set_plus_decoy_hla.fa"
76115
wget -O ${REF_FASTA_LOCAL} ${REF_FASTA}
77116
wget -O ${REF_FASTA_LOCAL}.fai ${REF_FASTA}.fai
78117

79118
# Subset the CRAM file to the TOMM40 region
80-
CRAMFILE_1="https://ftp.sra.ebi.ac.uk/vol1/run/ERR324/ERR3240157/HG00157.final.cram"
81-
CRAM_1_LOCAL=$PWD/"HG00157.final.cram"
119+
CRAMFILE_1="https://ftp.sra.ebi.ac.uk/vol1/run/ERR324/ERR3240114/HG00096.final.cram"
120+
CRAM_1_LOCAL=$PWD/"HG00096.GRCh38.cram"
82121
samtools view -h -T GRCh38_full_analysis_set_plus_decoy_hla.fa -C ${CRAMFILE_1} "chr19:44399792-45399826" > ${CRAM_1_LOCAL}
83122
samtools index ${CRAM_1_LOCAL}
84123

85124
# Download a second CRAM file for testing
86-
CRAMFILE_2="https://ftp.sra.ebi.ac.uk/vol1/run/ERR324/ERR3240160/HG00171.final.cram"
87-
CRAM_2_LOCAL=$PWD/"HG00171.final.cram"
125+
CRAMFILE_2="https://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239334/NA12878.final.cram"
126+
CRAM_2_LOCAL=$PWD/"NA12878.GRCh38.cram"
88127
samtools view -h -T GRCh38_full_analysis_set_plus_decoy_hla.fa -C ${CRAMFILE_2} "chr19:44399792-45399826" > ${CRAM_2_LOCAL}
89128
samtools index ${CRAM_2_LOCAL}
90129

91130
# Run the pipeline specifying the input files as a glob (quotes are required)
92-
TOMM40_WGS --input_wgs "*.cram" \
131+
../TOMM40_WGS/TOMM40_WGS --input_wgs "*.cram" \
132+
--ref_fasta GRCh38_full_analysis_set_plus_decoy_hla.fa \
133+
--output_dir results_GRCh38 \
134+
--cores 2
135+
136+
# Alternatively, run the pipeline specifying the input files as a TSV file
137+
echo -e "sample_id\tpath\nHG00096\t${CRAM_1_LOCAL}\nNA12878\t${CRAM_2_LOCAL}" > samples.tsv
138+
../TOMM40_WGS/TOMM40_WGS --input_table samples.tsv \
93139
--ref_fasta GRCh38_full_analysis_set_plus_decoy_hla.fa \
94-
--cores 8
140+
--output_dir results_GRCh38 \
141+
--cores 2
95142

96143
# Check results
97-
cat results/predictions/tomm40_predictions_summary.tsv
144+
cat results_GRCh38/predictions/tomm40_predictions_summary.tsv | column -t
145+
```
146+
147+
> Test TOMM40_WGS with b37 BAMs
148+
149+
```bash
150+
# Download the reference genome
151+
REF_FASTA="https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz"
152+
REF_FAI="https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz.fai"
153+
wget ${REF_FASTA}
154+
wget ${REF_FAI}
155+
REF_FASTA_LOCAL=$PWD/"hs37d5.fa.gz"
156+
157+
# Subset the BAM file to the TOMM40 region
158+
BAMFILE_1="https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/high_coverage_alignment/HG00096.wgs.ILLUMINA.bwa.GBR.high_cov_pcr_free.20140203.bam"
159+
BAM_1_LOCAL=$PWD/"HG00096.hg37.bam"
160+
samtools view -h -b ${BAMFILE_1} "19:44903049-45903083" > ${BAM_1_LOCAL}.tmp
161+
samtools addreplacerg -r '@RG\tID:HG00096\tSM:HG00096\tLB:lib1' -o ${BAM_1_LOCAL} ${BAM_1_LOCAL}.tmp
162+
samtools index ${BAM_1_LOCAL}
163+
164+
# Download a second CRAM file for testing
165+
BAMFILE_2="https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA12878/high_coverage_alignment/NA12878.mapped.ILLUMINA.bwa.CEU.high_coverage_pcr_free.20130906.bam"
166+
BAM_2_LOCAL=$PWD/"NA12878.hg37.bam"
167+
samtools view -h -b ${BAMFILE_2} "19:44903049-45903083" > ${BAM_2_LOCAL}.tmp
168+
samtools addreplacerg -r '@RG\tID:NA12878\tSM:NA12878\tLB:lib1' -o ${BAM_2_LOCAL} ${BAM_2_LOCAL}.tmp
169+
170+
samtools index ${BAM_2_LOCAL}
171+
172+
# Run the pipeline specifying the input files as a glob (quotes are required)
173+
../TOMM40_WGS/TOMM40_WGS --input_wgs "*.bam" \
174+
--ref_fasta hs37d5.fa.gz \
175+
--genome_build b37 \
176+
--output_dir results_b37 \
177+
--cores 2
98178

99-
# Test inputs as TSV file
100-
echo -e "sample_id\tpath\nHG00157\t${CRAM_1_LOCAL}\nHG00171\t${CRAM_2_LOCAL}" > samples.tsv
101-
# Run the pipeline specifying the input files as a TSV file
102-
TOMM40_WGS --input_table samples.tsv \
103-
--ref_fasta GRCh38_full_analysis_set_plus_decoy_hla.fa \
104-
--cores 8
105179

106180
# Check results
107-
cat results/predictions/tomm40_predictions_summary.tsv
181+
cat results_b37/predictions/tomm40_predictions_summary.tsv | column -t
108182
```
109183

110184
---
@@ -130,7 +204,7 @@ The pipeline performs the following steps for each input sample:
130204
Input data can be specified in two ways:
131205

132206
1. **Using command line parameters**:
133-
- `--input_wgs`: A single file, a glob pattern (must be quoted, e.g., `"*.cram"`), or a space-separated list of files
207+
- `--input_wgs`: A single file, a glob pattern (must be quoted, e.g., `"*.cram"`)
134208
- `--input_table`: Path to a TSV file with `sample_id` and `path` columns
135209

136210
2. **Using configuration file**:
@@ -147,9 +221,8 @@ sample1 /path/to/sample1.bam
147221
sample2 /path/to/sample2.bam
148222
```
149223

150-
**Important** currently only GRCh38 is supported. Future configuration options:
151-
152224
- `genome_build`: Specify the genome build according to the reference genome used for alignment. The default is `GRCh38`.
225+
153226
Supported builds include:
154227

155228
- GRCh38
@@ -158,13 +231,15 @@ Supported builds include:
158231
- b37
159232

160233
```yaml
161-
# TOMM40`523 region mapping to genome builds
162-
"GRCh38": "chr19:44399792-45399826"
163-
"GRCh37": "chr19:44399780-45399837"
164-
"hg19": "chr19:44399780-45399837"
165-
"b37": "19:44399780-45399837"
234+
# TOMM40`523 (rs10524523) coordinates mapping to genome builds
235+
"GRCh38": "chr19:44899792-44899826"
236+
"GRCh37": "chr19:45403049-45403083"
237+
"hg19": "chr19:45403049-45403083"
238+
"b37": "19:45403049-45403083"
166239
```
167240
241+
**Important** currently only GRCh38 has been extensively tested.
242+
168243
---
169244
170245
## Outputs
@@ -187,6 +262,6 @@ Important licensing information is available [here](docs/license_instructions.md
187262
---
188263

189264
## Citation
190-
If you use this pipeline, please cite:
265+
If you use TOMM40_WGS, please cite:
191266

192267
> Vialle RA et al. (2025). *Genotyping TOMM40'523 Poly-T Polymorphisms Using Whole-Genome Sequencing*. [medRxiv 2025.04.23.25326276; doi: https://doi.org/10.1101/2025.04.23.25326276](https://doi.org/10.1101/2025.04.23.25326276)

Snakefile

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -35,7 +35,7 @@ region_mapping = {
3535
"GRCh38": "chr19:44399792-45399826",
3636
"GRCh37": "chr19:44903049-45903083",
3737
"hg19": "chr19:44903049-45903083",
38-
"b37": "19:44903049-45903083"
38+
"b37": "19:44903049-45903083"
3939
}
4040

4141
def cram_lookup(wildcards):

0 commit comments

Comments
 (0)