SegmentQTL is a segmentation-aware molecular quantitative trait loci (molQTL) analysis tool designed for copy-number–driven cancers. It incorporates genomic segmentation data to improve QTL mapping accuracy by filtering out associations disrupted by structural variations. This approach prevents spurious signals caused by breakpoints, ensuring biologically meaningful genotype-phenotype associations.
SegmentQTL supports both nominal and permutation-based association testing, along with false discovery rate (FDR) correction. The tool efficiently processes large datasets, leveraging multi-core parallelization and supporting continuous genotype dosage data to enhance analysis precision.
Installation requires Python and pip (the Python package installer) to be preinstalled.
```bash
git clone https://github.com/HautaniemiLab/SegmentQTL.git
cd SegmentQTL
# (Optional, but recommended) Create a virtual environment
python -m venv <my-venv>
source <my-venv>/bin/activate
pip install -r requirements.txt
```

SegmentQTL is executed via the command line with various options to control input data, analysis modes, and computational resources. The key arguments are:

- `--mode`: Specifies the analysis mode:
  - `nominal`: Perform nominal association testing.
  - `perm`: Perform permutation-based testing.
  - `fdr`: Apply FDR correction to existing results.
- `--chromosome`: Chromosome number (e.g., `21` or `X`). Supports the `chr` prefix (e.g., `chr21`).
- `--genotypes`: Path to genotype data directory.
- `--quantifications`: Path to CSV file containing phenotype quantifications (e.g., gene expression). Note: Provide a file with quantifications for the whole genome. This is needed for reliable permutations even if SegmentQTL processes one chromosome at a time.
- `--covariates`: Path to CSV file with sample-level covariate data.
- `--copynumber`: Path to CSV file with copy number data.
- `--segmentation`: Path to segmentation file with breakpoint data.
- `--all_variants`: Test all variants for a given phenotype. Provide a phenotype ID or use without a value to process all phenotypes.
- `--perm_method`: Method used for permutation (`beta` or `direct`).
- `--num_permutations`: Number of permutations per phenotype (default: `8000`).
- `--window`: Window size in base pairs for cis-mapping (default: `1,000,000` bp).
- `--num_cores`: Number of CPU cores to use for parallel processing (default: `1`).
- `--out_dir`: Directory where results are saved.
- `--fdr_out`: File path for saving FDR-corrected results. Must have a `.csv` file extension.
- `--plot_threshold`: P-value threshold for generating plots (`-1` disables plotting).
- `--plot_dir`: Directory for saving generated plots.

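Because SegmentQTL processes one chromosome per invocation, a small driver script can assemble the per-chromosome commands. The following is a minimal sketch that uses only the arguments listed above; the helper name `build_command` is hypothetical, the input paths are those of the mock dataset, and the chromosome list (1 to 22 plus X) is an assumption about your data.

```python
import shlex


def build_command(chromosome, mode="nominal", num_cores=1, out_dir="results/"):
    """Assemble one SegmentQTL invocation as an argument list (hypothetical helper)."""
    return [
        "python", "-m", "segmentqtl",
        "--mode", mode,
        "--chromosome", str(chromosome),
        "--num_cores", str(num_cores),
        # Input paths below are the mock dataset paths; replace with your own.
        "--genotypes", "mock/genotypes",
        "--quantifications", "mock/quantifications.csv",
        "--covariates", "mock/covariates.csv",
        "--copynumber", "mock/copynumbers.csv",
        "--segmentation", "mock/segments.csv",
        "--out_dir", out_dir,
    ]


# Assumed chromosome set: autosomes 1-22 plus X.
chromosomes = [str(n) for n in range(1, 23)] + ["X"]
commands = [build_command(c, num_cores=4) for c in chromosomes]

for cmd in commands:
    # Print the shell-quoted command; to execute it, use subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems if a path contains spaces.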
SegmentQTL requires five main input files: genotypes, quantifications, covariates, copy number data, and segmentation information. Below are the required formats and examples for each input.
The --genotypes argument should point to a directory containing per-chromosome genotype files, typically named chr1.csv, chr2.csv, ..., chr22.csv, chrX.csv.
Each file corresponds to one chromosome and contains genotype dosages for multiple samples.
See to compute genotype dosages.
- `ID`: Variant identifier in the format `chr:pos:ref:alt` (e.g., `chr8:123456:A:G`).
- `<sample1>`, `<sample2>`, ...: Sample-specific dosage values. Dosages are continuous values between `0` and `1`.

| ID | sample1 | sample2 | sample3 |
|---|---|---|---|
| chr8:123456:A:G | 0.32 | 0.45 | 0.10 |
| chr8:123789:T:C | 0.76 | 0.88 | 0.34 |
| chr8:124000:G:T | 0.00 | 0.05 | 0.50 |
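
Before running an analysis, it can help to sanity-check a genotype file against the format described above. The sketch below is not part of SegmentQTL; `check_genotypes` is a hypothetical helper, and the regular expression is an assumption inferred from the example IDs (it accepts only A/C/G/T alleles, and the code does not handle missing values).

```python
import csv
import io
import re

# Assumed pattern for chr:pos:ref:alt, inferred from the example chr8:123456:A:G.
VARIANT_ID = re.compile(r"^chr[0-9A-Za-z]+:\d+:[ACGT]+:[ACGT]+$")


def check_genotypes(handle):
    """Return a list of (line_number, problem) tuples for a genotype CSV."""
    problems = []
    reader = csv.reader(handle)
    header = next(reader)
    if header[0] != "ID":
        problems.append((1, "first column should be named ID"))
    for line_number, row in enumerate(reader, start=2):
        variant, dosages = row[0], row[1:]
        if not VARIANT_ID.match(variant):
            problems.append((line_number, f"unexpected variant ID: {variant}"))
        if len(dosages) != len(header) - 1:
            problems.append((line_number, "dosage count does not match header"))
        for value in dosages:
            # Dosages are documented as continuous values between 0 and 1.
            if not 0.0 <= float(value) <= 1.0:
                problems.append((line_number, f"dosage out of range: {value}"))
    return problems


example = io.StringIO(
    "ID,sample1,sample2,sample3\n"
    "chr8:123456:A:G,0.32,0.45,0.10\n"
    "chr8:123789:T:C,0.76,0.88,0.34\n"
    "chr8:124000:G:T,0.00,0.05,1.50\n"  # last dosage is deliberately invalid
)
problems = check_genotypes(example)
print(problems)  # [(4, 'dosage out of range: 1.50')]
```

For a real file, pass an open file handle (for example `open("mock/genotypes/chr8.csv", newline="")`) instead of the in-memory example.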
The --quantifications argument should point to a CSV file containing normalized phenotype levels (e.g., gene expression) for all samples across the genome.
- `chr`: Chromosome where the phenotype is located (e.g., `chr1`, `chrX`).
- `start`: Start position of the phenotype.
- `end`: End position of the phenotype.
- `gene_id`: Unique identifier for the phenotype (e.g., Ensembl gene ID).
- `<sample1>`, `<sample2>`, ...: Normalized phenotype values per sample.

| chr | start | end | gene_id | sample1 | sample2 | sample3 |
|---|---|---|---|---|---|---|
| chr8 | 123000 | 124000 | ENSG00000123 | 1.21 | 0.98 | 1.34 |
| chr8 | 130000 | 132000 | ENSG00000456 | 0.87 | 1.05 | 0.92 |
Note: Provide quantifications for the entire genome, even if only one chromosome is analyzed at a time. This ensures correct permutation testing and FDR correction.
The --covariates argument should point to a CSV file containing covariate values for each sample. First row has n entries (samples); subsequent rows have n + 1 entries (covariate name + values).
- Row 1: Sample IDs only (e.g., `sample1,sample2,sample3`)
- Row 2+: First cell is the covariate name, followed by values for each sample.
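
Since the header row is one entry shorter than the data rows, a generic CSV reader may misalign the columns. The sketch below shows one way to parse this layout with the standard library. It is an illustration, not SegmentQTL's own loader: `read_covariates` is a hypothetical helper, the covariate names `age` and `purity` and their values are made up, and all covariates are assumed to be numeric.

```python
import csv
import io


def read_covariates(handle):
    """Parse a header of n sample IDs followed by rows of covariate name + n values."""
    reader = csv.reader(handle)
    samples = next(reader)
    covariates = {}
    for line_number, row in enumerate(reader, start=2):
        if len(row) != len(samples) + 1:
            raise ValueError(
                f"line {line_number}: expected {len(samples) + 1} entries, got {len(row)}"
            )
        name, values = row[0], row[1:]
        # Assumes numeric covariates; categorical ones would need encoding first.
        covariates[name] = dict(zip(samples, (float(v) for v in values)))
    return samples, covariates


# Made-up example content in the documented layout.
example = io.StringIO(
    "sample1,sample2,sample3\n"
    "age,61,58,70\n"
    "purity,0.82,0.64,0.91\n"
)
samples, covariates = read_covariates(example)
print(covariates["purity"]["sample2"])  # 0.64
```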
The --copynumber argument should point to a CSV file containing phenotype-level copy number values for each sample.
- `gene_id`: Ensembl gene ID or equivalent identifier.
- `<sample1>`, `<sample2>`, ...: Copy number values per sample.

| gene_id | sample1 | sample2 | sample3 |
|---|---|---|---|
| ENSG00000123 | 2.10 | 1.85 | 1.92 |
| ENSG00000456 | 1.75 | 2.30 | 2.00 |
The --segmentation argument should point to a CSV file with structural segmentation data for each sample. This is used to determine if a variant and gene are on the same intact genomic segment.
- `sample`: Sample ID.
- `chr`: Chromosome identifier.
- `startpos`: Start coordinate of the segment.
- `endpos`: End coordinate of the segment.

| sample | chr | startpos | endpos |
|---|---|---|---|
| sample1 | chr8 | 100000 | 200000 |
| sample1 | chr8 | 200001 | 300000 |
| sample2 | chr8 | 120000 | 250000 |
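
To illustrate the idea of a variant and a gene sharing an intact segment, here is a minimal sketch using the example rows above. It is not SegmentQTL's implementation: the function names are hypothetical, and the criterion used (the variant and the entire gene must lie inside one segment of the sample) is an assumption about how "same segment" is defined.

```python
def segment_containing(segments, sample, chrom, position):
    """Return the segment of `sample` on `chrom` that contains `position`, or None."""
    for seg in segments:
        if (seg["sample"] == sample and seg["chr"] == chrom
                and seg["startpos"] <= position <= seg["endpos"]):
            return seg
    return None


def on_same_segment(segments, sample, chrom, variant_pos, gene_start, gene_end):
    """True if the variant and the whole gene fall within one segment (assumed criterion)."""
    seg = segment_containing(segments, sample, chrom, variant_pos)
    return seg is not None and seg["startpos"] <= gene_start and gene_end <= seg["endpos"]


segments = [
    {"sample": "sample1", "chr": "chr8", "startpos": 100000, "endpos": 200000},
    {"sample": "sample1", "chr": "chr8", "startpos": 200001, "endpos": 300000},
    {"sample": "sample2", "chr": "chr8", "startpos": 120000, "endpos": 250000},
]

# Variant at chr8:123456 and a gene spanning 123000-124000: same segment in sample1.
print(on_same_segment(segments, "sample1", "chr8", 123456, 123000, 124000))  # True
# A hypothetical gene spanning 199000-201000 crosses the breakpoint at 200000/200001.
print(on_same_segment(segments, "sample1", "chr8", 123456, 199000, 201000))  # False
```

Under this assumption, a sample in which a breakpoint separates the variant from the gene would be excluded from that particular association test, which is consistent with the `number_of_samples` column varying between gene-variant pairs.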
The primary output file of SegmentQTL is a CSV containing gene-variant associations.
| Column Name | Description |
|---|---|
| `phenotype` | Phenotype identifier. |
| `variant` | Variant identifier. |
| `number_of_samples` | Effective number of samples used in the association test after the segment filtering. |
| `slope` | Estimated regression coefficient (effect size) for the genotype–phenotype association. |
| `slope_se` | Standard error of the slope estimate. |
| `nominal_p` | P-value from the nominal association test. |
| `p_adj` | Permutation adjusted p-value. |
| `chr` | Chromosome where the gene and variant are located. |
| `fdr` | FDR corrected p-value. |

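
As an illustration of working with these columns, the sketch below keeps associations whose `fdr` value is at or below a threshold. The column names come from the table above; the helper name, the column order, the 0.05 threshold, and all numeric values in the example are assumptions made for demonstration only.

```python
import csv
import io


def significant_associations(handle, fdr_threshold=0.05):
    """Return result rows with fdr <= threshold, sorted by ascending fdr."""
    rows = [row for row in csv.DictReader(handle) if row["fdr"] != ""]
    hits = [row for row in rows if float(row["fdr"]) <= fdr_threshold]
    return sorted(hits, key=lambda row: float(row["fdr"]))


# Invented rows using the documented column names (not real results).
example = io.StringIO(
    "phenotype,variant,number_of_samples,slope,slope_se,nominal_p,p_adj,chr,fdr\n"
    "ENSG00000123,chr8:123456:A:G,48,0.41,0.10,0.0002,0.004,chr8,0.02\n"
    "ENSG00000456,chr8:123789:T:C,52,0.05,0.12,0.68,0.71,chr8,0.71\n"
)
hits = significant_associations(example)
for row in hits:
    print(row["phenotype"], row["variant"], row["fdr"])
```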
These examples assume you're in the root of the SegmentQTL folder.
First, unzip the provided mock dataset:
```bash
unzip mock.zip
```

Run a nominal association test for chromosome 8 using 4 CPU cores:
```bash
python -m segmentqtl --mode nominal --chromosome 8 --num_cores 4 \
--genotypes mock/genotypes --quantifications mock/quantifications.csv \
--covariates mock/covariates.csv --copynumber mock/copynumbers.csv \
--segmentation mock/segments.csv --out_dir results/
```

Perform 25 permutations using the beta approximation method:
```bash
python -m segmentqtl --mode perm --chromosome 8 --num_permutations 25 \
--perm_method beta --num_cores 4 \
--genotypes mock/genotypes --quantifications mock/quantifications.csv \
--covariates mock/covariates.csv --copynumber mock/copynumbers.csv \
--segmentation mock/segments.csv --out_dir results/
```

Note that the number of permutations should not exceed the number of phenotypes in the full dataset.
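
For intuition about what a permutation-adjusted p-value represents, the sketch below uses the common empirical definition: the share of permutations whose best p-value is at least as small as the observed one, with a +1 correction. The source does not spell out the exact formula SegmentQTL uses for either the `direct` or the `beta` method, so treat this as a generic illustration; the function name and the numbers are made up.

```python
def empirical_p_value(observed_p, permuted_min_p):
    """Empirical p-value: (1 + count of permuted p <= observed p) / (1 + permutations)."""
    as_extreme = sum(1 for p in permuted_min_p if p <= observed_p)
    return (as_extreme + 1) / (len(permuted_min_p) + 1)


# Invented minimum p-values from 9 permutations of one phenotype.
permuted = [0.20, 0.03, 0.51, 0.008, 0.12, 0.34, 0.07, 0.45, 0.29]
adjusted = empirical_p_value(0.01, permuted)
print(adjusted)  # (1 + 1) / (9 + 1) = 0.2
```

With this definition the smallest attainable adjusted p-value is 1 / (permutations + 1), which is why a small permutation count such as 25 is only suitable for quick tests on the mock data.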
Apply false discovery rate (FDR) correction to previously computed results:
```bash
python -m segmentqtl --mode fdr --out_dir results/ --fdr_out corrected_results.csv
```

Run SegmentQTL for all variants of a given phenotype ID:
```bash
python -m segmentqtl --mode nominal --all_variants ENSG00000003987 \
--chromosome 8 --num_cores 1 \
--genotypes mock/genotypes --quantifications mock/quantifications.csv \
--covariates mock/covariates.csv --copynumber mock/copynumbers.csv \
--segmentation mock/segments.csv --out_dir results/
```

Generate QTL plots for all tested phenotypes:
```bash
python -m segmentqtl --mode perm --plot_threshold 1 --plot_dir plots/ \
--chromosome 8 --num_cores 4 --num_permutations 25 \
--genotypes mock/genotypes --quantifications mock/quantifications.csv \
--covariates mock/covariates.csv --copynumber mock/copynumbers.csv \
--segmentation mock/segments.csv --out_dir results/
```

If you use SegmentQTL in your work, please cite:
Samuel Leppiniemi, et al. SegmentQTL: Identifying genetic variants influencing molecular phenotypes in copy number-driven cancers. bioRxiv, 2025. https://doi.org/10.1101/2025.07.28.667150

