BCFtools-Pipeline (Nextflow DSL2)

This repository contains a Nextflow DSL2 pipeline for SNP variant calling quality control (QC) and downstream allele frequency annotation stratified by sex and ancestry.

It integrates several QC steps (genotype-, variant-, and sample-level filtering) and recalculates allele frequencies (AFs) based on sample metadata using BCFTools.

Workflow Overview

The pipeline performs the following steps:

INDEX_VCF
- Ensures each input VCF has an index (.tbi).
- If missing, it creates one with tabix.
SPLIT_MULTIALLELIC
- Splits multiallelic variants into biallelic records using bcftools norm.
GENOTYPE_QC
- Masks low-quality genotypes (./.) according to thresholds:
  - Genotype Quality (GQ)
  - Depth of Coverage (DP)
  - Allelic Depth ratio (AD)
- Automatically detects if FORMAT tags exist before applying filters.
VARIANT_QC
- Filters out low-quality variants based on INFO and QUAL fields.
- Supported filters: QUAL, QD, DP, MQ, FS, ReadPosRankSum.
SAMPLE_QC
- Removes samples failing quality thresholds:
  - Mean coverage
  - Call rate
  - Het/Hom ratio
  - Number of singletons
  - Contamination (using sceVCF)
Thresholds differ depending on seq_type (WES vs WGS).
ADD_AF
- Uses a metadata CSV file to define groups (sex + ancestry).
- Annotates the final VCF with total and group-specific allele frequencies.

Technical requirements

Nextflow ≥ 24.04
Java ≥ 17
bcftools ≥ 1.10
tabix (htslib) ≥ 1.13
sceVCF binaries for contamination checks

Input Requirements

VCF files (.vcf.gz) in the input directory.

⚠️ Important: For allele frequency (AF) recalculations to be valid, the input VCFs should contain only unrelated individuals.
Index files (.tbi) [OPTIONAL], will be created if missing.
Metadata CSV with columns SAMPLE_ID, SEX and ANCESTRY.

Example:
```
SAMPLE,SEX,ANCESTRY
sample1,1,EUR
sample2,2,AFR
sample3,1,EAS
```
Some important considerations about the submitted metadata:
- Ensure the order of your metadata columns is the one shown above.
- Sex should be coded as 1 = MALE, 2 = FEMALE.
- Ancestry codes are free-text (e.g., EUR, AFR, EAS). But, take into consideration that the naming on the CSV will be used to annotate the ancestry AF fields in the VCF.
sceVCF binary path

Configuration

Edit nextflow.config to adjust parameters.

params {
    // Paths
    input        = "/path/to/folder/with/vcfs/"
    output_stats = "/home/mireia/GitHub/bcftools-pipeline/bcftools-pipeline"
    sceVCF_path  = "/path/to/sceVCF_binaries"
    metadata_csv = "/path/to/metadata.csv"  // sample/sex/ancestry
    seq_type     = "WGS" // or "WES"
    threads      = 4

     // Pipeline behavior
    seq_type     = 'WGS' // or 'WES'
    threads      = 4

     // -----------------------
    // QC Thresholds
    // -----------------------
    qc {

        // Genotype QC defaults 
        genotype {
        gq_threshold         = 20
        dp_threshold         = 10
        ad_ratio_threshold   = 0.2
        }

        // Variant QC defaults
        variant {
        qual_threshold                 = 30
        qd_threshold                   = 2.0
        dp_threshold                   = 10
        mq_threshold                   = 40
        fs_threshold                   = 60
        read_pos_rank_sum_threshold    = -8.0
        }

        // Sample QC (choose set via `seq_type`)
        sample {
        // Whole-Exome defaults
        wes {
            coverage_threshold       = 10
            het_hom_threshold        = 10
            call_rate_threshold      = 0.95
            singletons_threshold     = 5000
            contamination_threshold  = 0.00015
        }
        // Whole-Genome defaults
        wgs {
            coverage_threshold       = 15
            het_hom_threshold        = 3.3
            call_rate_threshold      = 0.95
            singletons_threshold     = 100000
            contamination_threshold  = 0.05
        }
        }
    }
}

Notes

seq_type selects which Sample QC thresholds are applied (wes vs wgs).
metadata_csv is required.
The fields INFO/QD, INFO/DP, INFO/MQ, INFO/FS, INFO/ReadPosRankSum, FORMAT/GQ, FORMAT/GT and FORMAT/AD must be correctly described in the header for the filtering to work correctly.

Running the Pipeline

Basic run:

nextflow run main.nf

With custom parameters

nextflow run main.nf \
  --input "/data/vcfs/" \
  --metadata_csv "/data/metadata.csv" \
  --sceVCF_path "/tools/sceVCF" \
  --seq_type "WGS" \
  --threads 8

NOTE: This will overwrite the configuration parameters set in nextflow.config

HTML report and timeline

nextflow run main.nf -with-report report.html -with-timeline timeline.html

This command will automatically generate two reports, report.html and timeline.html.

report.html will contain:

Process run counts and statuses (cached/failed/succeeded)
CPU, memory, time usage per process
I/O stats and container info (if used)
Cache hit ratio and overall runtime

timeline.html will contain:

Each task’s start/finish times
Parallelism and queueing
Retries/restarts and durations

Outputs

After a successful execution, you'll find inside the /work folder:

Intermediate outputs

Indexed VCFs (.tbi)
Split multiallelic VCFs
Masked genotypes VCFs
Variant-filtered and masked genotypes VCFs
Variant and Sample-filtered with masked genotypes VCFs
sample-qc-fails.tsv file with the samples deleted and which filter step failed

Final output

input_vcf_baseName.vcf.gz (TODO: make the name the same as the original VCF with -AF_recalc) → fully filtered VCF with allele frequency annotations and witout sample level information.

input_vcf_baseName.vcf.gz.tbi → index for the final VCF.
In $output_stats there will be a log file with information about each step of the pipeline run per file.

💡 How to find the files inside the /work folder?

Example a of nextflow output:

[34/34b365] process > INDEX_VCF (test.vcf.bgz)                                           [100%] 1 of 1 ✔
[bc/56f27a] process > SPLIT_MULTIALLELIC (test.vcf.bgz)                                  [100%] 1 of 1 ✔
[46/bf10e7] process > GENOTYPE_QC (test_split-multiallelic.vcf.bgz)                      [100%] 1 of 1 ✔
[88/8b101c] process > VARIANT_QC (test_split-multiallelic-masked.vcf.gz)                 [100%] 1 of 1 ✔
[33/a6c930] process > SAMPLE_QC (test_split-multiallelic-masked-filtered.vcf.gz)         [100%] 1 of 1 ✔
[c7/065171] process > ADD_AF (test_split-multiallelic-masked-filtered.sample_qc.vcf.bgz) [100%] 1 of 1 ✔

The intermediate files are saved inside the nextflow folder for each step.

Go to /work/bc/56f27a[-->] and you'll find the VCF with the multiallelic variants splitted.

Go to /work/33/a6c930[-->] and you'll find all the intermediate files created during the sample QC.

Go to /work/c7/065171[-->] for the final files of the pipeline.

Recomendation: Use at least the argument -with-report report.html, this will always show you the folders where the files have been saved.

How to generate metadata_csv

If you don't have the necessary metadata to create metadata.csv you can infer it using tools such as GRAF tools and Hail.

Graf tools

GRAF provides functions to infer sex (graf sex, PLINK input), detect related samples (graf rel, PLINK input), and assign ancestry (graf anc, VCF input). Note that, for ancestry inference, only super-population calls (e.g. European, Asian, African...) are recommended, since finer-level predictions are not sufficiently accurate.

Hail

Hail offers similar capabilities with impute_sex for sex inference and king and pc_relate for relatedness filtering.

References

Lu W, Gauthier LD, Poterba T, Giacopuzzi E, Goodrich JK, Stevens CR, King D, Daly MJ, Neale BM, Karczewski KJ. CHARR efficiently estimates contamination from DNA sequencing data. bioRxiv [Preprint]. 2023 Jun 28:2023.06.28.545801. doi: 10.1101/2023.06.28.545801. Update in: Am J Hum Genet. 2023 Dec 7;110(12):2068-2076. doi: 10.1016/j.ajhg.2023.10.011. PMID: 37425834; PMCID: PMC10327099.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

BCFtools-Pipeline (Nextflow DSL2)

Workflow Overview

Technical requirements

Input Requirements

Configuration

Running the Pipeline

Basic run:

With custom parameters

HTML report and timeline

Outputs

How to generate metadata_csv

Graf tools

Hail

References

About

Uh oh!

Releases 2

Packages

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
.gitignore		.gitignore
README.md		README.md
main.nf		main.nf
nextflow.config		nextflow.config
sceVCF		sceVCF

EGA-archive/AF_bcftools_pipeline

Folders and files

Latest commit

History

Repository files navigation

BCFtools-Pipeline (Nextflow DSL2)

Workflow Overview

Technical requirements

Input Requirements

Configuration

Running the Pipeline

Basic run:

With custom parameters

HTML report and timeline

Outputs

How to generate metadata_csv

Graf tools

Hail

References

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Languages

Packages