ViralQC is a tool and package for quality control of consensus virus genomes which uses the nextclade tool, BLAST and a series of internal logics to classify viral sequences and perform quality control of complete genomes, regions or target genes.
The complete documentation is available at read the docs.
First, install the dependencies:
or with micromamba
micromamba install \
-c conda-forge \
-c bioconda \
"python>=3.8.0,<3.12.0" \
"snakemake-minimal>=7.32.0,<7.33.0" \
"blast>=2.16.0,<2.17.0" \
"nextclade>=3.15.0,<3.16.0" \
"seqtk>=1.5.0,<1.6.0" \
"ncbi-datasets-cli>=18.9.0,<18.10.0" \
"taxonkit>=0.20.0,<0.21.0"Then, install viralQC
pip install viralQCgit clone https://github.com/InstitutoTodosPelaSaude/viralQC.git
cd viralQCmicromamba env create -f env.yml
micromamba activate viralQCpip install -e .vqc --helpThis command configures local datasets using nextclade. It is necessary to run at least once to generate a local copy of the nextclade datasets, before running the run command
vqc get-nextclade-datasets --cores 2A directory name can be specified, the default is datasets.
vqc get-nextclade-datasets --cores 2 --datasets-dir <directory_name>This command configures local blast database with all ncbi refseq viral genomes. It is necessary to run at least once to generate a local blast database, before running the run command.
vqc get-blast-database --cores 2A output directory name can be specified, the default is datasets.
vqc get-blast-database --cores 2 --output-dir <directory_name>This command runs several steps to identify viruses represented in the input FASTA file and executes Nextclade for each identified virus/dataset.
vqc run --input test_data/sequences.fastaSome parameters can be specified:
--input— Required. Path to the input FASTA file.--output-dir— Output directory name. Default:output--output-file— File to write final results. Valid extensions: .csv, .tsv or .json. Default:results.tsv--datasets-dir— Path to local directory containing nextclade datasets. Default:datasets--ns-min-score— Nextclade sort min score. Default:0.1--ns-min-hits— Nextclade sort min hits. Default:10--blast-database— Path to local blast database. Default:datasets/blast.fasta--blast-database-metadata— Path to local blast database metadata. Default:datasets/blast.tsv--blast-pident— Identity threshold for BLAST analysis (percentage). Default:80--blast-evalue— E-value threshold for BLAST analysis. Default:1e-10--blast-qcov— Minimum query coverage per HSP for BLAST analysis (percentage). Default:80--cores— Number of threads/cores to use. Default:1
The output directory has the following structure:
output/
├── identified_datasets/ # Outputs from nextclade sort
│ ├── datasets_selected.tsv # Formatted nextclade sort output showing the mapping between input sequences and local datasets
│ ├── viruses.tsv # Nextclade sort output showing the mapping between input sequences and remote (nextclade_data) datasets
│ ├── viruses.external_datasets.tsv # Nextclade sort output showing the mapping between input sequences and external (outside nextclade_data) datasets
│ ├── unmapped_sequences.txt # Names of input sequences that were not mapped to any virus on nextclade sort
│ └── <virus/dataset>/ # Sequences for each dataset split into sequences.fa files
│ └── sequences.fa
├── blast_results/ # BLAST analysis results
│ ├── unmapped_sequences.blast.tsv # BLAST results for unmapped sequences
│ └── blast_viruses.list # List of unique virus accessions identified by BLAST
├── nextclade_results/ # Nextclade run outputs
│ ├── <virus>.nextclade.tsv # Standard nextclade run output for each identified virus
│ └── <accession>.generic.nextclade.tsv # Generic nextclade run output for BLAST-identified viruses
├── gff_files/ # GFF annotation files
│ ├── <virus>.nextclade.gff # GFF files from standard nextclade runs
│ └── <accession>.generic.nextclade.gff # GFF files from generic nextclade runs
├── logs/ # Log files for all pipeline steps
│ ├── nextclade_sort.log
│ ├── blast.log
│ ├── nextclade.<virus>.log
│ ├── generic_nextclade.<virus>.log
│ ├── pp_nextclade.log
│ └── extract_target_regions.log
├── results(.tsv,.csv,.json) # Final combined results (format: .tsv, .csv, or .json)
├── sequences_target_regions.bed # BED file with target regions (includes 4 columns: seqname, start, end, region_name)
└── sequences_target_regions.fasta # FASTA file with extracted target region sequences
vqc-serverGo to http://127.0.0.1:8000/docs
Install development dependencies and run black into viralqc directory.
pip install -e ".[dev]"
black viralqcThis command is responsible for downloading and configuring the Nextclade datasets required for analysis. It reads a configuration file (config/datasets.yml) to determine which datasets to fetch.
Logic:
- Read Configuration: The command reads the
config/datasets.ymlfile to identify the list of viruses and their corresponding Nextclade dataset names or GitHub repositories. - Download Nextclade Datasets: For viruses hosted on Nextclade's official repository, it uses the
nextclade dataset getcommand to download the reference sequence, gene map, and other necessary files. - Download GitHub Datasets: For viruses hosted on GitHub, it uses a custom Python script (
get_github_dataset.py) to download the dataset files. - Generate Minimizer Index: For the GitHub datasets, it generates a minimizer index (
external_datasets_minimizers.json) usingget_minimizer_index.py. This index is crucial for thenextclade sortstep to recognize these external datasets.
This command builds a local BLAST database containing all viral genomes from NCBI RefSeq. This database is used to identify sequences that are not recognized by Nextclade's default datasets.
Logic:
- Download Genomes: It downloads all viral genomes from NCBI RefSeq using the
datasetsCLI tool. - Format Database: It processes the downloaded files to create a FASTA file (
blast.fasta) and extracts metadata (accession, virus name, taxonomy ID) into a TSV file (blast.tsv). - Taxonomy Information: It downloads NCBI taxonomy data and uses
taxonkitto add species names and taxonomy IDs to the metadata. - Create BLAST DB: It uses
makeblastdbto create the searchable BLAST database. - Generate GFFs: It converts the JSONL annotation reports from NCBI into GFF3 format using
jsonl_to_gff.py. Crucially, this step filters out accessions where the CDS length is not divisible by 3 or other structural anomalies, ensuring that only valid annotations are used for generic Nextclade runs.
This is the main analysis command. It takes a FASTA file of query sequences and performs virus identification, clade assignment, and quality control.
Logic:
- Nextclade Sort: The input sequences are first passed to
nextclade sort. This tool compares the sequences against the local Nextclade datasets (including the external ones via the minimizer index) to identify which virus each sequence belongs to. - Datasets Identification:: Identify the virus/dataset that each sequence belongs to with format_nextclade_sort.py
- Mapped Sequences: Sequences that match a known dataset are assigned to that virus.
- Unmapped Sequences: Sequences that do not match any dataset are flagged as "unmapped".
- BLAST Analysis: The unmapped sequences are queried against the local BLAST database (
blastn) to identify their closest viral match. Here the parameters--blast-pident,--blast-evalueand--blast-qcovcan be specified to control the BLAST sensitivity, the default values are80,1e-10and80respectively which are not indicated for metagenomic analysis of unknown viruses. - Nextclade Analysis:
- A. Standard Run: For sequences mapped to a known Nextclade dataset,
nextclade runis executed using that dataset. - B. Generic Run: For sequences identified by BLAST (but not in the Nextclade datasets), a "generic"
nextclade runis performed. This uses the reference sequence and GFF annotation from the BLAST database (created byget-blast-database). - CDS Reordering: For all nextclade results, the
reorder_cds.pyscript is executed to reorder the CDS information in thecdsCoveragecolumn. The alphabetic order from nextclade results is replaced by the order in which each CDS appears in the genome (based on start position in the GFF file).
- A. Standard Run: For sequences mapped to a known Nextclade dataset,
- Post-Processing: The results from all
nextclade runexecutions (standard and generic) and the BLAST results for any remaining unmapped sequences are combined into a single output file (results.tsv). - Target Region Extraction: Finally, specific genomic regions of interest (if defined) are extracted from the sequences based on quality criteria.