-
Notifications
You must be signed in to change notification settings - Fork 0
Inputs and Outputs
You can provide single folder with all the files in one place or multiple folders. Folders are searched recursively for inputs with extensions *.f*q.gz. So be careful you don't have files in the folders you don't want included. The following file structure will load both sets of files if you provide the parent folder as input. You can also provide multiple separate folders using -i as shown above.
For example if you provide -i data with the following structure:
data/
├── ERR1588781
│ ├── ERR1588781_1.fq.gz
│ └── ERR1588781_2.fq.gz
└── ERR1588785
├── ERR1588785_1.fastq.gz
└── ERR1588785_2.fastq.gz
Filenames are parsed and a sample name is extracted for each pair (if paired end). This is simply done by splitting on the _ symbol by default. So a file called /path/13-11594_S85_L001-4_R1_001.fastq.gz will be given a sample name 13-11594. As long as the sample names are unique this is ok. If you had a file names like A_2_L001-4_R1_001, A_3_L001-4_R1_001 you should split on '-' instead. You can specify this in the labelsep option. The workflow won't run unless sample names are unique.
You can use a manifest file (-M) with the sample names and files in a table if parsing folders won't work for you. This could be useful if your files have non-unique names but are in different subfolders. This overrides the input option. The format of the file is below. You should give the full path of each file. Sample names have to be unique and you should provide an entry for all the samples you want to run.
sample,filename1,filename2
S1,/folder/S1/cleaned_1.fastq.gz,/folder/S1/cleaned_2.fastq.gz
S2,/folder/S2/cleaned_1.fastq.gz,/folder/S2/cleaned_2.fastq.gz
S3,/folder/S3/cleaned_1.fastq.gz,/folder/S3/cleaned_2.fastq.gz
These files will be saved to the output folder when the workflow is finished.
raw.bcf - unfiltered output from bcftools mpileup, not overwritten by default
calls.vcf - unfiltered variant calls
filtered.vcf.gz - filtered vcf from all variant calls
snps.vcf.gz - snps only calls, used to make the core alignment
indels.vcf.gz - indels only, made from filtered calls
core.fa - fasta alignment from core snps, can be used to make a phylogeny
core.txt - text table of core snps
csq.tsv - consequence calls (if genbank provided)
csq_indels.tsv - consequence calls for indels
csq.matrix - matrix of consequence calls
snpdist.csv - comma separated distance matrix using snps
samples.csv - summary table of samples
RAxML_bipartitions.variants - ML tree if RAxML was used, optional
tree.newick - tree with SNPs branch lengths, if RAxMl used