6. Parameter List

Parameters

The following is a complete list of parameters that can be used when running getphylo:

Help and Logging

Help

-h, --help show this help message and exit

Provides a list of parameters and other information.

Logging

-l {CRITICAL,FATAL,ERROR,WARN,WARNING,INFO,DEBUG,NOTSET}, --logging {CRITICAL,FATAL,ERROR,WARN,WARNING,INFO,DEBUG,NOTSET} set the logging level (default: ERROR)

Set the logging level. Most users will only ever need the default setting (ERROR) or to set -l INFO for basic logging. DEBUG can also be used, but this is extremely verbose and is intended for development purposes only.
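
The level names accepted by -l match the names defined by Python's standard logging module, so the sketch below assumes getphylo passes the choice through to that module (an assumption, not something the help text states). It shows why the list contains apparent duplicates: FATAL and WARN are aliases of CRITICAL and WARNING. The helper names are hypothetical.

```python
import logging

# The names accepted by -l / --logging, copied from the help text above.
LEVEL_NAMES = ["CRITICAL", "FATAL", "ERROR", "WARN", "WARNING", "INFO", "DEBUG", "NOTSET"]


def numeric_level(name: str) -> int:
    """Return the numeric threshold Python's logging module assigns to a level name."""
    return getattr(logging, name)


def is_shown(message_level: str, configured_level: str) -> bool:
    """Simplified rule: a message is emitted when its level is at or above the configured one."""
    return numeric_level(message_level) >= numeric_level(configured_level)


# Print each accepted name next to its numeric threshold.
for name in LEVEL_NAMES:
    print(f"{name:<8} {numeric_level(name)}")
```

Under this rule, INFO messages are hidden at the default ERROR level and appear once -l INFO is set.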

Input and Output

Specify genbank files

-g GBKS, --gbks GBKS string indicating the genbank files to use in the phylogeny (default: *.gbk)

This points getphylo to the input files. Pass the pattern as a quoted string so that it reaches getphylo unchanged, e.g.: -g './input/*.gb'.
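
Before starting a run, it can help to check which files a pattern actually matches. The sketch below assumes the pattern follows ordinary glob wildcard rules, as the default *.gbk suggests; preview_inputs is a hypothetical helper and not part of getphylo.

```python
import glob


def preview_inputs(pattern: str) -> list[str]:
    """Return the files a glob pattern matches, sorted alphabetically."""
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    # The pattern from the example above; adjust it to your own input folder.
    matches = preview_inputs("./input/*.gb")
    print(f"{len(matches)} file(s) matched")
    for path in matches:
        print(path)
```

If the list is empty, the most likely cause is a mismatched file extension (for example .gb versus .gbk).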

Specify output directory

-o OUTPUT, --output OUTPUT a string designating the name of the folder to output the results (default: output)

This points getphylo to a specified output folder. By default results will be stored at ./output. getphylo will make a new directory if necessary.

Specify seed

-s SEED, --seed SEED path to a genbank file for the target organism (default: None)

This sets the 'seed genome'. By default, getphylo chooses the first genome alphabetically in the input file list. The seed genome only affects the analysis in two specific scenarios. Firstly, if you are analysing genomes of dramatically different sizes, it is best to choose the smallest genome as the seed: the starting list of possible marker genes will be shorter, which makes the run slightly faster. Secondly, if the --presence parameter is set lower than 100, the seed genome may affect the resulting tree. In this case, it is advisable to use the outgroup as the seed. However, it is also valuable to run the analysis more than once using different seeds to ensure there is no effect. See the section on the --presence parameter below for more details.
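
The two seed choices discussed above can be sketched as follows. This is an illustration only: both function names are hypothetical, sorting on the full path is an assumption about how "alphabetically" is applied, and file size is used as a rough stand-in for genome size.

```python
import os


def default_seed(paths: list[str]) -> str:
    """Mimic the documented default: the first genome in alphabetical order."""
    return sorted(paths)[0]


def smallest_seed(paths: list[str]) -> str:
    """Pick the smallest file, using file size as a rough proxy for genome size."""
    return min(paths, key=os.path.getsize)
```

The path returned by either function can then be passed to -s.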

Specify random seed number

-r RANDOM_SEED_NUMBER, --random-seed-number RANDOM_SEED_NUMBER integer to be used as a seed for randomising loci selection, random if left as None (default: None)

If a limit is set on the number of loci with --maxloci, getphylo will randomly select marker genes. The random seed parameter can be used to choose a custom seed for this shuffling event.
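
The effect of a fixed seed on a capped random selection can be sketched as below. This is a generic illustration using Python's random module, not getphylo's own code; select_loci and its arguments are hypothetical.

```python
import random
from typing import Optional


def select_loci(loci: list[str], max_loci: int, seed: Optional[int] = None) -> list[str]:
    """Shuffle a copy of the loci with a private RNG and keep at most max_loci."""
    rng = random.Random(seed)  # seed=None is seeded from system entropy, so runs differ
    shuffled = list(loci)      # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    return shuffled[:max_loci]


# Hypothetical locus names, for illustration only.
candidates = [f"locus_{i:03d}" for i in range(20)]
first = select_loci(candidates, max_loci=5, seed=42)
second = select_loci(candidates, max_loci=5, seed=42)
print(first == second)
```

With the same seed and the same input order, the two calls return the same five loci; with the seed left as None, repeated runs will generally differ.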

Specify annotation tag

-t TAG, --tag TAG string indicating the GenBank annotations to extract (default: locus_tag)

This defines the genbank annotations for the protein sequences that getphylo will extract. getphylo searches for all CDS features with the provided tag. If your data does not contain locus_tag annotations, another common tag to use is protein_id. Ensure all data is uniformly formatted when using getphylo!

Loci Thresholding Parameters

Specify the number of loci to find in the seed genome

-f FIND, --find FIND integer indicating the number of loci to find in the seed genome (default: -1)

For large genomes, runtime can be reduced by limiting the number of loci to search for in the seed genome. This is NOT RECOMMENDED.

Specify the maximum length of target loci

-max MAXLENGTH, --maxlength MAXLENGTH integer indicating the maximum length of loci to be included in the analysis (default: 2000)

Max length can be used to limit the maximum length of marker genes. This filter is intended to exclude longer genes with multiple domains (e.g. PKSs) that may confuse the analysis. This can be raised to include more loci.

Specify the minimum length of target loci

-min MINLENGTH, --minlength MINLENGTH integer indicating the minimum length of loci to be included in the analysis (default: 200)

Min length can be used to limit the minimum length of marker genes. This filter is intended to exclude short genes (e.g. pseudogenes) that may confuse the analysis. This can be lowered to include more loci.
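
Taken together, the minimum and maximum length settings act as a window on candidate loci. The sketch below illustrates the idea with the documented defaults. The unit of length is not stated in this section, so plain sequence length is used; treating both bounds as inclusive is also an assumption, and the function names are hypothetical.

```python
def within_length(sequence: str, min_length: int = 200, max_length: int = 2000) -> bool:
    """True when the sequence length lies inside the window (bounds assumed inclusive)."""
    return min_length <= len(sequence) <= max_length


def filter_by_length(loci: dict[str, str], min_length: int = 200, max_length: int = 2000) -> dict[str, str]:
    """Keep only the loci whose sequences pass the length window."""
    return {
        name: seq
        for name, seq in loci.items()
        if within_length(seq, min_length, max_length)
    }


# Made-up sequences of 150, 500 and 2500 characters.
example = {"short": "A" * 150, "typical": "A" * 500, "long": "A" * 2500}
print(list(filter_by_length(example)))
```

Only the 500-character entry passes with the default window. Raising max_length or lowering min_length widens the window and admits more loci.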

Specify the minimum number of target loci

-minl MINLOCI, --minloci MINLOCI minimum number of loci required to continue to alignment and tree building steps (default: 1)

The minimum number of marker genes required for the workflow to continue. This can be used to end runs early when only a limited number of marker genes are available.

Specify the maximum number of target loci

-maxl MAXLOCI, --maxloci MAXLOCI maximum number of loci to use in the alignment and tree building steps (default: 1000)

The maximum number of marker genes to include in the analysis. You may wish to limit the total number to improve the performance of the analysis. For example, when analysing closely related taxa it is possible to find thousands of hits; this will increase runtime significantly and may cause memory issues on some machines.

Specify the percentage of genomes a target locus must be present in

-p PRESENCE, --presence PRESENCE integer indicating the percentage of genomes each locus must be present in (default: 100)

The percentage of genomes the marker needs to be present in. Use with caution! This parameter is very useful when analysing distantly related strains, as there may be few markers available in all genomes. Lowering the percentage has two potential drawbacks. Firstly, it will introduce missing data into the alignment, which may decrease the quality of the resulting tree. Alignments should be checked by the user to assess quality. Secondly, it will make the list of markers dependent on the seed genome. This is not necessarily a problem, but it is advisable to check the output from multiple seeds in this instance.
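
The presence threshold can be pictured as a simple percentage cut-off per marker. The sketch below uses made-up genome and locus names and hypothetical function names; it is not getphylo's implementation, and treating the threshold as inclusive is an assumption.

```python
def presence_percent(hits: set[str], genomes: list[str]) -> float:
    """Percentage of the genomes in which a locus was found."""
    return 100.0 * len(hits & set(genomes)) / len(genomes)


def passes_presence(locus_hits: dict[str, set[str]], genomes: list[str], presence: int = 100) -> list[str]:
    """Return the loci found in at least `presence` percent of the genomes."""
    return [
        locus
        for locus, hits in locus_hits.items()
        if presence_percent(hits, genomes) >= presence
    ]


genomes = ["g1", "g2", "g3", "g4"]
locus_hits = {
    "locus_A": {"g1", "g2", "g3", "g4"},  # present in 4 of 4 genomes
    "locus_B": {"g1", "g2", "g3"},        # present in 3 of 4 genomes
    "locus_C": {"g1", "g2"},              # present in 2 of 4 genomes
}
print(passes_presence(locus_hits, genomes, presence=100))
print(passes_presence(locus_hits, genomes, presence=75))
```

At 100 only locus_A is kept. At 75, locus_B is kept as well, at the cost of one genome having no sequence for that marker in the alignment.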

Tree Building

Build trees from all alignments

-b BUILD_ALL, --build-all BUILD_ALL build phylogenetic trees for all loci, not just concatenated alignment (default: 0)

This instructs getphylo to run fasttree on all alignments. By default a tree is built from the combined alignment only. This parameter will increase the runtime significantly.

Handling Poorly Formatted Data

Ignore bad gene annotations

-ia IGNORE_BAD_ANNOTATIONS, --ignore-bad-annotations IGNORE_BAD_ANNOTATIONS ignore missing annotations - NOT RECOMMENDED (default: False)

Setting this parameter to TRUE will instruct getphylo to ignore CDSs that are missing annotations. A missing annotation error will occur if a CDS does not contain the annotation specified with the --tag parameter. In most cases all CDSs should contain a locus_tag or a protein_id. Missing annotations may suggest an error in the annotation pipeline. NOT RECOMMENDED!
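
The difference between failing on a missing annotation and skipping it can be sketched as below. Each CDS is represented here as a plain dictionary of qualifiers, which is a simplification for illustration; the function name, the data layout and the exception type are all hypothetical rather than getphylo's own.

```python
def extract_ids(features: list[dict], tag: str = "locus_tag", ignore_bad_annotations: bool = False) -> list[str]:
    """Collect the value of `tag` from each CDS, failing or skipping when it is absent."""
    ids = []
    for index, qualifiers in enumerate(features):
        if tag not in qualifiers:
            if ignore_bad_annotations:
                continue  # skip this CDS and carry on
            raise ValueError(f"CDS {index} has no '{tag}' annotation")
        ids.append(qualifiers[tag])
    return ids


# Made-up CDS qualifiers; the second entry lacks a locus_tag.
cds_features = [
    {"locus_tag": "ABC_0001", "protein_id": "P1"},
    {"protein_id": "P2"},
    {"locus_tag": "ABC_0003", "protein_id": "P3"},
]
print(extract_ids(cds_features, ignore_bad_annotations=True))
```

In this example, switching the tag to protein_id would avoid the problem without discarding any CDS, which is usually preferable to ignoring the error.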

Ignore bad records

-ir IGNORE_BAD_RECORDS, --ignore-bad-records IGNORE_BAD_RECORDS ignore poorly formatted records - NOT RECOMMENDED (default: False)

Setting this parameter to TRUE will instruct getphylo to ignore records that are missing CDS annotations. Records with missing annotations almost certainly result from a problem in the annotation pipeline. However, when pulling records from public datasets, data may be poorly annotated and it may save time to simply skip poorly formatted records. NOT RECOMMENDED!

Checkpointing

Set checkpoint

-cp {START,FASTA_EXTRACTED,DIAMOND_BUILT,SINGLETONS_IDENTIFIED,SINGLETONS_SEARCHED,SINGLETONS_THRESHOLDED,SINGLETONS_EXTRACTED,SINGLETONS_ALIGNED,ALIGNMENTS_COMBINED,TREES_BUILT,DONE}, --checkpoint {START,FASTA_EXTRACTED,DIAMOND_BUILT,SINGLETONS_IDENTIFIED,SINGLETONS_SEARCHED,SINGLETONS_THRESHOLDED,SINGLETONS_EXTRACTED,SINGLETONS_ALIGNED,ALIGNMENTS_COMBINED,TREES_BUILT,DONE} string indicating the checkpoint to start from (default: START)

START = default
FASTA_EXTRACTED = Skip extracting fasta sequences from genbank files
DIAMOND_BUILT = Skip building diamond databases
SINGLETONS_IDENTIFIED = Skip identifying singletons from the seed genome
SINGLETONS_SEARCHED = Skip searching singletons against other genomes
SINGLETONS_THRESHOLDED = Skip thresholding of singletons
SINGLETONS_EXTRACTED = Skip extracting fasta sequences for alignments
SINGLETONS_ALIGNED = Skip individual protein alignments
ALIGNMENTS_COMBINED = Skip combining alignments
TREES_BUILT = Skip building trees
DONE = Done

getphylo employs a checkpoint feature that can allow the analysis to be restarted from certain steps. Checkpoints can also be used to start the analysis from later steps in the workflow. A common usage of this command might be in cases where you have a set of protein sequences or alignments for analysis without the original genbank files. An example of how this can be achieved is shown in Case Study 4.
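
The checkpoint names form an ordered list of workflow stages. The sketch below copies the names and descriptions from the help text and assumes that starting from a checkpoint also skips every earlier step (the help text lists each skip individually, so the cumulative behaviour is an assumption). skipped_steps is a hypothetical helper.

```python
# Checkpoint names in workflow order, each paired with the step it marks as complete.
CHECKPOINTS = [
    ("START", None),
    ("FASTA_EXTRACTED", "extracting fasta sequences from genbank files"),
    ("DIAMOND_BUILT", "building diamond databases"),
    ("SINGLETONS_IDENTIFIED", "identifying singletons from the seed genome"),
    ("SINGLETONS_SEARCHED", "searching singletons against other genomes"),
    ("SINGLETONS_THRESHOLDED", "thresholding of singletons"),
    ("SINGLETONS_EXTRACTED", "extracting fasta sequences for alignments"),
    ("SINGLETONS_ALIGNED", "individual protein alignments"),
    ("ALIGNMENTS_COMBINED", "combining alignments"),
    ("TREES_BUILT", "building trees"),
    ("DONE", None),
]


def skipped_steps(checkpoint: str) -> list[str]:
    """List the steps assumed to be skipped when starting from `checkpoint`."""
    names = [name for name, _ in CHECKPOINTS]
    position = names.index(checkpoint)  # raises ValueError for an unknown name
    return [step for _, step in CHECKPOINTS[1:position + 1] if step is not None]


for step in skipped_steps("SINGLETONS_ALIGNED"):
    print("skip:", step)
```

Under this reading, starting from SINGLETONS_ALIGNED leaves only the combining of alignments and tree building to run, which fits the use case of supplying existing alignments.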

Performance

Set CPUs

-c CPUS, --cpus CPUS The number of cpus to use for parallelisation (default: 1)

getphylo parallelises certain steps, so using more CPUs can dramatically decrease run time. It is therefore recommended to use extra CPUs whenever possible.
