If you have any questions or feature requests, please leave a detailed issue. If you use RIPTA in your work, please cite doi: https://doi.org/10.1101/2022.12.02.518851.
usage: site_patterns.py [-h] -y YAML_PATH
arguments:
-h, --help Show this help message and exit.
-y YAML_PATH, --yaml_path YAML_PATH
Path to the YAML configuration file.Currently, all parameters are required and must be specified in the configuration file; however, this will likely change once I implement non-site pattern statistics and sliding-window capabilities.
vcf_path: List of file paths to the VCF files to be analyzed.contig_length: List of contig (chromosome) lengths for each corresponding VCF file.block_size: Size of the blocks for the weighted-block Jackknife procedure (see theappendixdirectory for the exact equations used for the weighted-block Jackknife procedure written by Dr. Patterson).meta_data_path: Path to a tab-separated text file with two columns: Individuals and Population. The individual IDs must be identical to how they appear in all VCF headers.results_path_prefix: Path to write the results to i.e.,{results_path_prefix}_site_patterns.csv.log_path_prefix: Path to write all the errors and warnings to i.e.,{log_path_prefix}_config_errors.logand{log_path_prefix}_vcf_processing_warnings.log.use_raf: Boolean to denote filtering by the reference panel allele frequency (designed for analyzing imputed genomes).raf_threshold: Float to denote the reference panel allele frequency threshold for filtering (designed for analyzing imputed genomes).number_of_threads: Positive integer for the number of threads to use for parallelization; to use all available threads set this parameter to'max'.compute_site_patterns: Boolean to denote you want to compute site pattern statistics.site_pattern_config_path: Path to a tab-separated text file with eight columns: P1, P2, P3, P4, P1 Sampling, P2 Sampling, P3 Sampling, and P4 Sampling.- For the first four columns (P1, P2, P3, P4) specify the individual ID or population ID which appears in the meta data file.
- For the last four columns (P1 Sampling, P2 Sampling, P3 Sampling, P4 Sampling) specify one of the three implemented sampling methods:
pop_freq: To denote computing allele frequencies for a population of individuals.ind_freq: To denote computing allele frequencies for a single individual (i.e., 0 = homozygous ancestral, 0.5 = heterozygous, 1 = homozygous derived).ind_gp: To denote computing allele frequencies for an individual from genotype probabilities.
site_pattern_stats: List of site pattern statistics to compute currently the options are'D','Danc','D+','fhom','fanc', and/or'f+'.
Before computing any summary statistics, RIPTA will first validate the entire YAML configuration file and report all the detected errors before exiting. During summary statistic computation, RIPTA will log a warning for any position in the VCF file that is excluded from the analysis. Below is a list of all possible error and warning messages, their meanings, and suggested solutions.
'CONFIG ERROR: VCF file {VCF} not found!'- Indicates that no VCF file exists at the path specified by the
vcf_pathparameter. - Solution: Ensure that each VCF file path is correctly specified.
- Indicates that no VCF file exists at the path specified by the
'CONFIG ERROR: Meta data file {META_DATA} not found!'- Indicates that no meta data file exists at the path specified by the
meta_data_pathparameter. - Solution: Ensure the path for the meta data file is correctly specified.
- Indicates that no meta data file exists at the path specified by the
'CONFIG ERROR: Site pattern configuration file {SITE_PATTERN_CONFIG} not found!'- Indicates that no site pattern configuration file exists at the path specified by the
site_pattern_config_pathparameter. - Solution: Ensure the path for the site pattern configuration file is correctly specified.
- Indicates that no site pattern configuration file exists at the path specified by the
'CONFIG ERROR: The vcf_path ({N_VCFS}) and contig_length ({N_CONTIG_LENGTHS}) paramters must have the same number of entries!'- Indicates a mismatch between the number of VCF files specified by
vcf_pathand the number of contig lengths specified bycontig_lengthparameters. - Solution: Ensure that the configuration file specifies equal numbers of VCF files and contig lengths.
- Indicates a mismatch between the number of VCF files specified by
CONFIG ERROR: The Block size must be smaller than the contig lengths!- Indicates that the block size specified by the
block_sizeparameter is larger than the specified contig length. - Solution: Ensure that the block size is smaller than each specified contig length.
- Indicates that the block size specified by the
CONFIG ERROR: The RAF threshold must be between 0 and 1!- Indicates that the RAF threshold specified by the
raf_thresholdparameter is invalid. - Solution: Ensure that the RAF threshold is set to a value between 0 and 1.
- Indicates that the RAF threshold specified by the
CONFIG ERROR: The number of threads must be a positive integer or max!"- Indicates an invalid number of threads as specified by the
number_of_threadsparameter. - Solution: Ensure that the number of threads is a positive integer or
'max'to use all available threads.
- Indicates an invalid number of threads as specified by the
'CONFIG ERROR: The site_pattern_stats paramter must be specified when the compute_site_patterns parameters is set to True!'- Indicates missing site pattern statistics specified by the
site_pattern_statsparameter. - Solution: Ensure at least one valid site pattern statistic is specified.
- Indicates missing site pattern statistics specified by the
'CONFIG ERROR: Invalid site pattern statistics ({INVALID_STATS}) specified!'- Indicates that there are invalid site pattern statistics specified by the
site_pattern_statsparameter. - Solution: Refer to the
YAML Configuration File Parameterssection for valid site pattern statistics.
- Indicates that there are invalid site pattern statistics specified by the
'SITE PATTERN CONFIG ERROR: Line {LINE_NUMBER} has {N_COLUMNS} columns, expected 8!'- Indicates that a line in the site pattern configuration file has an incorrect number of columns.
- Solution: Refer to the
YAML Configuration File Parameterssection to correctly format the site pattern configuration file.
'SITE PATTERN CONFIG ERROR: Invalid sampling option "{SAMPLING_OPTION}" for P{X} "{POP_ID}" on line {LINE_NUMBER}, the sampling option must be either "pop_freq", "ind_freq", or "ind_gp"!'- Indicates an invalid sampling option in the site pattern configuration file.
- Solution: Refer to the
YAML Configuration File Parameterssection to correctly format the site pattern configuration file.
'SITE PATTERN CONFIG ERROR: Error reading site pattern configuration file - {ERROR}'- Indicates that the site pattern configuration file could not be read.
- Solution: Refer to the
YAML Configuration File Parameterssection to correctly format the site pattern configuration file.
'META DATA ERROR: Line {LINE_NUMBER} is invalid, expected at least 2 tab-separated columns!'- Indicates that a line in the meta data file has an incorrect number of columns.
- Solution: Refer to the
YAML Configuration File Parameterssection to correctly format the meta data file.
'META DATA ERROR: Error reading meta data file - {ERROR}'- Indicates that the meta data file could not be read.
- Solution: Refer to the
YAML Configuration File Parameterssection to correctly format the meta data file.
'META DATA ERROR: Required populations not found: {MISSING_POPULATIONS}!'- Indicates missing individuals for populations required for site pattern statistics.
- Solution: Ensure all individuals for all required populations are in the meta data file.
'META DATA ERROR: Required individuals not found: {MISSING_ INDIVIDUALS}!'- Indicates missing individuals in the metadata file for required site pattern statistics.
- Solution: Ensure all required individuals are in the meta data file.
'META DATA ERROR: No samples found in meta data file, please check the errors and ensure the file has at least 2 tab-separated columns where column 1 = individual ID and column 2 = population ID!'- Indicates that none of the required individuals or populations to compute the site pattern statistics were found in the meta data file.
- Solution: Refer to the
YAML Configuration File Parameterssection to correctly format the meta data file.
'VCF ERROR: No header line found in the VCF file {VCF} - (please ensure the VCF file is formatted properly, for help see sections 1.5-1.6 https://samtools.github.io/hts-specs/VCFv4.3.pdf)!'- Indicates that no header line was found in the VCF file.
- Solution: Ensure the VCF file includes a properly formatted header.
'VCF ERROR: Sample {MISSING_INDIVIDUAL} from population {POPULATION} not found in VCF header of {VCF}!'- Indicates that an individual required for computing site pattern statistics is missing from the VCF file header.
- Solution: Ensure that all required individuals are present in the VCF file header.
'VCF ERROR: No samples found for required population {POPULATION} in the VCF file {VCF}!'- Indicates that all required individuals from a population are missing in the VCF file header.
- Solution: Ensure that all required individuals are present in the VCF file header.
'VCF ERROR: Unable to open the VCF file {VCF} - {ERROR}'- Indicates that the VCF file could not be open.
- Solution: Ensure that the VCF file is not corrupted and if the VCF file ends with
.gzthat the VCF file is actually gzipped or bgzipped.
'VCF ERROR: Population mismatch between VCF files {VCF_X} and {VCF_Y} - please ensure that the header lines are consistent across all VCF files!'- Indicates that the two VCF files have different headers.
- Solution: Ensure that all VCF files have the same exact header.
'VCF ERROR: Sample mismatch in population {POPULATION} between VCF files {VCF_X} and {VCF_Y} - please ensure that the header lines are consistent across all VCF files!'- Indicates that the two VCF files have different headers.
- Solution: Ensure that all VCF files have the same exact header.
'CONFIG ERROR: Configuration file {YAML_PATH} not found!'- Indicates that no configuration file exists at the file path specified by the
YAML_PATHparameter. - Solution: Ensure the
YAML_PATHargument is set correctly.
- Indicates that no configuration file exists at the file path specified by the
'YAML ERROR: Error parsing YAML file - {ERROR}'- Indicates an issue in reading the
YAMLconfiguration file. - Solution: Ensure that the
YAMLconfiguration file file is correctly formatted.
- Indicates an issue in reading the
'CONFIG ERROR: Missing the following required parameters {MISSING_REQUIRED_PARAMS}!'- Indicates that not all required parameters were specified in the
YAMLconfiguration file. - Solution: Ensure that the
YAMLconfiguration file file is correctly formatted.
- Indicates that not all required parameters were specified in the
'LOGGING ERROR: Failed to write to log file {LOG_FILE} - {ERROR}'- Indicates an inability to write to the path specified by the
log_path_prefixparameter. - Solution: Ensure that
log_path_prefixspecifies a writable path.
- Indicates an inability to write to the path specified by the
CONFIG ERROR: Configuration file validation failed!- Indicates one or more errors were raised during validation.
- Solution: Correct all errors in the log file, then rerun
RIPTA.
'WARNING: Position {POSITION} on chromosome {CHROMOSOME} is not bi-allelic, continuing to the next site...'- Indicates the presence of a non-bi-allelic SNP.
'WARNING: Position {CURRENT_POSITION} on chromosome {CURRENT_CHROMOSOME} does not have the RAF flag in the INFO columns, continuing to the next site...- Indicates the absence of the RAF flag in the INFO field for a position when the
use_rafparameter is set toTrue.
- Indicates the absence of the RAF flag in the INFO field for a position when the
'WARNING: Position {POSITION} on chromosome {CHROMOSOME} does meet the RAF threshold, continuing to the next site...'- Indicates the position does not meet the specified RAF threshold when the
use_rafparameter is set toTrue.
- Indicates the position does not meet the specified RAF threshold when the
'WARNING: Position {POSITION} on chromosome {CHROMOSOME} has no GP field for {INDIVIDUAL}...'- Indicates missing GP field information at a position when the
ind_gpsampling option is specified.
- Indicates missing GP field information at a position when the
'WARNING: Position {POSITION} on chromosome {CHROMOSOME} has no genotype information for {INDIVIDUAL}...- Indicates missing genotype information for an individual at a specified position.
