Iterative SVD Completion method for ancestry-specific dimensionality reduction.
Run the method from the command line using the following command:

python3 iterative_svd_method.py params.txt

params.txt is the parameters file that is passed as input to the method. The following parameters can be specified in the parameters file:
- BEAGLE_OR_VCF (int): 1 if the genetic data file is a Beagle file, or 2 if it is a VCF file.
- BEAGLE_FILE (str): path to the Beagle file if the genetic data file is a Beagle file.
- VCF_FILE (str): path to the VCF file if the genetic data file is a VCF file.
- IS_MASKED (bool): True if an ancestry file is passed for ancestry-specific masking, or False otherwise.
- VIT_OR_FBK_OR_TSV (int): 1 if the ancestry file is a VIT file, 2 if it is an FBK file, or 3 if it is a TSV file.
- VIT_FILE (str): path to the VIT file if the ancestry file is a VIT file.
- FBK_FILE (str): path to the FBK file if the ancestry file is an FBK file.
- FB_OR_MSP (int): 1 if the TSV ancestry file is an FB file, or 2 if it is an MSP file.
- TSV_FILE (str): path to the TSV file if the ancestry file is a TSV file.
- NUM_ANCESTRIES (int): the total number of ancestries in the ancestry file.
- ANCESTRY (int): ancestry number of the ancestry for which dimensionality reduction is to be performed. Ancestry counter starts at 0 if the ancestry file is a TSV file, and starts at 1 if it is a VIT or an FBK file.
- PROB_THRESH (float): minimum probability threshold for a SNP to belong to an ancestry, if the ancestry file is an FBK file or an FB TSV file.
- AVERAGE_PARENTS (bool): True if the DNAs from the two parents are to be combined (averaged) for each individual, or False otherwise.
- START_RANK (int): smallest rank among the range of ranks, from which the best rank is to be chosen for Iterative SVD using cross-validation.
- END_RANK (int): largest rank among the range of ranks, from which the best rank is to be chosen for Iterative SVD using cross-validation.
- RANK (int): rank for Iterative SVD if no cross-validation is to be performed.
- CHOOSE_BEST (bool): True if cross-validation is to be performed to choose the best rank for Iterative SVD, or False otherwise.
- NUM_CORES (int): number of cores to use for performing cross-validation in parallel.
- IS_WEIGHTED (bool): True if weights are provided in the labels file, or False otherwise.
- LABELS_FILE (str): path to the labels file. It should be a TSV file where the first column has header indID and contains the individual IDs, and the second column has header label and contains the labels for all individuals. If IS_WEIGHTED is specified as True, then the file must have a third column that has header weight and contains the weights for all individuals.
  NOTE: Individuals with zero weight are removed. Negative weights are used to combine individuals and replace them with a single average individual. Provide a weight of -1 to the first set of individuals to be combined, -2 to the second set of individuals to be combined, and so on. Each set of individuals that is to be combined must have the same label.
- OUTPUT_FILE (str): path to the output file, to which the output of the run is written. It is a TSV file with 3 columns. The first column contains the individual IDs, and the second and third columns contain the ancestry-specific projections obtained after dimensionality reduction.
- SCATTERPLOT_FILE (str): path to the scatter plot file with .html extension. The scatter plot of the individuals is saved in this file.
- SAVE_MASKED_MATRIX (bool): True if the masked matrix is to be saved as a binary file, or False otherwise.
- MASKED_MATRIX_FILE (str): path to the masked matrix file. The masked matrix is saved in this file.
- SAVE_COMPLETED_MATRIX (bool): True if the completed matrix is to be saved as a binary file, or False otherwise.
- COMPLETED_MATRIX_FILE (str): path to the completed matrix file. The completed matrix is saved in this file.
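The weight rules for the labels file can be illustrated with a small sketch. This is not the code used by iterative_svd_method.py; the function name plan_individuals and the sample IDs and labels are hypothetical. It only shows how rows might be sorted into removed, kept, and to-be-combined groups. The description above does not say how positive weights are applied, so positive-weight rows are simply passed through.

```python
import csv
import io
from collections import defaultdict


def plan_individuals(tsv_text):
    """Sort rows of a weighted labels TSV (hypothetical helper).

    Returns (kept, combined):
      kept:     list of (indID, label, weight) for positive-weight rows.
      combined: dict mapping each negative weight to (label, [indIDs]).
    """
    kept = []
    groups = defaultdict(list)
    group_label = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        weight = float(row["weight"])
        if weight == 0:
            continue  # zero weight: the individual is removed
        if weight > 0:
            kept.append((row["indID"], row["label"], weight))
            continue
        # Negative weight: member of a set that is replaced by one average.
        label = group_label.setdefault(weight, row["label"])
        if label != row["label"]:
            raise ValueError(f"set {weight} mixes labels {label!r} and {row['label']!r}")
        groups[weight].append(row["indID"])
    combined = {w: (group_label[w], ids) for w, ids in groups.items()}
    return kept, combined


# Made-up sample data, for illustration only.
example = (
    "indID\tlabel\tweight\n"
    "A1\tpopX\t1\n"
    "A2\tpopX\t0\n"
    "B1\tpopY\t-1\n"
    "B2\tpopY\t-1\n"
    "C1\tpopZ\t-2\n"
)
kept, combined = plan_individuals(example)
print(kept)
print(combined)
```

Here A2 is dropped, A1 is kept, and B1 and B2 form one set to be averaged. A set whose members carry different labels raises an error, matching the requirement that each combined set share a single label.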
NOTE: The parameters file must contain all of the above parameters. Each line in the parameters file must have a parameter name followed by =, followed by the value for that parameter. A parameter that is not used in a given run can be set to any value compatible with its type.
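As an illustration of the name = value line format, the following is a minimal parsing sketch. It is an assumption about how such a file could be read, not the parser used by the method; parse_params and parse_value are hypothetical names, and the sample values (including the path data/example.vcf) are made up. The sample is abbreviated: a real parameters file must list every parameter.

```python
def parse_value(raw):
    """Convert a raw string to bool, int, or float, else leave it as str."""
    text = raw.strip()
    if text in ("True", "False"):
        return text == "True"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def parse_params(lines):
    """Parse 'NAME = value' lines into a dict, skipping blank lines."""
    params = {}
    for line in lines:
        if not line.strip():
            continue
        # Split at the first '=' only, so values may themselves contain '='.
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in line: {line!r}")
        params[name.strip()] = parse_value(value)
    return params


# Abbreviated, made-up sample; a real file lists all parameters.
example = """\
BEAGLE_OR_VCF = 2
VCF_FILE = data/example.vcf
IS_MASKED = True
PROB_THRESH = 0.9
RANK = 5
"""
params = parse_params(example.splitlines())
print(params)
```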
NOTE: There are 2 acceptable formats for SNP indices in the Beagle file:
- rsid: rs followed by the id (integer). For example, rs12345.
- position: chromosome number (integer) followed by _, followed by the position (integer). For example, 10_12345.
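The two SNP index formats can be told apart with a pair of regular expressions. The sketch below is only an illustration of the formats described above; parse_snp_index is a hypothetical helper and is not taken from the method's source.

```python
import re

# rsid: "rs" followed by an integer id, e.g. rs12345
RSID = re.compile(r"rs(\d+)")
# position: chromosome number, "_", then position, e.g. 10_12345
POSITION = re.compile(r"(\d+)_(\d+)")


def parse_snp_index(index):
    """Classify a SNP index string and return its integer parts."""
    match = RSID.fullmatch(index)
    if match:
        return ("rsid", int(match.group(1)))
    match = POSITION.fullmatch(index)
    if match:
        return ("position", int(match.group(1)), int(match.group(2)))
    raise ValueError(f"unrecognized SNP index: {index!r}")


print(parse_snp_index("rs12345"))
print(parse_snp_index("10_12345"))
```

Using fullmatch rather than match rejects indices with trailing characters, such as rs12345a.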
NOTICE: This software is available for use free of charge for academic research use only. Commercial users, for profit companies or consultants, and non-profit institutions not qualifying as "academic research" must contact the Stanford Office of Technology Licensing for a separate license. This applies to this repository directly and any other repository that includes source, executables, or git commands that pull/clone this repository as part of its function. Such repositories, whether ours or others, must include this notice. Academic users may fork this repository and modify and improve to suit their research needs, but also inherit these terms and must include a licensing notice to that effect.