README.md
11 additions & 7 deletions
@@ -29,17 +29,21 @@ Options:
 
 ### Inputs and outputs
 
-The software expects a simple table in CSV format (possibly extracted from a VCF file) containing one genetic variants in each row as an input. There must be four columns:
+The program takes as input a table in CSV format (possibly derived from a VCF file) where each row represents a single genetic variant. The input table must contain four columns:
 
-- `sample` (a string): a unique identifier for the group of variants for the pairwise comparison.
-- `position` (an integer): the start site of the variant.
+- `sample` (a string): a unique identifier for the group of variants used in pairwise comparisons.
+- `position` (an integer): the site of the variant.
 - `sequence` (a string): the sequence of the variant (i.e. the alternate allele).
-- `frequency` (a real number from 0 to 1): the relative frequency of the variant in the `sample`.
+- `frequency` (a real number from 0 to 1): the relative frequency of the variant within the sample.
 
-It also expects the reference sequence used for variant calling in FASTA format. As a result, a table in CSV format is produced. This table contains three columns:
+In addition to the variant table, the program requires a reference sequence in FASTA format. The sequence should be the same one used for variant calling. This reference is used to infer the frequencies of reference alleles, assuming that any frequency not taken up by listed variants belongs to the reference allele at that site. In addition to the pairwise distance between samples, the distance between each sample and the reference sequence is also calculated by building a reference sample as a baseline with no variant alleles (i.e. all sites are assumed to have an allele frequency of 1).
 
-- `sample_m` and `sample_n` (strings): the identifiers of the two samples subject to the distance calculation, taken from the input `sample` column.
-- `distance` (a real number): the calculated pairwise distance between the samples.
+The distance of each sample is calculated against the reference as well, treating it as a normal sample with no allele variants (all reference allele frequencies are fixed within the reference virtual sample).
+
+As a result, a table in CSV format is produced. This table contains three columns:
+
+- `sample_m` and `sample_n` (strings): the identifiers of the two samples being compared.
+- `distance` (a real number): the calculated pairwise distance between the two samples.
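
The reference-allele inference described in the new README text can be illustrated with a minimal Python sketch. It is not the program's actual code: the function names `read_variants` and `site_frequencies`, the presence of a CSV header row, 1-based positions, and single-nucleotide variants are all assumptions made for illustration. The distance metric is not specified in this diff, so it is left out.

```python
import csv
import io
from collections import defaultdict


def read_variants(handle):
    """Parse the four-column table into {sample: {position: {sequence: frequency}}}.

    Assumption: the CSV has a header row naming the four columns.
    """
    table = defaultdict(lambda: defaultdict(dict))
    for row in csv.DictReader(handle):
        position = int(row["position"])
        table[row["sample"]][position][row["sequence"]] = float(row["frequency"])
    return table


def site_frequencies(sample_variants, reference, position):
    """Return allele frequencies at one site, including the inferred reference allele.

    Any frequency not taken up by listed variants is assigned to the
    reference allele. Assumption: positions are 1-based, variants are SNVs.
    """
    alleles = dict(sample_variants.get(position, {}))
    ref_allele = reference[position - 1]
    remainder = max(1.0 - sum(alleles.values()), 0.0)
    alleles[ref_allele] = alleles.get(ref_allele, 0.0) + remainder
    return alleles


# Hypothetical input data, made up for this example.
EXAMPLE_CSV = """sample,position,sequence,frequency
s1,2,T,0.25
s1,4,A,0.5
s2,2,T,1.0
"""
reference = "ACGT"  # stands in for the sequence read from the FASTA file

variants = read_variants(io.StringIO(EXAMPLE_CSV))

# Sample s1 lists T at 0.25 for site 2, so the reference allele C gets the rest.
s1_site2 = site_frequencies(variants["s1"], reference, 2)

# The reference "virtual sample" has no variants: the reference allele is fixed.
ref_site2 = site_frequencies({}, reference, 2)
```

Under these assumptions, `s1_site2` holds both `T` and the inferred `C` frequency, while `ref_site2` contains only the reference allele with frequency 1, which matches the baseline sample described in the added paragraph.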