Commit dedb3f7

Update explanation of inputs
1 parent 52396a8 commit dedb3f7

1 file changed

+11
-7
lines changed

README.md

Lines changed: 11 additions & 7 deletions
@@ -29,17 +29,21 @@ Options:
 
 ### Inputs and outputs
 
-The software expects a simple table in CSV format (possibly extracted from a VCF file) containing one genetic variants in each row as an input. There must be four columns:
+The program takes as input a table in CSV format (possibly derived from a VCF file) where each row represents a single genetic variant. The input table must contain four columns:
 
-- `sample` (a string): a unique identifier for the group of variants for the pairwise comparison.
-- `position` (an integer): the start site of the variant.
+- `sample` (a string): a unique identifier for the group of variants used in pairwise comparisons.
+- `position` (an integer): the site of the variant.
 - `sequence` (a string): the sequence of the variant (i.e. the alternate allele).
-- `frequency` (a real number from 0 to 1): the relative frequency of the variant in the `sample`.
+- `frequency` (a real number from 0 to 1): the relative frequency of the variant within the sample.
 
-It also expects the reference sequence used for variant calling in FASTA format. As a result, a table in CSV format is produced. This table contains three columns:
+In addition to the variant table, the program requires a reference sequence in FASTA format. The sequence should be the same one used for variant calling. This reference is used to infer the frequencies of reference alleles, assuming that any frequency not taken up by listed variants belongs to the reference allele at that site. In addition to the pairwise distance between samples, the distance between each sample and the reference sequence is also calculated by building a reference sample as a baseline with no variant alleles (i.e. all sites are assumed to have an allele frequency of 1).
 
-- `sample_m` and `sample_n` (strings): the identifiers of the two samples subject to the distance calculation, taken from the input `sample` column.
-- `distance` (a real number): the calculated pairwise distance between the samples.
+The distance of each sample is calculated against the reference as well, treating it as a normal sample with no allele variants (all reference allele frequencies are fixed within the reference virtual sample).
+
+As a result, a table in CSV format is produced. This table contains three columns:
+
+- `sample_m` and `sample_n` (strings): the identifiers of the two samples being compared.
+- `distance` (a real number): the calculated pairwise distance between the two samples.
 
 ## Citation
 
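
The reference-allele inference described in the updated README text can be illustrated with a minimal Python sketch. It is not taken from the program's source: the function name `reference_frequencies`, the example data, and the clamping at zero are assumptions made for illustration. Only the four column names come from the README. For each sample and site, the sketch sums the listed variant frequencies and assigns the remainder to the reference allele.

```python
import csv
import io
from collections import defaultdict


def reference_frequencies(rows):
    """Return {(sample, position): inferred reference allele frequency}.

    Hypothetical helper: the remainder of 1 not taken up by the listed
    variants at a site is attributed to the reference allele.
    """
    totals = defaultdict(float)
    for row in rows:
        key = (row["sample"], int(row["position"]))
        totals[key] += float(row["frequency"])
    # Assumption: clamp at 0 so that rounding in the input cannot yield
    # a negative reference frequency.
    return {key: max(0.0, 1.0 - total) for key, total in totals.items()}


# Made-up input table using the four columns named in the README.
EXAMPLE_CSV = """sample,position,sequence,frequency
A,100,T,0.25
A,100,G,0.25
A,250,C,1.0
B,100,T,0.5
"""

rows = list(csv.DictReader(io.StringIO(EXAMPLE_CSV)))
ref_freqs = reference_frequencies(rows)

for (sample, position), freq in sorted(ref_freqs.items()):
    print(sample, position, freq)
```

Sites with no listed variant do not appear in the result. Under the README's description they would carry a reference allele frequency of 1, as would every site of the virtual reference sample.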
