
Commit d0804a2

committed
Updated the documentation.
1 parent ffef649 commit d0804a2

1 file changed

+19
-18
lines changed


TestData/README.md

Lines changed: 19 additions & 18 deletions
@@ -1,51 +1,52 @@
1-
#Test Data for *Structure_threader*
1+
# Test Data for *Structure_threader*
22

33
In this directory you will find the data that was used to benchmark *Structure_threader*.
44

5-
##Contents (in alphabetical order):
5+
## Contents (in alphabetical order):
66

7-
* Chr1.str.tar.xz
8-
* Chr22.str.tar.xz
9-
* benchmark.sh
7+
* BigTestData.str.tar.xz
108
* extraparams
119
* joblist.txt
1210
* mainparams
1311
* TestData.structure
1412

15-
###Chr1.str.tar.xz
13+
### BigTestData.str.tar.xz
1614

17-
This file is a fastStructure formatted input file which was used to benchmark fastStructure. This is a **huge** SNP file (16854 SNPs) which was obtained from the [1000 genomes project](http://www.1000genomes.org). The file was downloaded from [here](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz), and was then filtered using vcftools with the following criteria:
15+
This file is a fastStructure-formatted input file that was used to benchmark fastStructure. It is a large SNP file (604 SNPs) obtained from the [1000 genomes project](http://www.1000genomes.org). The source file was downloaded from [chromosome 22](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz) and then filtered using vcftools with the following criteria:
1816

1917
* only biallelic, non-singleton SNV sites
2018
* SNVs must be at least 2 kb apart from each other
2119
* minor allele frequency >= 0.05
2220

2321
The command used was:
2422

25-
./vcftools --gzvcf
26-
ALL.chr1.phase3_shapeit2_mvncall_integrated_v4.20130502.genotypes.vcf.gz
27-
--maf 0.05 --thin 2000 --min-alleles 2 --max-alleles 2 --non-ref-ac 2 --recode --chr 1 --out Chr1
23+
./vcftools --gzvcf \
24+
ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
25+
--maf 0.05 --thin 2000 --min-alleles 2 --max-alleles 2 --non-ref-ac 2 \
26+
--recode --chr 22 --out Chr22
2827

2928
These were the criteria used in the *admixture* [analysis of the 1000 genomes project](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/admixture_files/README.admixture_20141217).
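
As a rough illustration of what the `--maf 0.05` and `--thin 2000` options select, the sketch below applies the same two rules to a made-up list of positions and minor allele frequencies using `awk`. The toy data, the file names and the order in which the two rules are applied are assumptions made for illustration; this is not how vcftools is implemented, and none of it is part of the benchmark data.

```shell
# Hypothetical toy input: one site per line, "position minor_allele_frequency".
workdir=$(mktemp -d)
printf '%s\n' '100 0.20' '900 0.30' '2500 0.01' '3000 0.10' '5200 0.07' \
    > "$workdir/sites.txt"

# Keep a site only if its frequency is at least `maf` and it lies at least
# `gap` bases after the previously kept site.
awk -v maf=0.05 -v gap=2000 '
    $2 + 0 < maf + 0 { next }                 # too rare: skip
    !seen || $1 - last >= gap {               # far enough from the last kept site
        print $1; last = $1; seen = 1
    }
' "$workdir/sites.txt" > "$workdir/kept.txt"

cat "$workdir/kept.txt"
```

With this toy input, the site at 900 is dropped for being too close to the site at 100, and the site at 2500 is dropped for its low frequency.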
3029

31-
The file was then converted to structure format with [PGDSpider](http://www.cmpg.unibe.ch/software/PGDSpider/), and compressed with xz.
30+
The file was then converted to structure format with [PGDSpider](http://www.cmpg.unibe.ch/software/PGDSpider/).
31+
To further reduce the dataset (for faster benchmarking), the file was then processed with `cut` and `head` and finally compressed with xz.
3232

33-
###Chr22.str.tar.xz
33+
The commands used were:
3434

35-
This file is similar to the one above, but it is from [chromossome 22](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz) instead of chromossome 1. As such it contains fewer SNPs (2719).
35+
cut -d " " -f 1-604 BigData.str > BigData604SNPs.str
36+
head -n 1002 BigData604SNPs.str > BigTestData.str
37+
tar cvfJ BigTestData.str.tar.xz BigTestData.str
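
To make the effect of the `cut` and `head` steps above easier to see, here is a minimal sketch that runs the same two commands on a tiny made-up file. The file names and contents are hypothetical stand-ins, and the column and row counts are shrunk (4 columns and 3 rows instead of 604 and 1002); the compression step is left out.

```shell
# Hypothetical stand-in for BigData.str: 4 rows, 6 space-separated columns.
workdir=$(mktemp -d)
printf '%s\n' 'a 1 1 2 2 1' 'b 1 2 2 1 1' 'c 2 1 1 2 2' 'd 2 2 1 1 2' \
    > "$workdir/toy.str"

# Keep only the first 4 space-separated columns of every row.
cut -d " " -f 1-4 "$workdir/toy.str" > "$workdir/toy_cols.str"

# Keep only the first 3 rows of the column-trimmed file.
head -n 3 "$workdir/toy_cols.str" > "$workdir/toy_small.str"

cat "$workdir/toy_small.str"
```

Note that `cut -f 1-604` counts every space-separated field, so any leading non-genotype columns in the input also count towards the 604.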
3638

37-
The file processing was done in the same way as for Chr1.
3839

39-
###extraparams and mainparams
40+
### extraparams and mainparams
4041

4142
The STRUCTURE parameter files that were used in the benchmarking process.
4243

43-
###joblist.txt
44+
### joblist.txt
4445

4546
The joblist used to benchmark *ParallelStructure*. It consists of 16 jobs: 4 values of "K" with 4 replicates each.
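
For readers who want to build a comparable job list, the sketch below generates 16 lines from 4 values of K and 4 replicates each. Everything specific in it is an assumption: the column layout (job ID, population list, K, burn-in, iterations), the K values 1 to 4, the population list and the run lengths are placeholders, and the actual `joblist.txt` in this directory remains the reference.

```shell
# Hypothetical values; adjust to match the real joblist.txt.
workdir=$(mktemp -d)
pops="1,2,3"      # placeholder population list
burnin=10000      # placeholder burn-in length
iters=50000       # placeholder number of iterations

job=0
for k in 1 2 3 4; do            # 4 values of K (assumed to be 1..4)
    for rep in 1 2 3 4; do      # 4 replicates per K
        job=$((job + 1))
        printf 'T%d %s %d %d %d\n' "$job" "$pops" "$k" "$burnin" "$iters"
    done
done > "$workdir/joblist_sketch.txt"

cat "$workdir/joblist_sketch.txt"
```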
4647

47-
###TestData.structure
48+
### TestData.structure
4849

4950
This is the datafile itself that was used in the benchmarking process.
5051
It contains 83 individuals, divided into 17 populations, genotyped at 29 SNP loci.
51-
There is aproximately 13% missing data in the file.
52+
There is approximately 13% missing data in the file.
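
A missing-data percentage like the one quoted above can be estimated with a short `awk` pass. The sketch below does this on a tiny made-up file; it assumes that the first two columns are the individual label and the population, and that missing genotypes are coded as `-9`. Both are assumptions here: the real layout and missing-data code are whatever `mainparams` specifies.

```shell
# Hypothetical toy file: label, population, then 4 genotype columns.
workdir=$(mktemp -d)
printf '%s\n' 'ind1 1 1 2 -9 1' 'ind1 1 1 2 2 1' \
              'ind2 1 -9 2 1 1' 'ind2 1 1 1 1 2' > "$workdir/toy.structure"

# Count genotype cells (columns 3 onward) and how many equal the missing code.
awk -v miss=-9 '
    { for (i = 3; i <= NF; i++) { total++; if ($i == miss) missing++ } }
    END { printf "%.1f\n", 100 * missing / total }
' "$workdir/toy.structure" > "$workdir/missing_pct.txt"

cat "$workdir/missing_pct.txt"
```

In the toy file, 2 of the 16 genotype cells are missing, so the script reports 12.5.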
