|
1 | | -#Test Data for *Structure_threader* |
| 1 | +# Test Data for *Structure_threader* |
2 | 2 |
|
3 | 3 | In this directory you will find the data that was used to benchmark *Structure_threader*. |
4 | 4 |
|
5 | | -##Contents (in alphabetical order): |
| 5 | +## Contents (in alphabetical order): |
6 | 6 |
|
7 | | -* Chr1.str.tar.xz |
8 | | -* Chr22.str.tar.xz |
9 | | -* benchmark.sh |
| 7 | +* BigTestData.str.tar.xz |
10 | 8 | * extraparams |
11 | 9 | * joblist.txt |
12 | 10 | * mainparams |
13 | 11 | * TestData.structure |
14 | 12 |
|
15 | | -###Chr1.str.tar.xz |
| 13 | +### BigTestData.str.tar.xz |
16 | 14 |
|
17 | | -This file is a fastStructure formatted input file which was used to benchmark fastStructure. This is a **huge** SNP file (16854 SNPs) which was obtained from the [1000 genomes project](http://www.1000genomes.org). The file was downloaded from [here](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz), and was then filtered using vcftools with the following criteria: |
| 15 | +This file is a fastStructure-formatted input file that was used to benchmark fastStructure. It is a large SNP file (604 SNPs) obtained from the [1000 genomes project](http://www.1000genomes.org). The file was downloaded from [chromosome 22](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz), and was then filtered using vcftools with the following criteria: |
18 | 16 |
|
19 | 17 | * only biallelic, non-singleton SNV sites |
20 | 18 | * SNVs must be at least 2 kb apart from each other |
21 | 19 | * minor allele frequency >= 0.05 |
22 | 20 |
|
23 | 21 | The command used was: |
24 | 22 |
|
25 | | -./vcftools --gzvcf |
26 | | -ALL.chr1.phase3_shapeit2_mvncall_integrated_v4.20130502.genotypes.vcf.gz |
27 | | ---maf 0.05 --thin 2000 --min-alleles 2 --max-alleles 2 --non-ref-ac 2 --recode --chr 1 --out Chr1 |
| 23 | + ./vcftools --gzvcf \ |
| 24 | + ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \ |
| 25 | + --maf 0.05 --thin 2000 --min-alleles 2 --max-alleles 2 --non-ref-ac 2 \ |
| 26 | + --recode --chr 22 --out Chr22 |
28 | 27 |
|
29 | 28 | These were the criteria used in the *admixture* [analysis of the 1000 genomes project](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/admixture_files/README.admixture_20141217). |
30 | 29 |
|
31 | | -The file was then converted to structure format with [PGDSpider](http://www.cmpg.unibe.ch/software/PGDSpider/), and compressed with xz. |
| 30 | +The file was then converted to structure format with [PGDSpider](http://www.cmpg.unibe.ch/software/PGDSpider/). |
| 31 | +To further reduce the dataset (for faster benchmarking), the file was then processed with `cut` and `head` and finally compressed with xz. |
32 | 32 |
|
33 | | -###Chr22.str.tar.xz |
| 33 | +The commands used were: |
34 | 34 |
|
35 | | -This file is similar to the one above, but it is from [chromossome 22](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz) instead of chromossome 1. As such it contains fewer SNPs (2719). |
| 35 | + cut -d " " -f 1-604 BigData.str > BigData604SNPs.str |
| 36 | + head -n 1002 BigData604SNPs.str > BigTestData.str |
| 37 | + tar cvfJ BigTestData.str.tar.xz BigTestData.str |
36 | 38 |
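
The `cut` and `head` steps above can be checked on a small scale. The following is a minimal sketch on a made-up toy file (hypothetical names and values, not the real dataset) that applies the same pattern and then counts the resulting rows and columns. Note that `cut -f` counts every column, including any leading label columns, so the number of SNP columns kept depends on the layout of the input file.

```shell
# Hypothetical toy input: 5 rows x 6 space-separated columns (not the real data).
tmpdir=$(mktemp -d)
for i in 1 2 3 4 5; do
    echo "ind$i 1 0 1 2 1"
done > "$tmpdir/toy.str"

# Same pattern as the commands above: keep the first 4 columns, then the first 3 rows.
cut -d " " -f 1-4 "$tmpdir/toy.str" > "$tmpdir/toy_cols.str"
head -n 3 "$tmpdir/toy_cols.str" > "$tmpdir/toy_small.str"

# Count the rows, and the number of columns on the first row.
n_rows=$(wc -l < "$tmpdir/toy_small.str" | tr -d ' ')
n_cols=$(head -n 1 "$tmpdir/toy_small.str" | awk '{ print NF }')
echo "rows=$n_rows cols=$n_cols"   # prints "rows=3 cols=4"
```

Running the same two counting lines on `BigTestData.str` would show the dimensions of the reduced benchmark file.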
|
37 | | -The file processing was done in the same way as for Chr1. |
38 | 39 |
|
39 | | -###extraparams and mainparams |
| 40 | +### extraparams and mainparams |
40 | 41 |
|
41 | 42 | These are the STRUCTURE parameter files that were used in the benchmarking process. |
42 | 43 |
|
43 | | -###joblist.txt |
| 44 | +### joblist.txt |
44 | 45 |
|
45 | 46 | The joblist used to benchmark *ParallelStructure*. It consists of 16 jobs: 4 values of "K" with 4 replicates each. |
46 | 47 |
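
As an illustration of the 4 x 4 layout described above, here is a rough sketch that generates a joblist of the same shape. Everything specific in it is an assumption rather than the content of the real `joblist.txt`: the K values (1 to 4), the population list (1 to 17), the burn-in and iteration counts, the job names, and the one-job-per-line layout (job name, populations, K, burn-in, iterations), which should be checked against the *ParallelStructure* documentation.

```shell
# Assumed placeholders: population IDs 1..17, K = 1..4, burn-in 10000, 50000 iterations.
pops="1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17"
joblist=$(mktemp)

job=0
for k in 1 2 3 4; do            # 4 values of K (assumed to be 1..4)
    for rep in 1 2 3 4; do      # 4 replicates per value of K
        job=$((job + 1))
        echo "T$job $pops $k 10000 50000"
    done
done > "$joblist"

# Count the generated jobs.
n_jobs=$(wc -l < "$joblist" | tr -d ' ')
echo "jobs=$n_jobs"   # prints "jobs=16"
```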
|
47 | | -###TestData.structure |
| 48 | +### TestData.structure |
48 | 49 |
|
49 | 50 | This is the datafile itself that was used in the benchmarking process. |
50 | 51 | It contains 83 individuals, divided into 17 populations, represented by 29 SNP loci. |
51 | | -There is aproximately 13% missing data in the file. |
| 52 | +There is approximately 13% missing data in the file. |
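
A missing-data percentage like the one quoted above can be estimated with a short `awk` pass. This is a minimal sketch on a made-up toy file, assuming two leading label columns (individual and population) and `-9` as the missing-data code. Both are assumptions here and should be matched to the values set in `mainparams`.

```shell
# Hypothetical toy data: 4 rows, 2 label columns, 4 genotype columns each.
toy=$(mktemp)
cat > "$toy" <<'EOF'
ind1 1 0 1 -9 1
ind1 1 0 1 1 1
ind2 2 -9 1 0 0
ind2 2 1 1 0 0
EOF

# Count genotype cells equal to the assumed missing code, skipping the label columns.
missing_pct=$(LC_ALL=C awk -v skip=2 -v miss=-9 '
    { for (i = skip + 1; i <= NF; i++) { total++; if ($i == miss) n++ } }
    END { printf "%.1f", 100 * n / total }
' "$toy")
echo "missing: ${missing_pct}%"   # prints "missing: 12.5%"
```

Here 2 of the 16 genotype cells are missing, which gives 12.5%.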