DRYAD_README.md (48 additions & 1 deletion)
@@ -51,6 +51,41 @@ Each admixed American individual (`{amr_ind}`) has two corresponding BED files,
- The BED files for the intersecting region of interest can be found [here](https://github.com/David-Peede/MUC19/tree/main/amr_lai/region_beds).
- Column 10: Number of base pairs of overlap with respect to the intersecting region of interest.
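The overlap count in column 10 can be illustrated with a short calculation. This is a minimal sketch, assuming standard BED conventions (0-based, half-open intervals on the same chromosome); the function name and the coordinates are hypothetical and are not taken from the repository scripts.

```python
def bed_overlap_bp(start_a, end_a, start_b, end_b):
    """Base pairs shared by two 0-based, half-open BED intervals on one chromosome."""
    return max(0, min(end_a, end_b) - max(start_a, start_b))


# Hypothetical coordinates: an ancestry tract and a region of interest.
tract = (1_000, 5_000)
region = (4_000, 9_000)
print(bed_overlap_bp(*tract, *region))
```

Intervals that do not intersect yield 0, so the same function can be applied to every tract without filtering first.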
+
### `annotations`
+
+
#### `annotations.tar.gz`
+
`annotations.tar.gz` contains the output from `data_processing_v_revisions.ipynb` found [here](https://github.com/David-Peede/MUC19/blob/main/analyses_nbs/data_processing_v_revisions.ipynb) and the outputs from `consolidate_tgp_single_archaic_genes_v_revisions.py`, which can be found [here](https://github.com/David-Peede/MUC19/tree/main/annotations).
- `DEN`: Effective sequence length with respect to the Altai Denisovan.
+
- `S`: Number of segregating sites.
+
- `QC`: QC condition (numeric, e.g., 1 if true, 0 if false).
+
+
**Note**: If only column headers are present in the invariant file (`_invariant_`), no invariant windows passed initial QC.
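The header-only case described in the note can be detected before a table is loaded for analysis. This is a minimal sketch, assuming a tab-delimited text file with a single header row; the function name, the delimiter, and the `CHR`, `START`, and `STOP` column names are illustrative assumptions rather than details from the repository.

```python
import csv
import io


def has_data_rows(handle, delimiter="\t"):
    """Return True if the table has at least one non-empty row after the header."""
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader, None)  # skip the header row, if there is one
    return any(any(field.strip() for field in row) for row in reader)


# Hypothetical contents: a header-only invariant file and a populated one.
header_only = io.StringIO("CHR\tSTART\tSTOP\tDEN\tS\tQC\n")
populated = io.StringIO("CHR\tSTART\tSTOP\tDEN\tS\tQC\n12\t0\t72000\t71000\t5\t1\n")
print(has_data_rows(header_only))
print(has_data_rows(populated))
```

For files on disk, the same function accepts any open text handle, for example one returned by `open(path, newline="")`.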
+
### `arc_snp_density`
#### `classify_tgp_snps_chromosome.tar.gz`
@@ -533,15 +568,17 @@ For each late Neanderthal (`{cha,vin}`), there is one corresponding file per aut
For each autosome (`chr{1..22}`), there is one corresponding file. This file contains one row, where each column corresponds to the average number of pairwise differences between all African chromosomes in the 1000 Genomes Project and the Altai Denisovan's two chromosomes per non-overlapping 72kb window.
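The per-window statistic described above can be illustrated on toy data. This is a minimal sketch, not the repository's implementation: it assumes biallelic sites coded 0/1, no missing data, 1-based positions, and windows that start at coordinate 0. All names and values are hypothetical, and the actual scripts may handle unphased archaic genotypes and effective sequence length differently.

```python
from itertools import product


def avg_pairwise_diffs(group_a, group_b):
    """Average number of differing sites over all (a, b) chromosome pairs."""
    pairs = list(product(group_a, group_b))
    total = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs)
    return total / len(pairs)


def windowed_avg_pairwise_diffs(positions, group_a, group_b, window=72_000):
    """Map each window start to the average pairwise differences in that window."""
    site_idx = {}
    for i, pos in enumerate(positions):
        start = ((pos - 1) // window) * window  # 1-based positions are assumed
        site_idx.setdefault(start, []).append(i)
    return {
        start: avg_pairwise_diffs(
            [[hap[i] for i in idx] for hap in group_a],
            [[hap[i] for i in idx] for hap in group_b],
        )
        for start, idx in sorted(site_idx.items())
    }


# Toy data: 3 "African" chromosomes, 2 "Denisovan" chromosomes, 4 sites.
positions = [100, 50_000, 80_000, 143_000]
afr = [[0, 0, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
den = [[1, 0, 1, 1], [1, 0, 0, 1]]
print(windowed_avg_pairwise_diffs(positions, afr, den))
```

Windows that contain no sites are simply absent from the returned dictionary in this sketch.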
### `vcf_data`
-
Since all of the VCF files and bookkeeping information are well over 3TB, we provide the converted Zarr arrays, which are the output of `vcf_to_zarr.py` found [here](https://github.com/David-Peede/MUC19/tree/main/vcf_data). Additionally, in the following `windowing` subsubsection, we provide the corresponding bookkeeping files used in our analyses. The `{prefix}.tar.gz` files are named by the dataset (`{prefix}`) and, unless otherwise noted, contain the corresponding Zarr arrays for all autosomes (`chr{1..22}`).
+
Since ALL of the VCF files and bookkeeping information are well over 3TB, we provide the converted Zarr arrays, which are the output of `vcf_to_zarr_v_revisions.py` found [here](https://github.com/David-Peede/MUC19/tree/main/vcf_data), as 99.99% of analyses are performed using these data. We also include the final filtered variant VCF files used to generate these Zarr arrays. Additionally, in the following `windowing` subsubsection, we provide the corresponding bookkeeping files used in our analyses and include the raw bookkeeping files for the sake of completeness, which are located in `bookkeeping.tar.gz`. Note that all VCF files are formatted in accordance with the [VCF specification](https://samtools.github.io/hts-specs/VCFv4.2.pdf), and the format of the raw bookkeeping files is described in the VCF processing scripts found [here](https://github.com/David-Peede/MUC19/tree/main/vcf_data). The `{prefix}.tar.gz` files are named by the dataset (`{prefix}`) and, unless otherwise noted, contain the corresponding Zarr arrays for all autosomes (`chr{1..22}`).
**Datasets & Paths**
- `arcs_masked_no_aa.tar.gz`
- All archaic individuals without ancestral allele calls. Individuals are in the following order: Altai Neanderthal, Chagyrskaya Neanderthal, Vindija Neanderthal, Altai Denisovan.
- All archaic individuals with ancestral allele calls. Individuals are in the following order: Altai Neanderthal, Chagyrskaya Neanderthal, Vindija Neanderthal, Altai Denisovan, and a placeholder individual "Ancestor" who is always homozygous for the ancestral allele.
- 1000 Genomes Project and all archaic individuals without ancestral allele calls. Individuals are in the following order: 1000 Genomes Project individuals (order as in [`./muc19/meta_data/tgp_mod.txt`](https://github.com/David-Peede/MUC19/tree/main/meta_data)), Altai Neanderthal, Chagyrskaya Neanderthal, Vindija Neanderthal, Altai Denisovan.
- `{den}`: 1000 Genomes Project (order as in [`./muc19/meta_data/tgp_mod.txt`](https://github.com/David-Peede/MUC19/tree/main/meta_data)) and the Altai Denisovan, without ancestral allele calls.
- `{vin}`: 1000 Genomes Project (order as in [`./muc19/meta_data/tgp_mod.txt`](https://github.com/David-Peede/MUC19/tree/main/meta_data)) and the Vindija Neanderthal, without ancestral allele calls.
- `{den}`: 1000 Genomes Project (order as in [`./muc19/meta_data/tgp_mod.txt`](https://github.com/David-Peede/MUC19/tree/main/meta_data)), the Altai Denisovan, and a placeholder individual "Ancestor" who is always homozygous for the ancestral allele.
- `{vin}`: 1000 Genomes Project (order as in [`./muc19/meta_data/tgp_mod.txt`](https://github.com/David-Peede/MUC19/tree/main/meta_data)), the Vindija Neanderthal, and a placeholder individual "Ancestor" who is always homozygous for the ancestral allele.
- Simons Genome Diversity Project and all archaic individuals without ancestral allele calls. Individuals are in the following order: Simons Genome Diversity Project individuals (order as in [`./muc19/meta_data/sgdp.txt`](https://github.com/David-Peede/MUC19/tree/main/meta_data)), Altai Neanderthal, Chagyrskaya Neanderthal, Vindija Neanderthal, Altai Denisovan.
- `{den}`: Simons Genome Diversity Project (order as in [`./muc19/meta_data/sgdp.txt`](https://github.com/David-Peede/MUC19/tree/main/meta_data)) and the Altai Denisovan, without ancestral allele calls.
- `{vin}`: Simons Genome Diversity Project (order as in [`./muc19/meta_data/sgdp.txt`](https://github.com/David-Peede/MUC19/tree/main/meta_data)) and the Vindija Neanderthal, without ancestral allele calls.
- 1000 Genomes Project without ancestral allele calls. Individuals are in the same order as they appear in [`./muc19/meta_data/tgp_mod.txt`](https://github.com/David-Peede/MUC19/tree/main/meta_data).
- 1000 Genomes Project with ancestral allele calls. Individuals are in the same order as they appear in [`./muc19/meta_data/tgp_mod.txt`](https://github.com/David-Peede/MUC19/tree/main/meta_data), and a placeholder individual "Ancestor" who is always homozygous for the ancestral allele.
- Corresponding output from `BEAGLE` before read-based phasing for the focal 72kb region, without ancestral allele calls. Individuals are in the following order: Altai Neanderthal, Chagyrskaya Neanderthal, Altai Denisovan.
- Note that this is a single Zarr array for the focal 72kb region.
+
- The input VCF file can be found [here](https://github.com/David-Peede/MUC19/tree/main/vcf_data/phasing).
- `vin_phased_ref_panel_all_inds.tar.gz`
- Corresponding output from `BEAGLE` before read-based phasing for the focal 72kb region, without ancestral allele calls. Individuals are in the following order: Altai Neanderthal, Vindija Neanderthal, Altai Denisovan.
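The orderings listed for the datasets above fix the position of each individual along the sample axis of an array. This is a minimal sketch of turning such an ordering into a label-to-index lookup. The function name, the labels, and the modern-human IDs are hypothetical, and the sketch assumes the stored sample axis matches the documented order exactly.

```python
ARCHAICS = [
    "Altai Neanderthal",
    "Chagyrskaya Neanderthal",
    "Vindija Neanderthal",
    "Altai Denisovan",
]


def sample_order(modern_ids=(), archaics=ARCHAICS, with_ancestor=False):
    """Return {label: sample-axis index} for modern IDs, then archaics, then Ancestor."""
    labels = list(modern_ids) + list(archaics)
    if with_ancestor:
        labels.append("Ancestor")  # placeholder, always homozygous ancestral
    return {label: i for i, label in enumerate(labels)}


# Hypothetical modern sample IDs standing in for the rows of a meta data file.
order = sample_order(["S1", "S2"], with_ancestor=True)
print(order["Altai Denisovan"], order["Ancestor"])
```

For a single-archaic dataset such as a `{den}` or `{vin}` file, the `archaics` argument would be a one-element list.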
README.md (8 additions & 2 deletions)
@@ -1,4 +1,4 @@
-
# The *MUC19* gene in Denisovans, Neanderthals, and Modern Humans: An Evolutionary History of Recurrent Introgression and Natural Selection
+
# The *MUC19* Gene: An Evolutionary History of Recurrent Introgression and Natural Selection
Code associated with Villanea and Peede et al. 202X.
@@ -12,9 +12,11 @@ Code associated with Villanea and Peede et al. 202X.
│ ├── dataframes
│ └── supp_tables
├── ancient_americans
+
├── annotations
├── arc_snp_density
├── figure_nbs
├── heterozygosity
+
├── hg38_data
├── hmmix_tracts
├── iHS
├── introgression
@@ -26,6 +28,7 @@ Code associated with Villanea and Peede et al. 202X.
├── psuedo_ancestry_painting
├── sequence_divergence
├── vcf_data
+
│ ├── ann_summary
│ ├── muc19
│ └── phasing
├── vntr
@@ -36,8 +39,10 @@ Code associated with Villanea and Peede et al. 202X.
- `amr_lai`
- `ancient_americans`
+
- `annotations`
- `arc_snp_density`
- `heterozygosity`
+
- `hg38_data`
- `hmmix_tracts`
- `iHS`
- `introgression`
@@ -53,7 +58,8 @@ Code associated with Villanea and Peede et al. 202X.
- `analyses_nbs`
- `figure_nbs`
+
- `meta_data`
### Notes
-
As this code is associated with a manuscript that is currently going through the peer-review process and due to the fact that GitHub only stores up to 100 Mb per repository some directories are compressed using `.tar.gz` or `.zip`. To extract the files those directories you will need to run either `tar -xf target_directory.tar.gz` or `unzip target_directory.zip`. Once the associated manuscript has been peer-reviewed all output files will be uploaded to the appropriate locations.
+
This repository contains all code, meta information, and final results. Key datasets of interest have been uploaded to Zenodo (see `./ZENODO_README.md`). All of the intermediary data used to generate the final set of results in `./analyses_nbs/dataframes` have been uploaded to DRYAD (see `./DRYAD_README.md`). Because this code is associated with a manuscript that is currently in peer review, and because GitHub only stores up to 100 MB per repository, some directories are compressed using `.tar.gz` or `.zip`. To extract the files from those directories, run either `tar -xf target_directory.tar.gz` or `unzip target_directory.zip`.
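The extraction step can also be scripted when many compressed directories need to be unpacked. This is a minimal sketch using only the Python standard library. The function name is hypothetical and is not part of the repository, and the sketch assumes the archives come from a trusted source, since it extracts members without inspecting their paths.

```python
import tarfile
import zipfile
from pathlib import Path


def extract_archive(archive, dest="."):
    """Extract a .tar.gz or .zip archive into dest and return the destination path."""
    archive, dest = Path(archive), Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    if archive.name.endswith((".tar.gz", ".tgz")):
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(dest)
    elif archive.suffix == ".zip":
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
    else:
        raise ValueError(f"unsupported archive type: {archive.name}")
    return dest
```

Calling `extract_archive("target_directory.tar.gz")` would mirror `tar -xf target_directory.tar.gz` in the current working directory.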