Commit 2cf1c2c

Everything is up-to-date and squeaky clean 🧼
1 parent 2688a96 commit 2cf1c2c

5 files changed

+57
-4
lines changed

.DS_Store

0 Bytes
Binary file not shown.

DRYAD_README.md

Lines changed: 48 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -51,6 +51,41 @@ Each admixed American individual (`{amr_ind}`) has two corresponding BED files,
5151
- The BED files for the intersecting region of interest can be found [here](https://github.com/David-Peede/MUC19/tree/main/amr_lai/region_beds).
5252
- Column 10: Number of base pairs of overlap with respect to the intersecting region of interest.
5353

54+
### `annotations`
55+
56+
#### `annotations.tar.gz`
57+
`annotations.tar.gz` contains the output from `data_processing_v_revisions.ipynb` found [here](https://github.com/David-Peede/MUC19/blob/main/analyses_nbs/data_processing_v_revisions.ipynb) and the outputs from `consolidate_tgp_single_archaic_genes_v_revisions.py`, which can be found [here](https://github.com/David-Peede/MUC19/tree/main/annotations).
58+
59+
**Paths**
60+
- `annotations.tar.gz`
61+
- `./muc19/annotations/hg19_genes/ncbi_refseq_genes_chr{1..22}.csv.gz`
62+
- `./muc19/annotations/hg19_genes/ncbi_refseq_transcripts.txt`
63+
- `./muc19/annotations/tgp_den_masked_no_aa/ncbi_refseq_{invariant,variant}_genes.csv.gz`
64+
65+
##### `ncbi_refseq_genes_chr{1..22}.csv.gz`
66+
One file per autosome (`{1..22}`) containing the NCBI RefSeq Select gene coordinates with the following columns:
67+
- `GENE_ID`: NCBI RefSeq Select gene ID.
68+
- `TRANSCRIPT_ID`: NCBI RefSeq Select transcript ID.
69+
- `START`: Start position (inclusive).
70+
- `STOP`: Stop position (inclusive).
71+
72+
##### `ncbi_refseq_transcripts.txt`
73+
One file with a single column listing all of the autosomal NCBI RefSeq Select transcript IDs used for the `SnpEff` annotations.
74+
75+
##### `ncbi_refseq_{invariant,variant}_genes.csv.gz`
76+
One file for invariant (`_invariant_`) and one for variant (`_variant_`) NCBI RefSeq Select genes, with the following columns:
77+
- `IDX`: Gene index with respect to the `CHR` column.
78+
- `GENE_ID`: NCBI RefSeq Select gene ID.
79+
- `TRANSCRIPT_ID`: NCBI RefSeq Select transcript ID.
80+
- `CHR`: Chromosome.
81+
- `START`: Start position (inclusive).
82+
- `STOP`: Stop position (inclusive).
83+
- `DEN`: Effective sequence length with respect to the Altai Denisovan.
84+
- `S`: Number of segregating sites.
85+
- `QC`: QC condition (numeric, e.g., 1 if true, 0 if false).
86+
87+
**Note**: If only column headers are present in the invariant file (`_invariant_`), no invariant windows passed initial QC.
88+
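A minimal sketch of how the `ncbi_refseq_{invariant,variant}_genes.csv.gz` tables described above might be read with the Python standard library. It assumes the files are comma-delimited with a header row that uses the column names listed in this README; the function names `load_gene_table` and `genes_passing_qc` are hypothetical and not part of the repository.

```python
import csv
import gzip

# Column names as listed in the README text (assumed to match the file header).
EXPECTED_COLUMNS = [
    "IDX", "GENE_ID", "TRANSCRIPT_ID", "CHR",
    "START", "STOP", "DEN", "S", "QC",
]


def load_gene_table(path):
    """Return the rows of a gzipped gene CSV as a list of dicts.

    A header-only file (e.g., an invariant file where nothing passed
    initial QC) yields an empty list.
    """
    with gzip.open(path, mode="rt", newline="") as handle:
        reader = csv.DictReader(handle)
        missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        return list(reader)


def genes_passing_qc(rows):
    """Keep rows whose numeric QC flag equals 1."""
    # Values are read as strings, so cast before comparing.
    return [row for row in rows if float(row["QC"]) == 1]
```

An empty list returned for the invariant file corresponds to the case in the note above, where only column headers are present.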
5489
### `arc_snp_density`
5590

5691
#### `classify_tgp_snps_chromosome.tar.gz`
@@ -533,15 +568,17 @@ For each late Neanderthal (`{cha,vin}`), there is one corresponding file per aut
533568
For each autosome (`chr{1..22}`), there is one corresponding file. This file contains one row, where each column corresponds to the average number of pairwise differences between all African chromosomes in the 1000 Genomes Project and the Altai Denisovan's two chromosomes per non-overlapping 72kb window.
534569

535570
### `vcf_data`
536-
Since all of the VCF files and bookkeeping information are well over 3TB, we provide the converted Zarr arrays, which are the output of `vcf_to_zarr.py` found [here](https://github.com/David-Peede/MUC19/tree/main/vcf_data). Additionally, in the following `windowing` subsubsection, we provide the corresponding bookkeeping files used in our analyses. The `{prefix}.tar.gz` files are named by the dataset (`{prefix}`) and, unless otherwise noted, contain the corresponding Zarr arrays for all autosomes (`chr{1..22}`).
571+
Since ALL of the VCF files and bookkeeping information are well over 3TB, we provide the converted Zarr arrays, which are the output of `vcf_to_zarr_v_revisions.py` found [here](https://github.com/David-Peede/MUC19/tree/main/vcf_data), as 99.99% of analyses are performed using these data. We also include the final filtered variant VCF files used to generate these Zarr arrays. Additionally, in the following `windowing` subsubsection, we provide the corresponding bookkeeping files used in our analyses and include the raw bookkeeping files for the sake of completeness, which are located in `bookkeeping.tar.gz`. Note that all VCF files are formatted in accordance with the [VCF specification](https://samtools.github.io/hts-specs/VCFv4.2.pdf), and the format of the raw bookkeeping files is described in the VCF processing scripts found [here](https://github.com/David-Peede/MUC19/tree/main/vcf_data). The `{prefix}.tar.gz` files are named by the dataset (`{prefix}`) and, unless otherwise noted, contain the corresponding Zarr arrays for all autosomes (`chr{1..22}`).
537572

538573
**Datasets & Paths**
539574
- `arcs_masked_no_aa.tar.gz`
540575
- All archaic individuals without ancestral allele calls. Individuals are in the following order: Altai Neanderthal, Chagyrskaya Neanderthal, Vindija Neanderthal, Altai Denisovan.
541576
- `./muc19/zarr_data/arcs_masked_no_aa_chr{1..22}.zarr`
577+
- Input VCF files: `all_arcs_filtered_merge.tar.gz`.
542578
- `arcs_masked_aa.tar.gz`
543579
- All archaic individuals with ancestral allele calls. Individuals are in the following order: Altai Neanderthal, Chagyrskaya Neanderthal, Vindija Neanderthal, Altai Denisovan, and a placeholder individual "Ancestor" who is always homozygous for the ancestral allele.
544580
- `./muc19/zarr_data/arcs_masked_no_aa_chr{1..22}.zarr`
581+
- Input VCF files: `all_arcs_filtered_merge.tar.gz`.
545582
- `{den,alt,cha,vin}_masked_no_aa.tar.gz`
546583
- `{den}`: Altai Denisovan without ancestral allele calls.
547584
- `./muc19/zarr_data/den_masked_no_aa_chr{1..22}.zarr`
@@ -551,9 +588,11 @@ Since all of the VCF files and bookkeeping information are well over 3TB, we pro
551588
- `./muc19/zarr_data/cha_masked_no_aa_chr{1..22}.zarr`
552589
- `{vin}`: Vindija Neanderthal without ancestral allele calls.
553590
- `./muc19/zarr_data/vin_masked_no_aa_chr{1..22}.zarr`
591+
- Input VCF files: `single_arc_filtered_merge.tar.gz`.
554592
- `tgp_arcs_masked_no_aa.tar.gz`
555593
- 1000 Genomes Project and all archaic individuals without ancestral allele calls. Individuals are in the following order: 1000 Genomes Project individuals (order as in [`./muc19/meta_data/tgp_mod.txt`](https://github.com/David-Peede/MUC19/tree/main/meta_data)), Altai Neanderthal, Chagyrskaya Neanderthal, Vindija Neanderthal, Altai Denisovan.
556594
- `./muc19/zarr_data/tgp_arcs_masked_no_aa_chr{1..22}.zarr`
595+
- Input VCF files: `tgp_sgdp_all_arcs_filtered_merge.tar.gz`.
557596
- `tgp_{den,alt,cha,vin}_masked_no_aa.tar.gz`
558597
- `{den}`: 1000 Genomes Project (order as in [`./muc19/meta_data/tgp_mod.txt`](https://github.com/David-Peede/MUC19/tree/main/meta_data)) and the Altai Denisovan, without ancestral allele calls.
559598
- `./muc19/zarr_data/tgp_den_masked_no_aa_chr{1..22}.zarr`
@@ -563,6 +602,7 @@ Since all of the VCF files and bookkeeping information are well over 3TB, we pro
563602
- `./muc19/zarr_data/tgp_cha_masked_no_aa_chr{1..22}.zarr`
564603
- `{vin}`: 1000 Genomes Project (order as in [`./muc19/meta_data/tgp_mod.txt`](https://github.com/David-Peede/MUC19/tree/main/meta_data)) and the Vindija Neanderthal, without ancestral allele calls.
565604
- `./muc19/zarr_data/tgp_vin_masked_no_aa_chr{1..22}.zarr`
605+
- Input VCF files: `tgp_single_arc_filtered_merge.tar.gz`.
566606
- `tgp_{den,alt,cha,vin}_masked_aa.tar.gz`
567607
- `{den}`: 1000 Genomes Project (order as in [`./muc19/meta_data/tgp_mod.txt`](https://github.com/David-Peede/MUC19/tree/main/meta_data)), the Altai Denisovan, and a placeholder individual "Ancestor" who is always homozygous for the ancestral allele.
568608
- `./muc19/zarr_data/tgp_den_masked_aa_chr{1..22}.zarr`
@@ -572,9 +612,11 @@ Since all of the VCF files and bookkeeping information are well over 3TB, we pro
572612
- `./muc19/zarr_data/tgp_cha_masked_aa_chr{1..22}.zarr`
573613
- `{vin}`: 1000 Genomes Project (order as in [`./muc19/meta_data/tgp_mod.txt`](https://github.com/David-Peede/MUC19/tree/main/meta_data)), the Vindija Neanderthal, and a placeholder individual "Ancestor" who is always homozygous for the ancestral allele.
574614
- `./muc19/zarr_data/tgp_vin_masked_aa_chr{1..22}.zarr`
615+
- Input VCF files: `tgp_single_arc_filtered_merge.tar.gz`.
575616
- `sgdp_arcs_masked_no_aa.tar.gz`
576617
- Simons Genome Diversity Project and all archaic individuals without ancestral allele calls. Individuals are in the following order: Simons Genome Diversity Project individuals (order as in [`./muc19/meta_data/sgdp.txt`](https://github.com/David-Peede/MUC19/tree/main/meta_data)), Altai Neanderthal, Chagyrskaya Neanderthal, Vindija Neanderthal, Altai Denisovan.
577618
- `./muc19/zarr_data/sgdp_arcs_masked_no_aa_chr{1..22}.zarr`
619+
- Input VCF files: `tgp_sgdp_all_arcs_filtered_merge.tar.gz`.
578620
- `sgdp_{den,alt,cha,vin}_masked_no_aa.tar.gz`
579621
- `{den}`: Simons Genome Diversity Project (order as in [`./muc19/meta_data/sgdp.txt`](https://github.com/David-Peede/MUC19/tree/main/meta_data)) and the Altai Denisovan, without ancestral allele calls.
580622
- `./muc19/zarr_data/sgdp_den_masked_no_aa_chr{1..22}.zarr`
@@ -584,20 +626,25 @@ Since all of the VCF files and bookkeeping information are well over 3TB, we pro
584626
- `./muc19/zarr_data/sgdp_cha_masked_no_aa_chr{1..22}.zarr`
585627
- `{vin}`: Simons Genome Diversity Project (order as in [`./muc19/meta_data/sgdp.txt`](https://github.com/David-Peede/MUC19/tree/main/meta_data)) and the Vindija Neanderthal, without ancestral allele calls.
586628
- `./muc19/zarr_data/sgdp_vin_masked_no_aa_chr{1..22}.zarr`
629+
- Input VCF files: `sgdp_single_arc_filtered_merge.tar.gz`.
587630
- `tgp_mod_no_aa.tar.gz`
588631
- 1000 Genomes Project without ancestral allele calls. Individuals are in the same order as they appear in [`./muc19/meta_data/tgp_mod.txt`](https://github.com/David-Peede/MUC19/tree/main/meta_data).
589632
- `./muc19/zarr_data/tgp_mod_no_aa_chr{1..22}.zarr`
633+
- Input VCF files: `tgp_mod_filtered_merge.tar.gz`.
590634
- `tgp_mod_aa.tar.gz`
591635
- 1000 Genomes Project with ancestral allele calls. Individuals are in the same order as they appear in [`./muc19/meta_data/tgp_mod.txt`](https://github.com/David-Peede/MUC19/tree/main/meta_data), and a placeholder individual "Ancestor" who is always homozygous for the ancestral allele.
592636
- `./muc19/zarr_data/tgp_mod_aa_chr{1..22}.zarr`
637+
- Input VCF files: `tgp_mod_filtered_merge.tar.gz`.
593638
- `cha_phased_ref_panel_all_inds.tar.gz`
594639
- Corresponding output from `BEAGLE` before read-based phasing for the focal 72kb region, without ancestral allele calls. Individuals are in the following order: Altai Neanderthal, Chagyrskaya Neanderthal, Altai Denisovan.
595640
- `./muc19/zarr_data/cha_phased_ref_panel_all_inds.zarr`
596641
- Note that this is a single Zarr array for the focal 72kb region.
642+
- The input VCF file can be found [here](https://github.com/David-Peede/MUC19/tree/main/vcf_data/phasing).
597643
- `vin_phased_ref_panel_all_inds.tar.gz`
598644
- Corresponding output from `BEAGLE` before read-based phasing for the focal 72kb region, without ancestral allele calls. Individuals are in the following order: Altai Neanderthal, Vindija Neanderthal, Altai Denisovan.
599645
- `./muc19/zarr_data/vin_phased_ref_panel_all_inds.zarr`
600646
- Note that this is a single Zarr array for the focal 72kb region.
647+
- The input VCF file can be found [here](https://github.com/David-Peede/MUC19/tree/main/vcf_data/phasing).
601648

602649
### `windowing`
603650
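The `vcf_data` entries above use the shorthand `{prefix}_chr{1..22}.zarr` for one Zarr array per autosome. As a hedged illustration, the shorthand can be expanded into explicit paths as follows; the helper name `zarr_paths` is hypothetical, and the default root simply mirrors the `./muc19/zarr_data` directory shown in the listed paths. Opening the arrays themselves (for example, with the `zarr` package) is not shown here.

```python
def zarr_paths(prefix, root="./muc19/zarr_data"):
    """Expand `{prefix}_chr{1..22}.zarr` into one path per autosome."""
    # range(1, 23) covers chromosomes 1 through 22 inclusive.
    return [f"{root}/{prefix}_chr{chrom}.zarr" for chrom in range(1, 23)]


# Example: the per-autosome arrays for the `tgp_mod_aa` dataset.
tgp_mod_aa_paths = zarr_paths("tgp_mod_aa")
```

The phased reference panel datasets (`cha_phased_ref_panel_all_inds` and `vin_phased_ref_panel_all_inds`) are single arrays for the focal 72kb region, so this expansion does not apply to them.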

README.md

Lines changed: 8 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
# The *MUC19* gene in Denisovans, Neanderthals, and Modern Humans: An Evolutionary History of Recurrent Introgression and Natural Selection
1+
# The *MUC19* Gene: An Evolutionary History of Recurrent Introgression and Natural Selection
22

33
Code associated with Villanea and Peede et al. 202X.
44

@@ -12,9 +12,11 @@ Code associated with Villanea and Peede et. al. 202X.
1212
│   ├── dataframes
1313
│   └── supp_tables
1414
├── ancient_americans
15+
├── annotations
1516
├── arc_snp_density
1617
├── figure_nbs
1718
├── heterozygosity
19+
├── hg38_data
1820
├── hmmix_tracts
1921
├── iHS
2022
├── introgression
@@ -26,6 +28,7 @@ Code associated with Villanea and Peede et. al. 202X.
2628
├── psuedo_ancestry_painting
2729
├── sequence_divergence
2830
├── vcf_data
31+
│   ├── ann_summary
2932
│   ├── muc19
3033
│   └── phasing
3134
├── vntr
@@ -36,8 +39,10 @@ Code associated with Villanea and Peede et. al. 202X.
3639

3740
- `amr_lai`
3841
- `ancient_americans`
42+
- `annotations`
3943
- `arc_snp_density`
4044
- `heterozygosity`
45+
- `hg38_data`
4146
- `hmmix_tracts`
4247
- `iHS`
4348
- `introgression`
@@ -53,7 +58,8 @@ Code associated with Villanea and Peede et. al. 202X.
5358

5459
- `analyses_nbs`
5560
- `figure_nbs`
61+
- `meta_data`
5662

5763
### Notes
5864

59-
As this code is associated with a manuscript that is currently going through the peer-review process and due to the fact that GitHub only stores up to 100 Mb per repository some directories are compressed using `.tar.gz` or `.zip`. To extract the files those directories you will need to run either `tar -xf target_directory.tar.gz` or `unzip target_directory.zip`. Once the associated manuscript has been peer-reviewed all output files will be uploaded to the appropriate locations.
65+
This repository contains all code, meta information, and final results. Key datasets of interest have been uploaded to Zenodo (see `./ZENODO_README.md`). All of the intermediary data used to generate the final set of results in `./analyses_nbs/dataframes` have been uploaded to DRYAD (see `./DRYAD_README.md`). As this code is associated with a manuscript that is currently going through the peer-review process, and due to the fact that GitHub only stores up to 100Mb per repository, some directories are compressed using `.tar.gz` or `.zip`. To extract the files from those directories, you will need to run either `tar -xf target_directory.tar.gz` or `unzip target_directory.zip`.
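For readers who prefer to script the extraction step, the following is a rough Python equivalent of the `tar -xf` and `unzip` commands mentioned in the note above, using only the standard library. The function name `extract_archive` is hypothetical, and the sketch assumes the archives come from a trusted source.

```python
import tarfile
import zipfile
from pathlib import Path


def extract_archive(archive, dest="."):
    """Extract a `.tar.gz` or `.zip` archive into `dest`."""
    archive = Path(archive)
    if archive.name.endswith(".tar.gz"):
        # Counterpart of `tar -xf target_directory.tar.gz`.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(dest)
    elif archive.suffix == ".zip":
        # Counterpart of `unzip target_directory.zip`.
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
    else:
        raise ValueError(f"unsupported archive type: {archive.name}")
```

For example, `extract_archive("target_directory.tar.gz")` unpacks into the current working directory, matching the shell command's default behavior.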

analyses_nbs/README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -14,7 +14,7 @@ All packages are publicly available and their documentation can be viewed at the
1414

1515
## Code
1616

17-
__Run the code necessary to generate any other input files, which can be found in the `README.md` of each directory.__
17+
__Run the code necessary to generate the input files, which can be found in the `README.md` of each data generation directory.__
1818

1919
- Supplemental tables can be viewed in the `supp_tables` directory.
2020
- All results from the notebooks above can be found in the `dataframes` directory.

pbs_sims/.DS_Store

0 Bytes
Binary file not shown.
