Commit 9687e4b

Improve database build documentation, prep v0.10 release
1 parent ad5ee77 commit 9687e4b

5 files changed (+148, −5 lines)

CHANGELOG.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -2,7 +2,7 @@
 All notable changes to this project will be documented in this file.
 This project adheres to [Semantic Versioning](http://semver.org/).

-## Unreleased
+## [0.10] 2023-09-15

 ### Added
 - Marker definitions for NimaGen 29-plex (see #134).
```

dbbuild/README.md

Lines changed: 18 additions & 4 deletions
````diff
@@ -189,15 +189,29 @@ They can be installed using pip and/or conda.
 - selenium
 - geckodriver
 - intervaltree
+- tqdm


 ## Appendix B: Required Auxiliary Data files

-Run `./build.py --check` to see the expected locations of these files
-
 - dbSNP
-  - .vcf.gz, .vcf.gz.tbi, and .rsidx files
-  - GRCh37 and GRCh38
+  - .vcf.gz, .vcf.gz.tbi, and .rsidx files for both GRCh37 and GRCh38
+  - info on merged records
 - UCSC liftover chain files
   - hg19ToHg38
   - hg38ToHg19
+
+The following command will download the data files required for the database build.
+
+```
+snakemake -c1 -p -s download.smk -d databases/
+```
+
+Following a successful run, the command `./build.py --check` can be used to verify that the files were downloaded correctly to the expected locations.
+
+A dbSNP rsidx index must also be built for both GRCh37 and GRCh38 coordinates. Note that the following commands require many hours of run time.
+
+```
+rsidx index databases/dbSNP/dbSNP_GRCh37.vcf.gz databases/dbSNP/dbSNP_GRCh37.rsidx
+rsidx index databases/dbSNP/dbSNP_GRCh38.vcf.gz databases/dbSNP/dbSNP_GRCh38.rsidx
+```
````
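The point of the rsidx index is to make individual rsID lookups fast without scanning a multi-gigabyte VCF. Conceptually, rsidx stores an rsID-to-coordinate mapping in SQLite; the sketch below illustrates that idea with a toy in-memory table and made-up records (it is not the rsidx implementation, just the lookup pattern it enables).

```python
# Sketch of an rsidx-style lookup: map each rsID in a VCF to its chromosome
# and position in a SQLite table, so single records can be found by rsID
# without reading the whole file. Toy data for illustration only.
import sqlite3

vcf_records = [
    ("1", 10177, "rs367896724"),
    ("1", 10352, "rs555500075"),
    ("2", 10563, "rs1570391830"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rsid_to_coord (rsid TEXT PRIMARY KEY, chrom TEXT, coord INTEGER)"
)
conn.executemany(
    "INSERT INTO rsid_to_coord VALUES (?, ?, ?)",
    [(rsid, chrom, pos) for chrom, pos, rsid in vcf_records],
)

def lookup(rsid):
    """Return (chrom, coord) for an rsID, or None if it is not indexed."""
    return conn.execute(
        "SELECT chrom, coord FROM rsid_to_coord WHERE rsid = ?", (rsid,)
    ).fetchone()

print(lookup("rs555500075"))  # ('1', 10352)
```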

dbbuild/download.smk

Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@

```python
# -------------------------------------------------------------------------------------------------
# Copyright (c) 2023, DHS.
#
# This file is part of MicroHapDB (http://github.com/bioforensics/MicroHapDB) and is licensed under
# the BSD license: see LICENSE.txt.
#
# This software was prepared for the Department of Homeland Security (DHS) by the Battelle National
# Biodefense Institute, LLC (BNBI) as part of contract HSHQDC-15-C-00064 to manage and operate the
# National Biodefense Analysis and Countermeasures Center (NBACC), a Federally Funded Research and
# Development Center.
# -------------------------------------------------------------------------------------------------

import json
import os
from pathlib import Path
from urllib.request import urlretrieve
from warnings import warn

import pandas as pd
from tqdm import tqdm

accessions = {37: "GCF_000001405.25", 38: "GCF_000001405.40"}


def progress(t):
    """Stolen shamelessly from the tqdm documentation"""
    last_b = [0]

    def inner(b=1, bsize=1, tsize=None):
        if tsize is not None:
            t.total = tsize
        t.update((b - last_b[0]) * bsize)
        last_b[0] = b

    return inner


rule all:
    input:
        expand(
            "dbSNP/dbSNP_GRCh{version}.{extension}",
            version=(37, 38),
            extension=("vcf.gz", "vcf.gz.tbi"),
        ),
        "dbSNP/refsnp-merged.csv.gz",
        expand("{contrast}.over.chain.gz", contrast=("hg19ToHg38", "hg38ToHg19")),


rule dbsnp:
    output:
        path="dbSNP/dbSNP_GRCh{version}.{extension}",
    wildcard_constraints:
        version=r"\d+",
    run:
        accession = accessions[int(wildcards.version)]
        url_ext = wildcards.extension.replace("vcf.", "")
        url = f"https://ftp.ncbi.nih.gov/snp/archive/b156/VCF/{accession}.{url_ext}"
        origpath = Path("dbSNP") / Path(url).name
        with tqdm(unit="B", unit_scale=True, leave=True, miniters=1, desc=str(origpath)) as t:
            urlretrieve(url, origpath, reporthook=progress(t))
        os.symlink(origpath.name, Path(output.path).name, dir_fd=os.open("dbSNP", os.O_RDONLY))


rule merged:
    output:
        path="dbSNP/refsnp-merged.csv.gz",
    run:
        url = "https://ftp.ncbi.nih.gov/snp/archive/b156/JSON/refsnp-merged.json.bz2"
        json_path = "dbSNP/refsnp-merged.json.bz2"
        urlretrieve(url, json_path)
        shell(f"bunzip2 {json_path}")
        merged_rsids = dict()
        updateint = 1e6
        threshold = updateint
        with open("dbSNP/refsnp-merged.json", "r") as instream:
            for n, line in enumerate(instream):
                try:
                    data = json.loads(line)
                except Exception:
                    warn(f"Could not parse line {n+1}, skipping: {line}")
                    continue
                source = data["refsnp_id"]
                targets = data["merged_snapshot_data"]["merged_into"]
                for target in targets:
                    merged_rsids[f"rs{source}"] = f"rs{target}"
                if n >= threshold:
                    threshold += updateint
                    if threshold == updateint * 10:
                        updateint = threshold
                    print(f"processed {n} rows")
        table = pd.DataFrame(merged_rsids.items(), columns=["Source", "Target"])
        table.to_csv(output.path, index=False, compression="gzip")


rule chain:
    output:
        path="{contrast}.over.chain.gz",
    run:
        source = wildcards.contrast[:4]
        url = f"https://hgdownload.cse.ucsc.edu/goldenpath/{source}/liftOver/{wildcards.contrast}.over.chain.gz"
        urlretrieve(url, output.path)
```
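The `merged` rule reads NCBI's `refsnp-merged.json`, one JSON object per line, and maps each retired rsID to the record(s) it was merged into. The snippet below isolates that parsing step on a single toy line; the field names match what the rule reads, but the rsID values are made up for illustration.

```python
# Parse one line of refsnp-merged.json and record the merge mapping,
# mirroring the extraction logic in the `merged` rule above.
import json

line = '{"refsnp_id": "12345", "merged_snapshot_data": {"merged_into": ["67890"]}}'
data = json.loads(line)

merged_rsids = {}
source = data["refsnp_id"]
for target in data["merged_snapshot_data"]["merged_into"]:
    merged_rsids[f"rs{source}"] = f"rs{target}"

print(merged_rsids)  # {'rs12345': 'rs67890'}
```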
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@

````markdown
# NYGC 1000 Genomes Project Update


## Citations

Byrska-Bishop M, Evani US, Zhao X, et al. (2022) High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. *Cell* 185:18, [doi:10.1016/j.cell.2022.08.004](https://doi.org/10.1016/j.cell.2022.08.004).


## Build process

If `dbbuild/marker.csv` has been updated, copy that file to `marker-latest.csv` in this directory.

```
cp ../../marker.csv marker-latest.csv
```

Then haplotype frequencies and A<sub>e</sub> values can be recomputed as follows.

```
snakemake -c 8 --config 1kgp_dir=databases/1000Genomes/ refr=databases/hg38.fasta -p
```


## Required databases

The data from this study is too large to bundle with the MicroHapDB git repository.
Prior to (re-)building this data set, all VCF and TBI files must be downloaded from the following URL: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/.
The files can be placed anywhere on your system, so long as the directory path is correctly specified in the `1kgp_dir` configuration as shown above.

The GRCh38 human reference genome assembly should also be downloaded from the URL https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz, decompressed, and specified in the `refr` configuration as shown above.
````
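The A<sub>e</sub> values mentioned above are the effective number of alleles: for a marker with haplotype frequencies p<sub>i</sub>, A<sub>e</sub> = 1 / Σ p<sub>i</sub><sup>2</sup>. A minimal sketch of that formula, with toy frequencies (the actual values are computed from the 1KGP VCFs by the workflow):

```python
# Effective number of alleles (Ae) from haplotype frequencies: Ae = 1 / sum(p_i^2).
# Equals the haplotype count when all frequencies are equal, and shrinks as the
# distribution becomes more skewed. Toy frequencies for illustration only.
def effective_num_alleles(freqs):
    assert abs(sum(freqs) - 1.0) < 1e-9, "frequencies must sum to 1"
    return 1.0 / sum(p * p for p in freqs)

print(effective_num_alleles([0.25, 0.25, 0.25, 0.25]))  # 4.0
print(effective_num_alleles([0.7, 0.2, 0.1]))  # well below 3, reflecting the skew
```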

docs/citations.md

Lines changed: 2 additions & 0 deletions
```diff
@@ -69,3 +69,5 @@ Contact: daniel.standage@st.dhs.gov.
 <sup>[27]</sup>Zhang R, Xue J, Tan M, Chen D, Xiao Y, Liu G, Zheng Y, Wu Q, Liao M, Lv M, Qu S, Liang W (2023) An MPS-Based 50plex Microhaplotype Assay for Forensic DNA Analysis. *Genes* 14:865, [doi:10.3390/genes14040865](https://doi.org/10.3390/genes14040865).

 <sup>[28]</sup>Du Q, Ma G, Lu C, Wang Q, Fu L, Cong B, Li S (2023) Development and evaluation of a novel panel containing 188 microhaplotypes for 2nd-degree kinship testing in the Hebei Han population. *FSI: Genetics* 65, [doi:10.1016/j.fsigen.2023.102855](https://doi.org/10.1016/j.fsigen.2023.102855).
+
+<sup>[29]</sup>Byrska-Bishop M, Evani US, Zhao X, et al. (2022) High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. *Cell* 185:18, [doi:10.1016/j.cell.2022.08.004](https://doi.org/10.1016/j.cell.2022.08.004).
```
