Commit 9687e4b

Improve database build documentation, prep v0.10 release
1 parent ad5ee77 commit 9687e4b

5 files changed (+148, −5 lines)

CHANGELOG.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -2,7 +2,7 @@
 All notable changes to this project will be documented in this file.
 This project adheres to [Semantic Versioning](http://semver.org/).

-## Unreleased
+## [0.10] 2023-09-15

 ### Added
 - Marker definitions for NimaGen 29-plex (see #134).
```

dbbuild/README.md

Lines changed: 18 additions & 4 deletions
````diff
@@ -189,15 +189,29 @@ They can be installed using pip and/or conda.
 - selenium
 - geckodriver
 - intervaltree
+- tqdm


 ## Appendix B: Required Auxiliary Data files

-Run `./build.py --check` to see the expected locations of these files
-
 - dbSNP
-  - .vcf.gz, .vcf.gz.tbi, and .rsidx files
-  - GRCh37 and GRCh38
+  - .vcf.gz, .vcf.gz.tbi, and .rsidx files for both GRCh37 and GRCh38
+  - info on merged records
 - UCSC liftover chain files
   - hg19ToHg38
   - hg38ToHg19
+
+The following command will download the data files required for the database build.
+
+```
+snakemake -c1 -p -s download.smk -d databases/
+```
+
+Following a successful run, the command `./build.py --check` can be used to verify that the files were downloaded correctly to the expected locations.
+
+A dbSNP rsidx index must also be built for both GRCh37 and GRCh38 coordinates. Note that the following commands require many hours of run time.
+
+```
+rsidx index databases/dbSNP/dbSNP_GRCh37.vcf.gz databases/dbSNP/dbSNP_GRCh37.rsidx
+rsidx index databases/dbSNP/dbSNP_GRCh38.vcf.gz databases/dbSNP/dbSNP_GRCh38.rsidx
+```
````
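The point of the rsidx index is to make individual rsID lookups fast without scanning a multi-gigabyte VCF. Conceptually, rsidx stores an rsID-to-coordinate mapping in SQLite; the sketch below illustrates that idea with a toy in-memory table and made-up records (it is not the rsidx implementation, just the lookup pattern it enables).

```python
# Sketch of an rsidx-style lookup: map each rsID in a VCF to its chromosome
# and position in a SQLite table, so single records can be found by rsID
# without reading the whole file. Toy data for illustration only.
import sqlite3

vcf_records = [
    ("1", 10177, "rs367896724"),
    ("1", 10352, "rs555500075"),
    ("2", 10563, "rs1570391830"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rsid_to_coord (rsid TEXT PRIMARY KEY, chrom TEXT, coord INTEGER)"
)
conn.executemany(
    "INSERT INTO rsid_to_coord VALUES (?, ?, ?)",
    [(rsid, chrom, pos) for chrom, pos, rsid in vcf_records],
)

def lookup(rsid):
    """Return (chrom, coord) for an rsID, or None if it is not indexed."""
    return conn.execute(
        "SELECT chrom, coord FROM rsid_to_coord WHERE rsid = ?", (rsid,)
    ).fetchone()

print(lookup("rs555500075"))  # ('1', 10352)
```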

dbbuild/download.smk

Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@

```python
# -------------------------------------------------------------------------------------------------
# Copyright (c) 2023, DHS.
#
# This file is part of MicroHapDB (http://github.com/bioforensics/MicroHapDB) and is licensed under
# the BSD license: see LICENSE.txt.
#
# This software was prepared for the Department of Homeland Security (DHS) by the Battelle National
# Biodefense Institute, LLC (BNBI) as part of contract HSHQDC-15-C-00064 to manage and operate the
# National Biodefense Analysis and Countermeasures Center (NBACC), a Federally Funded Research and
# Development Center.
# -------------------------------------------------------------------------------------------------

import json
import os
from pathlib import Path
from urllib.request import urlretrieve
from warnings import warn

import pandas as pd
from tqdm import tqdm

accessions = {37: "GCF_000001405.25", 38: "GCF_000001405.40"}


def progress(t):
    """Stolen shamelessly from the tqdm documentation"""
    last_b = [0]

    def inner(b=1, bsize=1, tsize=None):
        if tsize is not None:
            t.total = tsize
        t.update((b - last_b[0]) * bsize)
        last_b[0] = b

    return inner


rule all:
    input:
        expand(
            "dbSNP/dbSNP_GRCh{version}.{extension}",
            version=(37, 38),
            extension=("vcf.gz", "vcf.gz.tbi"),
        ),
        "dbSNP/refsnp-merged.csv.gz",
        expand("{contrast}.over.chain.gz", contrast=("hg19ToHg38", "hg38ToHg19")),


rule dbsnp:
    output:
        path="dbSNP/dbSNP_GRCh{version}.{extension}",
    wildcard_constraints:
        version=r"\d+",
    run:
        accession = accessions[int(wildcards.version)]
        url_ext = wildcards.extension.replace("vcf.", "")
        url = f"https://ftp.ncbi.nih.gov/snp/archive/b156/VCF/{accession}.{url_ext}"
        origpath = Path("dbSNP") / Path(url).name
        with tqdm(unit="B", unit_scale=True, leave=True, miniters=1, desc=str(origpath)) as t:
            urlretrieve(url, origpath, reporthook=progress(t))
        os.symlink(origpath.name, Path(output.path).name, dir_fd=os.open("dbSNP", os.O_RDONLY))


rule merged:
    output:
        path="dbSNP/refsnp-merged.csv.gz",
    run:
        url = "https://ftp.ncbi.nih.gov/snp/archive/b156/JSON/refsnp-merged.json.bz2"
        json_path = "dbSNP/refsnp-merged.json.bz2"
        urlretrieve(url, json_path)
        shell(f"bunzip2 {json_path}")
        merged_rsids = dict()
        updateint = 1e6
        threshold = updateint
        with open("dbSNP/refsnp-merged.json", "r") as instream:
            for n, line in enumerate(instream):
                try:
                    data = json.loads(line)
                except Exception:
                    warn(f"Could not parse line {n+1}, skipping: {line}")
                    continue
                source = data["refsnp_id"]
                targets = data["merged_snapshot_data"]["merged_into"]
                for target in targets:
                    merged_rsids[f"rs{source}"] = f"rs{target}"
                if n >= threshold:
                    threshold += updateint
                    if threshold == updateint * 10:
                        updateint = threshold
                    print(f"processed {n} rows")
        table = pd.DataFrame(merged_rsids.items(), columns=["Source", "Target"])
        table.to_csv(output.path, index=False, compression="gzip")


rule chain:
    output:
        path="{contrast}.over.chain.gz",
    run:
        source = wildcards.contrast[:4]
        url = f"https://hgdownload.cse.ucsc.edu/goldenpath/{source}/liftOver/{wildcards.contrast}.over.chain.gz"
        urlretrieve(url, output.path)
```
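The `merged` rule reads NCBI's `refsnp-merged.json`, one JSON object per line, and maps each retired rsID to the record(s) it was merged into. The snippet below isolates that parsing step on a single toy line; the field names match what the rule reads, but the rsID values are made up for illustration.

```python
# Parse one line of refsnp-merged.json and record the merge mapping,
# mirroring the extraction logic in the `merged` rule above.
import json

line = '{"refsnp_id": "12345", "merged_snapshot_data": {"merged_into": ["67890"]}}'
data = json.loads(line)

merged_rsids = {}
source = data["refsnp_id"]
for target in data["merged_snapshot_data"]["merged_into"]:
    merged_rsids[f"rs{source}"] = f"rs{target}"

print(merged_rsids)  # {'rs12345': 'rs67890'}
```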
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@

````markdown
# NYGC 1000 Genomes Project Update


## Citations

Byrska-Bishop M, Evani US, Zhao X, et al. (2022) High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. *Cell* 185:18, [doi:10.1016/j.cell.2022.08.004](https://doi.org/10.1016/j.cell.2022.08.004).


## Build process

If `dbbuild/marker.csv` has been updated, copy that file to `marker-latest.csv` in this directory.

```
cp ../../marker.csv marker-latest.csv
```

Then haplotype frequencies and A<sub>e</sub> values can be recomputed as follows.

```
snakemake -c 8 --config 1kgp_dir=databases/1000Genomes/ refr=databases/hg38.fasta -p
```


## Required databases

The data from this study is too large to bundle with the MicroHapDB git repository.
Prior to (re-)building this data set, all VCF and TBI files must be downloaded from the following URL: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/.
The files can be placed anywhere on your system, so long as the directory path is correctly specified in the `1kgp_dir` configuration as shown above.

The GRCh38 human reference genome assembly should also be downloaded from the URL https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz, decompressed, and specified in the `refr` configuration as shown above.
````
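The A<sub>e</sub> values mentioned above are the effective number of alleles: for a marker with haplotype frequencies p<sub>i</sub>, A<sub>e</sub> = 1 / Σ p<sub>i</sub><sup>2</sup>. A minimal sketch of that formula, with toy frequencies (the actual values are computed from the 1KGP VCFs by the workflow):

```python
# Effective number of alleles (Ae) from haplotype frequencies: Ae = 1 / sum(p_i^2).
# Equals the haplotype count when all frequencies are equal, and shrinks as the
# distribution becomes more skewed. Toy frequencies for illustration only.
def effective_num_alleles(freqs):
    assert abs(sum(freqs) - 1.0) < 1e-9, "frequencies must sum to 1"
    return 1.0 / sum(p * p for p in freqs)

print(effective_num_alleles([0.25, 0.25, 0.25, 0.25]))  # 4.0
print(effective_num_alleles([0.7, 0.2, 0.1]))  # well below 3, reflecting the skew
```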

docs/citations.md

Lines changed: 2 additions & 0 deletions
```diff
@@ -69,3 +69,5 @@ Contact: daniel.standage@st.dhs.gov.
 <sup>[27]</sup>Zhang R, Xue J, Tan M, Chen D, Xiao Y, Liu G, Zheng Y, Wu Q, Liao M, Lv M, Qu S, Liang W (2023) An MPS-Based 50plex Microhaplotype Assay for Forensic DNA Analysis. *Genes* 14:865, [doi:10.3390/genes14040865](https://doi.org/10.3390/genes14040865).

 <sup>[28]</sup>Du Q, Ma G, Lu C, Wang Q, Fu L, Cong B, Li S (2023) Development and evaluation of a novel panel containing 188 microhaplotypes for 2nd-degree kinship testing in the Hebei Han population. *FSI: Genetics* 65, [doi:10.1016/j.fsigen.2023.102855](https://doi.org/10.1016/j.fsigen.2023.102855).
+
+<sup>[29]</sup>Byrska-Bishop M, Evani US, Zhao X, et al. (2022) High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. *Cell* 185:18, [doi:10.1016/j.cell.2022.08.004](https://doi.org/10.1016/j.cell.2022.08.004).
```
