
gnomAD Data Ingestion


Data Upload to S3

gnomAD data is distributed either as Hail tables or as very large VCF files. The Hail tables offer easier access to the data we need. First, install Hail:

pip install hail

Then, create the PySpark configuration options needed to read from S3:

curl https://gist.githubusercontent.com/danking/f8387f5681b03edc5babdf36e14140bc/raw/23d43a2cc673d80adcc8f2a1daee6ab252d6f667/install-s3-connector.sh | bash

You'll need to edit the AWS Hadoop version in the script to 3.2.2.
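
The connector version in the script should match the Hadoop version bundled with your PySpark installation. A quick way to check it is to start a throwaway local Spark session; this is just a sanity-check sketch, not part of the original workflow:

# Sanity check (optional): print the Hadoop version bundled with the installed
# PySpark, so the hadoop-aws jar version in the script can be matched to it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())
spark.stop()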

You may need to create the configuration directory expected by this script:

mkdir /Users/ben/workspace/mavedb/sandbox/venv-mavedb-sandbox/lib/python3.9/site-packages/pyspark/conf
touch /Users/ben/workspace/mavedb/sandbox/venv-mavedb-sandbox/lib/python3.9/site-packages/pyspark/conf/spark-defaults.conf
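
The hard-coded paths above are from one development machine; a small Python sketch like the following will locate (and create) the equivalent directory inside whatever virtual environment you are using:

# Find the conf directory of the pyspark package installed in this environment
# and create an empty spark-defaults.conf for the install script to append to.
import os
import pyspark

conf_dir = os.path.join(os.path.dirname(pyspark.__file__), "conf")
os.makedirs(conf_dir, exist_ok=True)
open(os.path.join(conf_dir, "spark-defaults.conf"), "a").close()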

Then, open a Python interpreter and read the Hail table:

AWS_PROFILE=mavedb python3
>>> import hail as hl
>>> browser_table = hl.read_table("s3a://gnomad-public-us-east-1/release/4.1/ht/browser/gnomad.browser.v4.1.sites.ht")
>>> # we only care about the caid and joint data
>>> subset_table = browser_table.select(browser_table.caid, browser_table.joint)
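
If you want to confirm the schema before and after subsetting, Hail's Table.describe() prints the field names and types (the exact fields will vary by gnomAD release):

>>> browser_table.describe()  # full browser table schema
>>> subset_table.describe()   # should now list only caid and joint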

At this point, we can write the reduced table to our own S3 bucket in another file format (Parquet). To avoid unnecessary data storage charges, we should consider removing old versions from the bucket.

>>> spark_table = subset_table.to_spark()
>>> spark_table.write.mode("overwrite").parquet("s3a://mavedb2-gnomad-data/v<GNOMAD_VERSION>")

Note that this final step may take a very long time on development machines.
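
As an optional sanity check, the Parquet output can be read back in the same session. This sketch reuses the Spark session Hail already started; counting scans the whole dataset, so it can also be slow:

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()  # returns the session Hail started
>>> check = spark.read.parquet("s3a://mavedb2-gnomad-data/v<GNOMAD_VERSION>")
>>> check.printSchema()
>>> check.count()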

Preparing data for Athena

Once the data has been uploaded to S3, you'll need to run an AWS Glue Crawler on the S3 bucket to generate the Athena schema and register the table for querying. Include the gnomAD version in the new table name.
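
The crawler can be created and run from the AWS console; if you prefer to script it, a boto3 sketch looks roughly like the following (the crawler name, IAM role, and Glue database are placeholders, not values mandated by this project):

# Hypothetical boto3 equivalent of creating and running the Glue crawler.
# Crawler name, role ARN, database name, and S3 path are placeholders.
import boto3

glue = boto3.client("glue")
glue.create_crawler(
    Name="gnomad-v<GNOMAD_VERSION>-crawler",
    Role="arn:aws:iam::<ACCOUNT_ID>:role/<GLUE_CRAWLER_ROLE>",
    DatabaseName="gnomad",
    Targets={"S3Targets": [{"Path": "s3://mavedb2-gnomad-data/v<GNOMAD_VERSION>/"}]},
)
glue.start_crawler(Name="gnomad-v<GNOMAD_VERSION>-crawler")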

Then, select the new table in Athena and ensure you can query it successfully.
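
A quick way to confirm the table is queryable from code is to submit a small query through boto3 (the database, table, and results-bucket names below are placeholders):

# Hypothetical smoke test: run a small Athena query against the new table.
import boto3

athena = boto3.client("athena")
resp = athena.start_query_execution(
    QueryString="SELECT * FROM gnomad_v<GNOMAD_VERSION> LIMIT 10",
    QueryExecutionContext={"Database": "gnomad"},
    ResultConfiguration={"OutputLocation": "s3://<ATHENA_RESULTS_BUCKET>/"},
)
print(resp["QueryExecutionId"])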
