gnomAD Data Ingestion
gnomAD data is distributed either as Hail tables or as very large VCF files. The Hail tables offer easier access to the data we need. First, install Hail:
pip install hail
Then, set up the PySpark configuration by installing the S3 connector:
curl https://gist.githubusercontent.com/danking/f8387f5681b03edc5babdf36e14140bc/raw/23d43a2cc673d80adcc8f2a1daee6ab252d6f667/install-s3-connector.sh | bash
You'll need to edit the AWS Hadoop version in the script to 3.2.2.
You may need to create the conf directory and spark-defaults.conf file that the script expects (adjust the paths for your virtualenv):
mkdir /Users/ben/workspace/mavedb/sandbox/venv-mavedb-sandbox/lib/python3.9/site-packages/pyspark/conf
touch /Users/ben/workspace/mavedb/sandbox/venv-mavedb-sandbox/lib/python3.9/site-packages/pyspark/conf/spark-defaults.conf
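If you're unsure which Hadoop version your PySpark installation bundles (and therefore which version to set in the script), a quick check like the following is a minimal sketch; the exact jar names depend on the PySpark build:

import os
import pyspark

# List the Hadoop jars bundled with this PySpark install; the version in
# their file names is the one to use when editing install-s3-connector.sh.
jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
print(sorted(j for j in os.listdir(jars_dir) if j.startswith("hadoop")))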
Then, open a Python interpreter and read the Hail table:
AWS_PROFILE=mavedb python3
>>> import hail as hl
>>> browser_table = hl.read_table("s3a://gnomad-public-us-east-1/release/4.1/ht/browser/gnomad.browser.v4.1.sites.ht")
>>> # we only care about the caid and joint data
>>> subset_table = browser_table.select(browser_table.caid, browser_table.joint)
At this point, we can write the reduced table out to S3 in another file format (we use Parquet). To avoid unnecessary data storage charges, consider removing old versions from the bucket (see the cleanup sketch after the export step).
>>> spark_table = subset_table.to_spark()
>>> spark_table.write.mode("overwrite").parquet("s3a://mavedb2-gnomad-data/v<GNOMAD_VERSION>")
Note that this final step may take a very long time on dev machines.
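If you do want to remove an old version once the new export is verified, a sketch along these lines deletes the old Parquet output from the bucket. The prefix name here is a hypothetical placeholder; double-check it before deleting anything:

import boto3

# Assumption: each export lives under a versioned prefix (e.g. "v<OLD_GNOMAD_VERSION>/")
# in the mavedb2-gnomad-data bucket. Verify the prefix before running the delete.
bucket = boto3.resource("s3").Bucket("mavedb2-gnomad-data")
old_prefix = "v<OLD_GNOMAD_VERSION>/"  # hypothetical placeholder
bucket.objects.filter(Prefix=old_prefix).delete()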
Once the data is uploaded to S3, you'll need to run the AWS Glue Crawler on the S3 bucket to generate the Athena schema and prepare the table for query. You should include the version name in the new table name.
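The crawler can be run from the console or programmatically. The sketch below assumes a crawler has already been created against the bucket (the name gnomad-v4-1-crawler is hypothetical); it starts the crawler and waits for it to finish:

import time
import boto3

glue = boto3.client("glue")
crawler_name = "gnomad-v4-1-crawler"  # hypothetical; use your crawler's name

# Kick off the crawler and poll until it returns to the READY state.
glue.start_crawler(Name=crawler_name)
while glue.get_crawler(Name=crawler_name)["Crawler"]["State"] != "READY":
    time.sleep(30)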
Then, simply select the new table in Athena and ensure you can query it successfully.
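A quick sanity check can also be run from Python. This sketch assumes the crawler created a table named gnomad_v4_1 in a database called mavedb_gnomad (both hypothetical) and that query results may be written to an S3 location you control:

import time
import boto3

athena = boto3.client("athena")

# Hypothetical database/table names and results location; adjust to match
# what the Glue crawler actually created.
query = athena.start_query_execution(
    QueryString="SELECT caid FROM gnomad_v4_1 LIMIT 10",
    QueryExecutionContext={"Database": "mavedb_gnomad"},
    ResultConfiguration={"OutputLocation": "s3://mavedb2-gnomad-data/athena-results/"},
)
query_id = query["QueryExecutionId"]

# Wait for the query to finish, then print the first few result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)
print(athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"][:5])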