
This repo contains the files needed to generate metadata for Branchwater metagenome data

MGS-sails/Branchwater-Metadata


Building the DuckDB database for the webapp can be done via BigQuery. To build the database, you need a BigQuery service account associated with a project named 'sraproject', and that project must contain a dataset named 'mastiffdata'.

Right now the query is set up to pull metadata for just 150,000 accessions, to keep the database small while running the app locally for troubleshooting. To remove this limit and pull metadata for all accessions, remove 'LIMIT 150000' from L75 of bqtomongo.py.
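For orientation, the query in bqtomongo.py amounts to something like the sketch below. The public SRA metadata table name `nih-sra-datastore.sra.metadata` comes from NCBI's BigQuery setup docs; the `acc` column filter and the exact statement shape are assumptions, not a copy of the script.

```python
# Sketch of the kind of query bqtomongo.py sends to BigQuery (assumed shape;
# see the script itself for the real statement and its LIMIT on L75).
def build_query(limit=150000):
    query = (
        "SELECT * FROM `nih-sra-datastore.sra.metadata` "
        "WHERE acc IN UNNEST(@accessions)"
    )
    if limit:  # drop the limit to pull metadata for all accessions
        query += f" LIMIT {limit}"
    return query
```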

Set up SRA metadata access

Based on the SRA instructions at Setting-up BigQuery

  1. Create project - sraproject

  2. Go to BigQuery search tool

    1. In the Explorer panel select +ADD to add data

    2. Select Star a project by name

    3. Search: nih-sra-datastore and select it

Create service account key

  1. Go to navigation menu -> IAM & Admin -> service accounts -> + CREATE SERVICE ACCOUNT

    • name: sraquery

    • ID: sraquery

      • NOTE - the actual project ID is autogenerated

    • Roles: BigQuery Job User; BigQuery Data Owner; BigQuery Read Sessions User

  2. Once the service account is created, open the menu under Actions and choose Manage keys

    1. Select ADD KEY

    2. Select Create new key

    3. Key type: JSON

    4. Download the key to the bw_db/ folder

    5. Save it as bqKey.json

  3. In the BigQuery console, under sraproject create a dataset named mastiffdata

Dockerized metadata builder

This folder is now containerized. You can build and run a Docker image that:

  • builds the metadata parquet via BigQuery or via the public SRA metadata on S3, and
  • loads the parquet into a DuckDB file.

A persistent host directory is recommended for inputs/outputs and credentials. The container expects them at /data/bw_db.

1) Build the image

docker build -t branchwater-metadata .

2) Prepare local data directory

On the host, create a directory for inputs/outputs, e.g. $(pwd)/bw_db, and place your files there:

  • sraids — a text file with accession IDs (one per line)
  • bqKey.json — BigQuery service account JSON (only needed for the BigQuery flow)

mkdir -p bw_db
# cp your sraids file into bw_db/sraids
# cp your BigQuery key into bw_db/bqKey.json
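A malformed sraids file is an easy way to get a confusing failure later, so a quick sanity check can help. The sketch below assumes one SRA run accession per line in the usual SRR/ERR/DRR format; if your accessions use other prefixes, adjust the pattern.

```python
# Quick sanity check for bw_db/sraids: one run accession per line.
# The SRR/ERR/DRR pattern is an assumption based on common SRA run IDs.
import re

ACC_RE = re.compile(r"^[SED]RR\d+$")

def read_accessions(path):
    with open(path) as fh:
        accs = [line.strip() for line in fh if line.strip()]
    bad = [a for a in accs if not ACC_RE.match(a)]
    if bad:
        raise ValueError(f"unexpected accession format: {bad[:5]}")
    return accs
```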

3a) Build parquet via BigQuery

You must have access to the nih-sra-datastore project (starred in your BigQuery Explorer) and a project and dataset set up as described above. Then run:

docker run --rm \
  -v $(pwd)/bw_db:/data/bw_db \
  -e GOOGLE_APPLICATION_CREDENTIALS=/data/bw_db/bqKey.json \
  branchwater-metadata bq \
    --acc /data/bw_db/sraids \
    --output /data/bw_db/metadata.parquet \
    --limit   # remove this flag to build the full dataset

Notes:

  • You can omit the env var and instead pass --key-path /data/bw_db/bqKey.json explicitly.
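If the BigQuery flow fails to authenticate, a common cause is a wrong or truncated key download. Service-account key files are JSON with fields such as project_id, client_email and private_key, so a stdlib-only check like the sketch below can catch that before running the container (the field list is the standard key format, not something this repo defines):

```python
# Sanity-check bw_db/bqKey.json before running the BigQuery flow.
# These fields are part of Google's standard service-account key format.
import json

REQUIRED = {"type", "project_id", "private_key", "client_email"}

def check_key(path):
    with open(path) as fh:
        key = json.load(fh)
    missing = REQUIRED - key.keys()
    if missing:
        raise ValueError(f"bqKey.json is missing fields: {sorted(missing)}")
    if key["type"] != "service_account":
        raise ValueError("bqKey.json is not a service-account key")
    return key["project_id"]
```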

3b) Build parquet via public SRA metadata on S3

This method doesn’t require BigQuery credentials. It may transfer a large amount of data on first use.

docker run --rm \
  -v $(pwd)/bw_db:/data/bw_db \
  branchwater-metadata sra \
    --acc /data/bw_db/sraids \
    --output /data/bw_db/metadata.parquet \
    --build-test-db   # remove this flag to build the full dataset

4) Load parquet into DuckDB

docker run --rm \
  -v $(pwd)/bw_db:/data/bw_db \
  branchwater-metadata duckdb /data/bw_db/metadata.parquet \
    --output /data/bw_db/metadata.duckdb --force
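The duckdb subcommand above boils down to loading the parquet file into a DuckDB table. As a rough sketch of the equivalent SQL (the table name "metadata" and the mapping of --force to CREATE OR REPLACE are assumptions about the container's behavior; `read_parquet` is standard DuckDB):

```python
# Build the DuckDB SQL equivalent of step 4 (assumed table name "metadata";
# --force is modeled as CREATE OR REPLACE, which overwrites an existing table).
def load_sql(parquet_path, table="metadata", force=False):
    verb = "CREATE OR REPLACE TABLE" if force else "CREATE TABLE"
    return f"{verb} {table} AS SELECT * FROM read_parquet('{parquet_path}');"
```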

Default container help

docker run --rm branchwater-metadata --help

Notes

  • The container exposes /data/bw_db as a volume; bind-mount a host folder to persist outputs.
  • The BigQuery flow defaults to reading the key from /data/bw_db/bqKey.json and also respects GOOGLE_APPLICATION_CREDENTIALS.
  • The older note in this README about editing bqtomongo.py is obsolete here; use the --limit or --build-test-db flags to keep builds small while testing.
  • We use polars-lts-cpu (instead of polars) to avoid requiring AVX2/FMA CPU features; this improves compatibility on older CPUs and in some container environments.
