
This repo contains the files needed to generate metadata for Branchwater metagenome data

MGS-sails/Branchwater-Metadata


Building the DuckDB database for the webapp can be done via BigQuery. To build the database, you need a BigQuery service account associated with a project named 'sraproject', and that project must contain a dataset named 'mastiffdata'.

Right now the query is set up to pull metadata for just 150,000 accessions, to keep the database small while running the app locally for troubleshooting. To remove this limit and pull metadata for all accessions, remove 'LIMIT 150000' from L75 of bqtomongo.py.
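For orientation, the query in bqtomongo.py amounts to something like the sketch below. The public SRA metadata table name `nih-sra-datastore.sra.metadata` comes from NCBI's BigQuery setup docs; the `acc` column filter and the exact statement shape are assumptions, not a copy of the script.

```python
# Sketch of the kind of query bqtomongo.py sends to BigQuery (assumed shape;
# see the script itself for the real statement and its LIMIT on L75).
def build_query(limit=150000):
    query = (
        "SELECT * FROM `nih-sra-datastore.sra.metadata` "
        "WHERE acc IN UNNEST(@accessions)"
    )
    if limit:  # drop the limit to pull metadata for all accessions
        query += f" LIMIT {limit}"
    return query
```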

Set up SRA metadata access

Based on the SRA instructions at Setting-up BigQuery

  1. Create project - sraproject

  2. Go to BigQuery search tool

    1. In the Explorer panel select +ADD to add data

    2. Select Star a project by name

    3. Search: nih-sra-datastore and select it

Create service account key

  1. Go to navigation menu -> IAM & Admin -> service accounts -> + CREATE SERVICE ACCOUNT

    • name: sraquery

    • ID: sraquery

      • NOTE - the actual project ID is autogenerated

    • Roles: BigQuery Job User; BigQuery Data Owner; BigQuery Read Sessions User

  2. Once the service account is created, open the menu under Actions and choose Manage keys

    1. Select ADD KEY

    2. Select Create new key

    3. Key type: JSON

    4. Download the key to the bw_db/ folder

    5. Save it as bqKey.json

  3. In the BigQuery console, under sraproject create a dataset named mastiffdata

Dockerized metadata builder

This folder is now containerized. You can build and run a Docker image that:

  • builds the metadata parquet via BigQuery or via the public SRA metadata on S3, and
  • loads the parquet into a DuckDB file.

A persistent host directory is recommended for inputs/outputs and credentials. The container expects them at /data/bw_db.

1) Build the image

docker build -t branchwater-metadata .

2) Prepare local data directory

On the host, create a directory for inputs/outputs, e.g. $(pwd)/bw_db, and place your files there:

  • sraids — a text file with accession IDs (one per line)
  • bqKey.json — BigQuery service account JSON (only needed for the BigQuery flow)

mkdir -p bw_db
# cp your sraids file into bw_db/sraids
# cp your BigQuery key into bw_db/bqKey.json
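A malformed sraids file is an easy way to get a confusing failure later, so a quick sanity check can help. The sketch below assumes one SRA run accession per line in the usual SRR/ERR/DRR format; if your accessions use other prefixes, adjust the pattern.

```python
# Quick sanity check for bw_db/sraids: one run accession per line.
# The SRR/ERR/DRR pattern is an assumption based on common SRA run IDs.
import re

ACC_RE = re.compile(r"^[SED]RR\d+$")

def read_accessions(path):
    with open(path) as fh:
        accs = [line.strip() for line in fh if line.strip()]
    bad = [a for a in accs if not ACC_RE.match(a)]
    if bad:
        raise ValueError(f"unexpected accession format: {bad[:5]}")
    return accs
```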

3a) Build parquet via BigQuery

You must have access to the nih-sra-datastore project (starred in your BigQuery Explorer) and a project and dataset set up as described above. Then run:

docker run --rm \
  -v $(pwd)/bw_db:/data/bw_db \
  -e GOOGLE_APPLICATION_CREDENTIALS=/data/bw_db/bqKey.json \
  branchwater-metadata bq \
    --acc /data/bw_db/sraids \
    --output /data/bw_db/metadata.parquet \
    --limit   # remove this flag to build the full dataset

Notes:

  • You can omit the env var and instead pass --key-path /data/bw_db/bqKey.json explicitly.
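If the BigQuery flow fails to authenticate, a common cause is a wrong or truncated key download. Service-account key files are JSON with fields such as project_id, client_email and private_key, so a stdlib-only check like the sketch below can catch that before running the container (the field list is the standard key format, not something this repo defines):

```python
# Sanity-check bw_db/bqKey.json before running the BigQuery flow.
# These fields are part of Google's standard service-account key format.
import json

REQUIRED = {"type", "project_id", "private_key", "client_email"}

def check_key(path):
    with open(path) as fh:
        key = json.load(fh)
    missing = REQUIRED - key.keys()
    if missing:
        raise ValueError(f"bqKey.json is missing fields: {sorted(missing)}")
    if key["type"] != "service_account":
        raise ValueError("bqKey.json is not a service-account key")
    return key["project_id"]
```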

3b) Build parquet via public SRA metadata on S3

This method doesn’t require BigQuery credentials. It may transfer a large amount of data on first use.

docker run --rm \
  -v $(pwd)/bw_db:/data/bw_db \
  branchwater-metadata sra \
    --acc /data/bw_db/sraids \
    --output /data/bw_db/metadata.parquet \
    --build-test-db   # remove this flag to build the full dataset

4) Load parquet into DuckDB

docker run --rm \
  -v $(pwd)/bw_db:/data/bw_db \
  branchwater-metadata duckdb /data/bw_db/metadata.parquet \
    --output /data/bw_db/metadata.duckdb --force
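The duckdb subcommand above boils down to loading the parquet file into a DuckDB table. As a rough sketch of the equivalent SQL (the table name "metadata" and the mapping of --force to CREATE OR REPLACE are assumptions about the container's behavior; `read_parquet` is standard DuckDB):

```python
# Build the DuckDB SQL equivalent of step 4 (assumed table name "metadata";
# --force is modeled as CREATE OR REPLACE, which overwrites an existing table).
def load_sql(parquet_path, table="metadata", force=False):
    verb = "CREATE OR REPLACE TABLE" if force else "CREATE TABLE"
    return f"{verb} {table} AS SELECT * FROM read_parquet('{parquet_path}');"
```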

Default container help

docker run --rm branchwater-metadata --help

Notes

  • The container exposes /data/bw_db as a volume; bind-mount a host folder to persist outputs.
  • The BigQuery flow defaults to reading the key from /data/bw_db/bqKey.json and also respects GOOGLE_APPLICATION_CREDENTIALS.
  • The older note in this README about editing bqtomongo.py is obsolete here; use the --limit or --build-test-db flags to keep builds small while testing.
  • We use polars-lts-cpu (instead of polars) to avoid requiring AVX2/FMA CPU features; this improves compatibility on older CPUs and in some container environments.
