The DuckDB database for the webapp can be built via BigQuery. Building the db requires a BigQuery service account associated with a project named 'sraproject', and that project needs a dataset called 'mastiffdata'.
Right now the query is set up to pull metadata for just 150,000 accessions, to keep the db small while running locally for app troubleshooting. To remove this limit and pull metadata for all accessions, remove 'LIMIT 150000' from L75 of bqtomongo.py.
Based on the SRA instructions at Setting-up BigQuery:

- Create project: `sraproject`
- Go to the BigQuery search tool
- In the Explorer panel, select `+ ADD` to add data
- Select "Star a project by name"
- Search for `nih-sra-datastore` and select it
- Go to the navigation menu -> `IAM & Admin` -> `Service Accounts` -> `+ CREATE SERVICE ACCOUNT`
  - name: `sraquery`
  - ID: `sraquery` (NOTE: the actual project ID is autogenerated)
  - Roles: `BigQuery Job User`; `BigQuery Data Owner`; `BigQuery Read Sessions User`
- Once the service account is created, click the menu bar under `Actions` and choose `Manage keys`
  - Select `Add key` -> `Create new key`
  - Key type: JSON
  - Download the key to the `bw_db/` folder and save it as `bqKey.json`
- In the BigQuery console, under `sraproject`, create a dataset named `mastiffdata`
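Before wiring the downloaded key into anything, it can help to sanity-check that the file is a well-formed service-account key. A minimal stdlib-only sketch (the `check_key` helper is hypothetical, not part of this repo; the required fields listed are the ones Google includes in service-account JSON keys):

```python
import json
from pathlib import Path

# Fields present in every Google service-account JSON key
REQUIRED_FIELDS = {"type", "project_id", "private_key", "client_email"}

def check_key(path):
    """Return the parsed key if it looks like a service-account key, else raise."""
    key = json.loads(Path(path).read_text())
    missing = REQUIRED_FIELDS - key.keys()
    if missing:
        raise ValueError(f"key is missing fields: {sorted(missing)}")
    if key["type"] != "service_account":
        raise ValueError(f"unexpected key type: {key['type']}")
    return key
```

A quick check like this catches the common mistake of downloading the wrong key type (e.g. an OAuth client secret) before it surfaces as an opaque auth error later.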
This folder is now containerized. You can build and run a Docker image that:
- builds the metadata parquet via BigQuery or via the public SRA metadata on S3, and
- loads the parquet into a DuckDB file.
A persistent host directory is recommended for inputs/outputs and credentials. The container expects them at /data/bw_db.
```
docker build -t branchwater-metadata .
```
On the host, create a directory for inputs/outputs, e.g. `$(pwd)/bw_db`, and place your files there:

- `sraids`: a text file with accession IDs (one per line)
- `bqKey.json`: BigQuery service account JSON (only needed for the BigQuery flow)

```
mkdir -p bw_db
# cp your sraids file into bw_db/sraids
# cp your BigQuery key into bw_db/bqKey.json
```
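For reference, the `sraids` file is just one accession per line; a small stdlib-only sketch of reading and validating it (hypothetical helper; the pattern assumes SRA run accessions like `SRR`/`ERR`/`DRR` plus digits, though the build scripts may accept other ID types):

```python
import re
from pathlib import Path

# Assumption: run accessions look like SRR/ERR/DRR followed by digits.
ACCESSION_RE = re.compile(r"^[SED]RR\d+$")

def load_accessions(path):
    """Read one accession per line, skip blanks, and flag malformed IDs."""
    lines = [ln.strip() for ln in Path(path).read_text().splitlines()]
    accs = [ln for ln in lines if ln]
    bad = [a for a in accs if not ACCESSION_RE.match(a)]
    if bad:
        raise ValueError(f"malformed accessions (first few): {bad[:5]}")
    return accs
```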
You must have access to the nih-sra-datastore project (starred in your BigQuery explorer) and a project/dataset per the README above. Run:

```
docker run --rm \
  -v $(pwd)/bw_db:/data/bw_db \
  -e GOOGLE_APPLICATION_CREDENTIALS=/data/bw_db/bqKey.json \
  branchwater-metadata bq \
  --acc /data/bw_db/sraids \
  --output /data/bw_db/metadata.parquet \
  --limit  # remove this flag to build the full dataset
```
Notes:

- You can omit the env var and instead pass `--key-path /data/bw_db/bqKey.json` explicitly.
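Conceptually, the BigQuery flow assembles a query against the public SRA metadata table (`nih-sra-datastore.sra.metadata`) filtered to your accession list, with an optional row limit. A hedged sketch of how such a query string might be built; the actual query in the build scripts may select specific columns and apply its limit differently, and the `acc` column name is an assumption:

```python
def build_metadata_query(accessions, limit=None):
    """Assemble an illustrative BigQuery SQL string for the given accessions.

    Sketch only: the real query may differ in columns and limit handling.
    """
    acc_list = ", ".join(f"'{a}'" for a in accessions)
    sql = (
        "SELECT * FROM `nih-sra-datastore.sra.metadata` "
        f"WHERE acc IN ({acc_list})"
    )
    if limit is not None:
        sql += f" LIMIT {limit}"
    return sql
```

This is why dropping the limit flag pulls the full dataset: without it, no `LIMIT` clause is appended and every matching accession's metadata is returned.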
This method doesn’t require BigQuery credentials. It may transfer a large amount of data on first use.
```
docker run --rm \
  -v $(pwd)/bw_db:/data/bw_db \
  branchwater-metadata sra \
  --acc /data/bw_db/sraids \
  --output /data/bw_db/metadata.parquet \
  --build-test-db  # remove this flag to build the full dataset
```
```
docker run --rm \
  -v $(pwd)/bw_db:/data/bw_db \
  branchwater-metadata duckdb /data/bw_db/metadata.parquet \
  --output /data/bw_db/metadata.duckdb --force
```
```
docker run --rm branchwater-metadata --help
```
- The container exposes `/data/bw_db` as a volume; bind-mount a host folder to persist outputs.
- The BigQuery flow defaults to reading the key from `/data/bw_db/bqKey.json` and also respects `GOOGLE_APPLICATION_CREDENTIALS`.
- The older note in this README about editing `bqtomongo.py` is obsolete here; use the `--limit` or `--build-test-db` flags to keep builds small while testing.
- We use `polars-lts-cpu` (instead of `polars`) to avoid requiring AVX2/FMA CPU features; this improves compatibility on older CPUs and in some container environments.