OpenAlex data to Google BigQuery

OpenAlex data requires some modifications before it can be loaded into BigQuery (BQ) as columnar data. Namely, all hyphens in tag names must be removed and missing arrays must be added.
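As a rough illustration of these two modifications, the sketch below renames hyphenated keys recursively and fills in missing array fields. The function names (normalizeKeys, fillMissingArrays), the default "_" replacement, and the field list passed in are assumptions for illustration, not the repo's actual code.

```javascript
// Recursively rename object keys so they contain no hyphens.
// Replacing with "_" is an assumption; pass "" to drop hyphens entirely.
function normalizeKeys(value, replacement = "_") {
  if (Array.isArray(value)) {
    return value.map((item) => normalizeKeys(item, replacement));
  }
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) => [
        key.replace(/-/g, replacement),
        normalizeKeys(inner, replacement),
      ])
    );
  }
  return value;
}

// Make sure each listed top-level field holds an array instead of null/undefined.
// The list of fields is hypothetical and would differ per dataset.
function fillMissingArrays(record, arrayFields) {
  const out = { ...record };
  for (const field of arrayFields) {
    if (out[field] == null) out[field] = [];
  }
  return out;
}
```

Each line of a JSON Lines file would be parsed, passed through both functions, and written back out as one JSON object per line.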

The source data is hosted on AWS at https://openalex.s3.amazonaws.com/.

This repo provides Node.js scripts and instructions to convert and upload the data. It uses pixi for reproducible environment management.

Available datasets

authors
awards
concepts
domains
fields
funders
institutions
publishers
sources
subfields
topics
works

Instructions

Configure settings

Copy .env.example to .env and fill in your values:

VERSION="20260210"          # Table version suffix (yyyymmdd)
PROJECT_ID="your-project-id"
GCS_BUCKET="your-bucket-name"
BQ_DATASET="openalex"
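To show how these settings might be consumed, here is a minimal sketch that parses the quoted values and derives a versioned table name. parseEnv and versionedTable are hypothetical helpers, and the parser only handles the KEY="value" form shown above; the repo's scripts may read .env differently.

```javascript
// Parse KEY="value" lines. Anything after the closing quote
// (such as a trailing # comment) is ignored.
function parseEnv(text) {
  const env = {};
  for (const line of text.split(/\r?\n/)) {
    const match = /^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=\s*"([^"]*)"/.exec(line);
    if (match) env[match[1]] = match[2];
  }
  return env;
}

// Build a versioned table name such as works_20260210.
function versionedTable(dataset, version) {
  if (!/^\d{8}$/.test(version)) {
    throw new Error(`VERSION must be yyyymmdd, got "${version}"`);
  }
  return `${dataset}_${version}`;
}
```

Rejecting a malformed VERSION early avoids creating tables with inconsistent suffixes.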

Set up the environment with pixi

pixi install
pixi run setup

Download data from AWS

Download one or more datasets using the download task (see list above).

pixi run download works
pixi run download works authors
pixi run download all        # download everything

Convert files

Only works, concepts, institutions, and sources require conversion (fixing null arrays and stringifying the inverted abstract index). All other datasets can be uploaded as-is. Already-converted files are skipped on re-run, so it is safe to interrupt and resume.
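OpenAlex stores abstracts in the abstract_inverted_index field, an object that maps each word to its positions. A minimal sketch of the stringifying step, assuming it simply serializes that object to JSON text (the repo may instead rebuild the plain abstract; stringifyInvertedIndex is a hypothetical name):

```javascript
// Replace the inverted-index object with its JSON text so that
// BigQuery sees a single string value instead of one key per word.
function stringifyInvertedIndex(work) {
  const index = work.abstract_inverted_index;
  if (index == null) return work; // nothing to convert
  return { ...work, abstract_inverted_index: JSON.stringify(index) };
}
```

For example, { Hello: [0], world: [1] } becomes the string '{"Hello":[0],"world":[1]}'.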

For large datasets like works (~600GB), you can process in batches by appending a folder range. Folders are sorted alphabetically; use count to see the total. By default, files are processed concurrently with one worker per CPU. Use --parallel=N to set the number of workers.

pixi run convert works count       # show number of folders
pixi run convert works 1-50        # convert folders 1-50
pixi run convert works 51-100      # convert folders 51-100
pixi run convert --parallel=4 works 1-50  # 4 workers, folders 1-50
pixi run convert works             # all folders
pixi run convert all               # convert works, concepts, institutions, sources
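The range argument can be pictured as a slice over the sorted folder list. The sketch below assumes 1-based, inclusive ranges (consistent with 1-50 followed by 51-100); selectFolders is a hypothetical helper, not the repo's implementation.

```javascript
// Pick folders for a "start-end" range (1-based, inclusive).
// Without a range, all folders are returned in sorted order.
function selectFolders(folders, range) {
  const sorted = [...folders].sort();
  if (!range) return sorted;
  const match = /^(\d+)-(\d+)$/.exec(range);
  if (!match) throw new Error(`Invalid range: ${range}`);
  const start = Number(match[1]);
  const end = Number(match[2]);
  if (start < 1 || end < start) throw new Error(`Invalid range: ${range}`);
  return sorted.slice(start - 1, end);
}
```

Because the list is sorted before slicing, the same range always selects the same folders, which is what makes batch runs resumable.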

Generate schemas

Uses bigquery-schema-generator to generate BQ schemas from the data. Prefers converted data when available, otherwise falls back to raw. If a schema file already exists, it is updated incrementally (new fields are merged in).

Use -m N to limit the number of files sampled per dataset (useful for quick schema drafts).

pixi run schema authors
pixi run schema all              # generate all schemas
pixi run schema -m 5 works       # sample at most 5 files
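The incremental update can be thought of as a merge by field name. The sketch below is an illustration of that idea on BigQuery-style schema arrays, not the logic of bigquery-schema-generator itself; mergeSchemas is a hypothetical function, and type conflicts are ignored (the existing field wins).

```javascript
// Merge two BigQuery-style schema arrays by field name.
// New fields are appended; RECORD fields are merged recursively.
function mergeSchemas(existing, incoming) {
  const byName = new Map(existing.map((field) => [field.name, { ...field }]));
  for (const field of incoming) {
    const current = byName.get(field.name);
    if (!current) {
      byName.set(field.name, field);
    } else if (current.type === "RECORD" && field.type === "RECORD") {
      current.fields = mergeSchemas(current.fields ?? [], field.fields ?? []);
    }
  }
  return [...byName.values()];
}
```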

Validate record counts

Checks that converted files match the record counts in the OpenAlex manifest. Decompression runs in parallel (half the available cores) and uses pigz when available for faster validation.

Results are saved to data/validation/<VERSION>/<dataset>.tsv as each file completes, so interrupted runs resume where they left off. Delete the .tsv file to re-validate from scratch.

pixi run validate works
pixi run validate works concepts
pixi run validate all            # validate everything
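A minimal sketch of the check itself, assuming the manifest lists files under entries with a url and a meta.record_count, and that each data file is gzipped JSON Lines. countRecords and checkAgainstManifest are hypothetical names; a real run would stream large files rather than decompress them into memory.

```javascript
const zlib = require("node:zlib");

// Count non-empty lines in a gzipped JSON Lines buffer.
function countRecords(gzBuffer) {
  const text = zlib.gunzipSync(gzBuffer).toString("utf8");
  return text.split("\n").filter((line) => line.trim() !== "").length;
}

// Compare counted records with the manifest's expected counts.
// countsByUrl maps each manifest url to the number of records found locally.
function checkAgainstManifest(manifest, countsByUrl) {
  return manifest.entries.map((entry) => {
    const expected = entry.meta.record_count;
    const actual = countsByUrl[entry.url];
    return { url: entry.url, expected, actual, ok: actual === expected };
  });
}
```

Any entry with ok set to false points at a file worth re-downloading or re-converting.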

Upload to GCS

Uploads files to GCS, tracking progress in data/upload/<VERSION>/<dataset>.tsv. Already-uploaded files (status OK) are skipped on re-run; failed uploads are retried automatically.

Use -m N to limit to N new files per dataset.

pixi run upload works
pixi run upload -m 5 works       # upload at most 5 new files
pixi run upload all              # upload everything
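The skip-and-retry behavior can be sketched as a filter over the tracking file. The two-column layout (file path, then status) and the pendingUploads helper are assumptions; the actual .tsv may carry more columns, and the limit here counts retried files as well as new ones.

```javascript
// Return the files that still need uploading, capped at maxNew.
// Files whose tracking row has status OK are skipped; anything else is retried.
function pendingUploads(allFiles, trackingTsv, maxNew = Infinity) {
  const done = new Set();
  for (const line of trackingTsv.split("\n")) {
    const [file, status] = line.split("\t");
    if (status === "OK") done.add(file);
  }
  return allFiles.filter((file) => !done.has(file)).slice(0, maxNew);
}
```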

Load into BigQuery

Creates tables with versioned names (e.g. works_20260210).

pixi run bq works
pixi run bq works concepts
pixi run bq all              # load everything

Make tables public

Grants allUsers read access (roles/bigquery.dataViewer) to BigQuery tables.

pixi run public works
pixi run public all              # all tables

Cleanup

Remove local data files for the current version:

pixi run cleanup

Move GCS files to archive storage class:

pixi run gcs-archive

Notes

Inverted abstracts are converted to strings.
Concepts.international is dropped because it was a headache to deal with.

Enjoy!
