GitHub - Innovation-Information-Initiative/bigquery_crossref: Tool to process and upload bulk Crossref data to BigQuery

This tool transforms the 2025 Crossref Public Data File into a BigQuery-compatible format. The processor handles several critical transformations:

Converts date structures into ISO-standard date strings
Resolves NULL value incompatibilities
Standardizes field names
Flattens nested arrays to comply with BigQuery's schema requirements
Provides special handling for problematic year fields and identifiers
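
As a rough illustration of the first three transformations, here is a minimal sketch in Node.js. It is not the code in processor.js: the function names are hypothetical, and the choice to default a missing month or day to 1 is an assumption made here so that the result is a complete date. It relies on the Crossref convention of storing dates as { "date-parts": [[year, month, day]] }. It does not cover array flattening or the special year and identifier handling.

```javascript
// Hypothetical helpers for illustration; not taken from processor.js.

// Convert a Crossref-style date object, e.g. { "date-parts": [[2020, 5, 12]] },
// into a YYYY-MM-DD string. A missing month or day defaults to 1 (assumption).
function datePartsToIso(dateObj) {
  const parts = dateObj && dateObj['date-parts'] && dateObj['date-parts'][0];
  if (!Array.isArray(parts) || !Number.isInteger(parts[0])) return null;
  const year = parts[0];
  const month = Number.isInteger(parts[1]) ? parts[1] : 1;
  const day = Number.isInteger(parts[2]) ? parts[2] : 1;
  const pad = (n, width) => String(n).padStart(width, '0');
  return `${pad(year, 4)}-${pad(month, 2)}-${pad(day, 2)}`;
}

// Replace every character other than letters, digits, and underscores,
// so that "container-title" becomes "container_title".
function standardizeKey(key) {
  return key.replace(/[^A-Za-z0-9_]/g, '_');
}

// Recursively rename keys, collapse date objects to strings, and drop nulls.
// Note: any sibling keys of "date-parts" in a date object are discarded here.
function cleanRecord(value) {
  if (Array.isArray(value)) {
    return value.map(cleanRecord).filter((v) => v !== null && v !== undefined);
  }
  if (value !== null && typeof value === 'object') {
    if ('date-parts' in value) return datePartsToIso(value);
    const out = {};
    for (const [key, child] of Object.entries(value)) {
      const cleaned = cleanRecord(child);
      if (cleaned !== null && cleaned !== undefined) {
        out[standardizeKey(key)] = cleaned;
      }
    }
    return out;
  }
  return value;
}
```

Dropping null values outright is only one way to resolve NULL incompatibilities; the actual processor may handle them differently.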

The resulting dataset is publicly available in the I3 BigQuery data repository.

Join our mailing group: https://groups.google.com/g/i3-bigquery

Visit our website: https://iii.pubpub.org/

Project Structure

.
├── jsonl-processor/     # Main processing module
│   ├── src/            # Source code
│   │   ├── process-all.js    # Main processing script
│   │   ├── processor.js      # Core processing logic
│   │   └── generate-schema.js # Schema generation for BigQuery
│   ├── data/          # Data directories
│   │   ├── raw/       # Raw input files (.jsonl.gz)
│   │   └── processed/ # Processed output files
│   └── logs/          # Processing logs

Requirements

Node.js
Sufficient disk space (around 400 GB)
Linux environment

Installation

Clone the repository:

git clone https://github.com/Innovation-Information-Initiative/bigquery_crossref.git

Install dependencies:

cd bigquery_crossref/jsonl-processor
npm install

Usage

Processing Files

The main processing script can be run with various options:

# Process all files
node src/process-all.js

# Process with debug mode
DEBUG=true node src/process-all.js

# Process a specific file
FILE=example.jsonl.gz node src/process-all.js

# Process without resuming previous work
RESUME=false node src/process-all.js

# Run in quiet mode (no progress display)
QUIET=true node src/process-all.js
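
The options above are passed as environment variables. A script could read them roughly as follows. This is a hypothetical sketch, not the parsing in process-all.js: the function name, the returned property names, and the defaults (resume enabled unless RESUME=false, all files processed unless FILE is set) are assumptions inferred from the commands above.

```javascript
// Hypothetical option parsing; the real process-all.js may differ.
function readOptions(env = process.env) {
  return {
    debug: env.DEBUG === 'true',    // verbose output only when DEBUG=true
    quiet: env.QUIET === 'true',    // suppress the progress display
    resume: env.RESUME !== 'false', // assumed default: resume previous work
    file: env.FILE || null,         // null is taken to mean "process every file"
  };
}

const options = readOptions();
if (options.debug) {
  console.log('Options:', options);
}
```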

Processing Output

Processed files are saved in the data/processed directory with the naming format: [file_number]_processed.jsonl.gz

Logs are stored in the logs directory with timestamps.
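
To spot-check a processed file before uploading it, something like the following sketch can confirm that every line parses as JSON. The function name is hypothetical and not part of the repository. It decompresses the whole file in memory, which should be acceptable for a single file but is not a substitute for streaming when files are very large.

```javascript
const zlib = require('zlib');

// Count the valid JSON lines in a gzipped JSONL buffer and record the
// 1-based line numbers of any lines that fail to parse.
function checkJsonlGz(buffer) {
  const text = zlib.gunzipSync(buffer).toString('utf8');
  let records = 0;
  const badLines = [];
  text.split('\n').forEach((line, index) => {
    if (line.trim() === '') return; // ignore blank lines, e.g. the trailing one
    try {
      JSON.parse(line);
      records += 1;
    } catch (err) {
      badLines.push(index + 1);
    }
  });
  return { records, badLines };
}

// Usage (the file name is only an example of the naming format above):
// const fs = require('fs');
// console.log(checkJsonlGz(fs.readFileSync('data/processed/0_processed.jsonl.gz')));
```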

Generating BigQuery Schema

Before loading data into BigQuery, generate the schema from your processed files:

# First, install one of these schema generators:
npm install -g generate-schema
# OR
pip3 install bigquery-schema-generator

# Then run the schema generator (path shown is relative to the repository root)
node jsonl-processor/src/generate-schema.js

For the complete 2025 Crossref Public Data File, the automatically generated schema requires additional manual edits. For convenience, a pre-configured schema file is provided in the repository.
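
When a schema is edited by hand, a quick structural check can catch mistakes before the load step. The sketch below is a hypothetical helper, not part of the repository. It assumes the usual BigQuery JSON schema layout (an array of objects with name, type, an optional mode, and nested fields for RECORD types) and applies a deliberately conservative naming rule of letters, digits, and underscores.

```javascript
const VALID_MODES = new Set(['NULLABLE', 'REQUIRED', 'REPEATED']);

// Return a list of human-readable problems found in a schema array.
function findSchemaProblems(fields, prefix = '') {
  const problems = [];
  const seen = new Set();
  for (const field of fields) {
    const path = prefix + field.name;
    // Conservative rule: start with a letter or underscore, then word characters.
    if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(field.name || '')) {
      problems.push(`${path}: name should use only letters, digits, and underscores`);
    }
    // Column names are compared case-insensitively when checking for duplicates.
    const lower = String(field.name).toLowerCase();
    if (seen.has(lower)) problems.push(`${path}: duplicate field name`);
    seen.add(lower);
    if (field.mode && !VALID_MODES.has(field.mode)) {
      problems.push(`${path}: unknown mode ${field.mode}`);
    }
    if (field.type === 'RECORD' || field.type === 'STRUCT') {
      if (!Array.isArray(field.fields) || field.fields.length === 0) {
        problems.push(`${path}: RECORD without nested fields`);
      } else {
        problems.push(...findSchemaProblems(field.fields, `${path}.`));
      }
    }
  }
  return problems;
}

// Usage (assumes schema.json is in the current directory):
// const schema = JSON.parse(require('fs').readFileSync('schema.json', 'utf8'));
// console.log(findSchemaProblems(schema));
```

An empty result only means the schema is structurally plausible; it does not guarantee that the schema matches the data.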

Uploading Data to BigQuery

Upload all processed files to Google Cloud Storage:

gsutil -m cp jsonl-processor/data/processed/* gs://[bucket]

Loading Data into BigQuery

Create a BigQuery table from the uploaded files:

bq load --source_format=NEWLINE_DELIMITED_JSON \
        --replace=true \
        --project_id=[project_name] \
        --max_bad_records=10 \
        [dataset_name.table_name] \
        gs://[bucket]/* \
        schema.json

License

This project is licensed under the MIT License - see the LICENSE file for details.
