This tool transforms the 2025 Crossref Public Data File into a BigQuery-compatible format. The processor handles several critical transformations:
- Converts date structures into ISO-standard date strings
- Resolves NULL value incompatibilities
- Standardizes field names
- Flattens nested arrays to comply with BigQuery's schema requirements
- Provides special handling for problematic year fields and identifiers
The resulting dataset is publicly available on the I3 Bigquery Data repository here
Join our mailing group: https://groups.google.com/g/i3-bigquery
Visit our website: https://iii.pubpub.org/
.
├── jsonl-processor/ # Main processing module
│ ├── src/ # Source code
│ │ ├── process-all.js # Main processing script
│ │ ├── processor.js # Core processing logic
│ │ └── generate-schema.js # Schema generation for BigQuery
│ ├── data/ # Data directories
│ │ ├── raw/ # Raw input files (.jsonl.gz)
│ │ └── processed/ # Processed output files
│ └── logs/ # Processing logs
- Node.js
- Sufficient disk space (around 400GB)
- Linux environment
- Clone the repository:
git clone https://github.com/Innovation-Information-Initiative/bigquery_crossref.git- Install dependencies:
cd jsonl-processor
npm installThe main processing script can be run with various options:
# Process all files
node src/process-all.js
# Process with debug mode
DEBUG=true node src/process-all.js
# Process a specific file
FILE=example.jsonl.gz node src/process-all.js
# Process without resuming previous work
RESUME=false node src/process-all.js
# Run in quiet mode (no progress display)
QUIET=true node src/process-all.jsProcessed files are saved in the data/processed directory with the naming format:
[file_number]_processed.jsonl.gz
Logs are stored in the logs directory with timestamps.
Before loading data into BigQuery, generate the schema from your processed files:
# First, install one of these schema generators:
npm install -g generate-schema
# OR
pip3 install bigquery-schema-generator
# Then run the schema generator
node jsonl-processor/src/generate-schema.jsFor the complete 2025 Crossref public data file, the automatically generated schema will require additional manual edits. For convenience, a pre-configured schema file is provided in the repository.
Upload all processed files to Google Cloud Storage:
gsutil -m cp jsonl-processor/data/processed/* gs://[bucket]Create a BigQuery table from the uploaded files:
bq load --source_format=NEWLINE_DELIMITED_JSON \
--replace=true \
--project_id=[project_name] \
--max_bad_records=10 \
[dataset_name.table_name] \
gs://[bucket]/* \
schema.jsonThis project is licensed under the MIT License - see the LICENSE file for details.