This tool transforms the 2025 Crossref Public Data File into a BigQuery-compatible format. The processor handles several critical transformations:

  • Converts date structures into ISO-standard date strings (see the sketch after this list)
  • Resolves NULL value incompatibilities
  • Standardizes field names
  • Flattens nested arrays to comply with BigQuery's schema requirements
  • Provides special handling for problematic year fields and identifiers
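The date conversion and NULL handling are the heart of the transformation. As a rough illustration only (datePartsToISO is a hypothetical helper, not necessarily what processor.js contains):

// Hypothetical helper illustrating the date conversion; the real logic
// lives in processor.js and may differ in detail.
function datePartsToISO(field) {
  // Crossref dates arrive as { "date-parts": [[year, month, day]] },
  // where month and day can be absent for partial dates.
  const parts = field && field['date-parts'] && field['date-parts'][0];
  if (!parts || parts[0] == null) return null; // resolve NULL incompatibilities
  const [year, month = 1, day = 1] = parts;
  const pad = (n) => String(n).padStart(2, '0');
  return `${year}-${pad(month)}-${pad(day)}`; // ISO 8601 date string
}

// datePartsToISO({ 'date-parts': [[2025, 1, 15]] }) => '2025-01-15'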

The resulting dataset is publicly available in the I3 BigQuery Data repository.

Join our mailing group: https://groups.google.com/g/i3-bigquery

Visit our website: https://iii.pubpub.org/

Project Structure

.
├── jsonl-processor/     # Main processing module
│   ├── src/            # Source code
│   │   ├── process-all.js    # Main processing script
│   │   ├── processor.js      # Core processing logic
│   │   └── generate-schema.js # Schema generation for BigQuery
│   ├── data/          # Data directories
│   │   ├── raw/       # Raw input files (.jsonl.gz)
│   │   └── processed/ # Processed output files
│   └── logs/          # Processing logs

Requirements

  • Node.js
  • Sufficient disk space (around 400GB)
  • Linux environment

Installation

  1. Clone the repository:
git clone https://github.com/Innovation-Information-Initiative/bigquery_crossref.git
  2. Install dependencies:
cd jsonl-processor
npm install

Usage

Processing Files

The main processing script can be run with various options:

# Process all files
node src/process-all.js

# Process with debug mode
DEBUG=true node src/process-all.js

# Process a specific file
FILE=example.jsonl.gz node src/process-all.js

# Process without resuming previous work
RESUME=false node src/process-all.js

# Run in quiet mode (no progress display)
QUIET=true node src/process-all.js
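All options are plain environment variables, so they compose freely (e.g. DEBUG=true FILE=example.jsonl.gz node src/process-all.js). Conceptually, the script only needs to read them once at startup; a minimal sketch of that pattern, not the actual internals of process-all.js:

// Sketch of env-var option parsing; names mirror the options above.
const options = {
  debug: process.env.DEBUG === 'true',    // verbose diagnostics
  quiet: process.env.QUIET === 'true',    // suppress progress display
  resume: process.env.RESUME !== 'false', // resuming is the default
  file: process.env.FILE || null,         // restrict to a single input file
};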

Processing Output

Processed files are saved in the data/processed directory with the naming format: [file_number]_processed.jsonl.gz

Logs are stored in the logs directory with timestamps.

Generating BigQuery Schema

Before loading data into BigQuery, generate the schema from your processed files:

# First, install one of these schema generators:
npm install -g generate-schema
# OR
pip3 install bigquery-schema-generator

# Then run the schema generator
node jsonl-processor/src/generate-schema.js

For the complete 2025 Crossref public data file, the automatically generated schema will require additional manual edits. For convenience, a pre-configured schema file is provided in the repository.
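For intuition, schema generation boils down to sampling processed records and mapping JSON value types onto BigQuery column types. The following is a deliberately simplified sketch of that idea, not the logic of generate-schema.js or of the installed generators, which also merge types across many records:

// Simplified illustration: infer a BigQuery schema from a single record.
function inferBigQueryType(value) {
  if (Array.isArray(value)) {
    return { ...inferBigQueryType(value[0]), mode: 'REPEATED' };
  }
  if (value !== null && typeof value === 'object') {
    return { type: 'RECORD', mode: 'NULLABLE', fields: inferSchema(value) };
  }
  if (typeof value === 'number') {
    return { type: Number.isInteger(value) ? 'INTEGER' : 'FLOAT', mode: 'NULLABLE' };
  }
  if (typeof value === 'boolean') return { type: 'BOOLEAN', mode: 'NULLABLE' };
  return { type: 'STRING', mode: 'NULLABLE' };
}

function inferSchema(record) {
  return Object.entries(record).map(([name, value]) => ({ name, ...inferBigQueryType(value) }));
}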

Uploading Data to Google Cloud Storage

Upload all processed files to Google Cloud Storage:

gsutil -m cp jsonl-processor/data/processed/* gs://[bucket]
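If you prefer to stay in Node rather than shelling out to gsutil, the official @google-cloud/storage client can perform the same copy. A minimal sketch, assuming application default credentials are configured and [bucket] is replaced with your bucket name:

const { Storage } = require('@google-cloud/storage');
const fs = require('fs');
const path = require('path');

async function uploadProcessed(bucketName) {
  const storage = new Storage(); // uses application default credentials
  const dir = 'jsonl-processor/data/processed';
  for (const name of fs.readdirSync(dir)) {
    await storage.bucket(bucketName).upload(path.join(dir, name));
    console.log(`uploaded ${name}`);
  }
}

uploadProcessed('[bucket]').catch(console.error);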

Loading Data into BigQuery

Create a BigQuery table from the uploaded files:

bq load --source_format=NEWLINE_DELIMITED_JSON \
        --replace=true \
        --project_id=[project_name] \
        --max_bad_records=10 \
        [dataset_name.table_name] \
        gs://[bucket]/* \
        schema.json
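The same load can be expressed with the official @google-cloud/bigquery client. A sketch equivalent to the bq command above, with the bracketed names again as placeholders and schema.json assumed to be the field list in bq's schema format:

const { BigQuery } = require('@google-cloud/bigquery');
const { Storage } = require('@google-cloud/storage');

async function loadProcessed() {
  const bigquery = new BigQuery({ projectId: '[project_name]' });
  const storage = new Storage();
  const fields = require('./schema.json'); // the provided, manually adjusted schema

  await bigquery
    .dataset('[dataset_name]')
    .table('[table_name]')
    .load(storage.bucket('[bucket]').file('*'), { // wildcard, as in gs://[bucket]/*
      sourceFormat: 'NEWLINE_DELIMITED_JSON',
      writeDisposition: 'WRITE_TRUNCATE', // equivalent of --replace=true
      maxBadRecords: 10,
      schema: { fields },
    });
  console.log('Load complete');
}

loadProcessed().catch(console.error);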

License

This project is licensed under the MIT License - see the LICENSE file for details.