Diffbot Neo4j Import

Import Diffbot knowledge graph data into Neo4j and enrich it with OpenAI embeddings.

Overview

This project provides tools to:

  • Import Diffbot knowledge graph data (Organizations and Persons) into Neo4j
  • Fetch and import articles from Diffbot API based on entity tags
  • Generate and store OpenAI embeddings for entities and article chunks
  • Process embeddings via OpenAI Batch API for cost efficiency

Prerequisites

  • Python 3.13+
  • uv package manager
  • Neo4j database instance
  • Diffbot API token
  • OpenAI API token

Installation

  1. Clone the repository:
git clone <repository-url>
cd diffbot-neo4j-import
  2. Install dependencies using uv:
uv sync
  3. Create a .env file from the example:
cp .env.example .env
  4. Edit .env and add your credentials:
DIFFBOT_TOKEN=your_diffbot_token_here
OPENAI_API_TOKEN=your_openai_api_token_here
NEO4J_URI=neo4j+s://your-neo4j-instance.com:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=your_neo4j_password_here

Project Structure

  • diffbot_import.py - Import Organizations and Persons from Diffbot JSON exports to Neo4j
  • diffbot_articles.py - Fetch articles from Diffbot API and import with embeddings
  • export_openai_json.py - Export entity data to JSONL format for OpenAI Batch API
  • upload_openai.py - Upload and process embeddings via OpenAI Batch API
  • main.py - Simple hello world entry point

Obtaining Source Data

Before importing data, you need to download the Diffbot knowledge graph exports. Create a data directory and fetch the data:

# Create the data directory (default: ./data)
mkdir -p ./data

# Download organizations (all orgs with importance > 1)
curl "https://kg.diffbot.com/kg/v3/dql?type=query&token=<DIFFBOT_TOKEN>&query=type%3AOrganization+importance%3E1&size=-1" > ./data/organizations.json

# Download persons (optional, similar query)
curl "https://kg.diffbot.com/kg/v3/dql?type=query&token=<DIFFBOT_TOKEN>&query=type%3APerson+importance%3E1&size=-1" > ./data/persons.json

Replace <DIFFBOT_TOKEN> with your actual Diffbot API token.

Note: These files can be very large (several GB). Ensure you have sufficient disk space. You can configure a different data directory by setting DATA_DIR in your .env file.
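The downloads can also be scripted instead of hand-assembling the encoded URLs above. A minimal Python sketch, assuming only the DQL endpoint shown in the curl commands (the `build_dql_url` helper is ours, not part of this repo):

```python
from urllib.parse import urlencode

DQL_ENDPOINT = "https://kg.diffbot.com/kg/v3/dql"

def build_dql_url(token: str, query: str, size: int = -1) -> str:
    """Build a Diffbot DQL export URL; size=-1 requests all matches."""
    params = {"type": "query", "token": token, "query": query, "size": size}
    # urlencode handles the percent-encoding done by hand in the curl examples
    return f"{DQL_ENDPOINT}?{urlencode(params)}"

url = build_dql_url("YOUR_TOKEN", "type:Organization importance>1")
```

`urlencode` produces the same `type%3AOrganization+importance%3E1` encoding that appears in the curl commands.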

Usage

1. Import Diffbot Knowledge Graph Data

Import organizations or persons from Diffbot JSON exports:

uv run python diffbot_import.py

Configuration in file:

  • Edit line 483 to switch between organizations/persons
  • SKIP_ENTITIES - Number of entities to skip (line 432)
  • BATCH_SIZE - Batch size for imports (line 433)

Requirements:

  • Large JSON files at $DATA_DIR/organizations.json or $DATA_DIR/persons.json (default: ./data/)
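The batched import pattern behind SKIP_ENTITIES/BATCH_SIZE can be sketched roughly as follows. This is a sketch, not the script itself: the property names, the JSON-lines reading, and the Cypher are assumptions (the real export may be one large JSON document).

```python
import json
from itertools import islice

BATCH_SIZE = 1000  # mirrors BATCH_SIZE in diffbot_import.py (actual value may differ)

def batches(iterable, size=BATCH_SIZE):
    """Yield lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

# One UNWIND statement per batch keeps each transaction small.
IMPORT_QUERY = """
UNWIND $rows AS row
MERGE (o:Organization {id: row.id})
SET o.name = row.name
"""

def import_file(session, path):
    """Stream a large export file and write it to Neo4j in batches (sketch)."""
    with open(path) as f:
        rows = (json.loads(line) for line in f)
        for chunk in batches(rows):
            session.run(IMPORT_QUERY, rows=chunk)
```

Streaming with a generator avoids loading a multi-GB export into memory at once.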

2. Fetch and Import Articles

Fetch recent articles for top organizations and generate embeddings:

uv run python diffbot_articles.py

Configuration in file:

  • Line 195: update_articles(10000, 3) - Top 10K orgs, 3 days lookback
  • CONCURRENCY_LIMIT - Concurrent API requests (line 14)
  • MAX_RETRIES - Retry attempts for failed requests (line 15)
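The CONCURRENCY_LIMIT/MAX_RETRIES pattern is a standard asyncio semaphore plus retry loop. A hedged sketch of that pattern (values and the backoff policy are assumptions, not the script's exact logic):

```python
import asyncio

CONCURRENCY_LIMIT = 10  # value in diffbot_articles.py may differ
MAX_RETRIES = 3

semaphore = asyncio.Semaphore(CONCURRENCY_LIMIT)

async def fetch_with_retry(fetch, url, base_delay=1.0):
    """Run `fetch(url)` under the concurrency cap, retrying failures
    with exponential backoff."""
    async with semaphore:
        for attempt in range(MAX_RETRIES):
            try:
                return await fetch(url)
            except Exception:
                if attempt == MAX_RETRIES - 1:
                    raise  # out of retries
                await asyncio.sleep(base_delay * 2 ** attempt)
```

The semaphore caps in-flight Diffbot API requests while still letting many coroutines be scheduled at once.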

3. Export Entities for OpenAI Embeddings

Export entity data to JSONL format for batch processing:

uv run python export_openai_json.py

Configuration in file:

  • TYPE - Entity type: "Person" or "Organization" (line 6)
  • SKIP - Number of records to skip (line 7)
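Each JSONL line follows the OpenAI Batch API input format: a `custom_id`, a method, an endpoint URL, and a request body. A minimal sketch of producing one such line (the model name and the use of the entity id as `custom_id` are assumptions):

```python
import json

MODEL = "text-embedding-3-small"  # model choice is an assumption

def batch_line(entity_id: str, text: str) -> str:
    """One JSONL request line in the OpenAI Batch API input format."""
    return json.dumps({
        "custom_id": entity_id,   # echoed back in the batch output file
        "method": "POST",
        "url": "/v1/embeddings",
        "body": {"model": MODEL, "input": text},
    })

line = batch_line("org-123", "Acme Corp, a software company")
```

Writing one of these lines per entity yields a file ready for `upload_openai.py upload`.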

4. Process Embeddings via OpenAI Batch API

Upload, monitor, and update embeddings using OpenAI's Batch API:

# Upload a JSONL file
uv run python upload_openai.py upload /path/to/file.jsonl

# List recent batches
uv run python upload_openai.py list 10

# Check batch status
uv run python upload_openai.py status <batch_id>

# Download completed batch
uv run python upload_openai.py download <batch_id> output.jsonl

# Export entities to JSONL
uv run python upload_openai.py export Person 0 50000

# Upload and export all in one command
uv run python upload_openai.py all Person 0 100000

# Update Neo4j with completed embeddings
uv run python upload_openai.py update <batch_id> Person

# Update by skip value
uv run python upload_openai.py update_by_skip 0 Person

# Cancel batch
uv run python upload_openai.py cancel <batch_id>
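The `update` commands map completed embeddings back onto Neo4j nodes. A sketch of parsing a downloaded batch output file, assuming the documented Batch API output-line shape (the helper name is ours, not the script's):

```python
import json

def parse_batch_output(lines):
    """Yield (custom_id, embedding) pairs from an OpenAI Batch output file,
    skipping lines whose request failed."""
    for line in lines:
        rec = json.loads(line)
        if rec.get("error"):
            continue  # failed request; no embedding to store
        body = rec["response"]["body"]
        yield rec["custom_id"], body["data"][0]["embedding"]
```

Each pair can then be written back with a Cypher `SET` on the matching node, keyed by `custom_id`.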

Environment Variables

| Variable         | Description              | Default                              |
|------------------|--------------------------|--------------------------------------|
| DIFFBOT_TOKEN    | Diffbot API token        | Required                             |
| OPENAI_API_TOKEN | OpenAI API token         | Required                             |
| NEO4J_URI        | Neo4j connection URI     | neo4j+s://diffbot.neo4jlabs.com:7687 |
| NEO4J_USERNAME   | Neo4j username           | neo4j                                |
| NEO4J_PASSWORD   | Neo4j password           | Required                             |
| DATA_DIR         | Directory for data files | ./data                               |

Neo4j Database Configuration

The scripts automatically create constraints and indexes for:

  • Organizations, Persons, Classifications
  • Places, Countries, Technographics
  • Articles, Tags, Categories, Chunks
  • Investment Series, Revenue Years, SEC Forms
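Creating those constraints idempotently typically looks like the sketch below; the `id` property key and constraint names are assumptions, not taken from the scripts.

```python
# A few of the labels listed above; the scripts cover more.
LABELS = ["Organization", "Person", "Classification", "Article", "Chunk"]

def constraint_stmt(label: str) -> str:
    """Cypher for a uniqueness constraint; IF NOT EXISTS makes re-runs safe."""
    return (
        f"CREATE CONSTRAINT {label.lower()}_id IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.id IS UNIQUE"
    )

stmts = [constraint_stmt(label) for label in LABELS]
```

Each statement can be run once per session at startup; `IF NOT EXISTS` keeps repeated runs harmless.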

Data Flow

  1. Import base data: diffbot_import.py loads Organizations/Persons
  2. Fetch articles: diffbot_articles.py retrieves recent articles for entities
  3. Export for embeddings: export_openai_json.py prepares data for OpenAI
  4. Process embeddings: upload_openai.py handles batch processing and storage

Notes

  • Article embeddings are processed using Neo4j's GenAI integration
  • Large JSON files should be placed in the data directory (configured via DATA_DIR)
  • Batch processing is more cost-effective for large embedding operations
  • The project uses async/await for efficient API interactions
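The GenAI integration mentioned above lets Neo4j call the embedding provider from inside Cypher via `genai.vector.encode()`. A hedged sketch of such a query (the node/property names and batching are assumptions):

```python
# Cypher sketch: embed un-embedded chunks in-place using Neo4j's GenAI
# plugin; c.text / c.embedding are assumed property names.
EMBED_CHUNKS = """
MATCH (c:Chunk) WHERE c.embedding IS NULL
WITH c LIMIT $batch
SET c.embedding = genai.vector.encode(c.text, 'OpenAI', {token: $token})
"""
```

Running this in a loop with a small `$batch` keeps each write transaction bounded.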

License

See LICENSE file for details.
