# diffbot-neo4j-import

Import Diffbot knowledge graph data into Neo4j and process it with OpenAI embeddings.

## Features

This project provides tools to:
- Import Diffbot knowledge graph data (Organizations and Persons) into Neo4j
- Fetch and import articles from Diffbot API based on entity tags
- Generate and store OpenAI embeddings for entities and article chunks
- Process embeddings via OpenAI Batch API for cost efficiency
## Prerequisites

- Python 3.13+
- uv package manager
- Neo4j database instance
- Diffbot API token
- OpenAI API token
## Installation

1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd diffbot-neo4j-import
   ```

2. Install dependencies using uv:

   ```bash
   uv sync
   ```

3. Create a `.env` file from the example:

   ```bash
   cp .env.example .env
   ```

4. Edit `.env` and add your credentials:

   ```
   DIFFBOT_TOKEN=your_diffbot_token_here
   OPENAI_API_TOKEN=your_openai_api_token_here
   NEO4J_URI=neo4j+s://your-neo4j-instance.com:7687
   NEO4J_USERNAME=neo4j
   NEO4J_PASSWORD=your_neo4j_password_here
   ```

## Scripts

- `diffbot_import.py` - Import Organizations and Persons from Diffbot JSON exports to Neo4j
- `diffbot_articles.py` - Fetch articles from the Diffbot API and import them with embeddings
- `export_openai_json.py` - Export entity data to JSONL format for the OpenAI Batch API
- `upload_openai.py` - Upload and process embeddings via the OpenAI Batch API
- `main.py` - Simple hello-world entry point
## Downloading Data

Before importing data, download the Diffbot knowledge graph exports. Create a data directory and fetch the data:

```bash
# Create the data directory (default: ./data)
mkdir -p ./data

# Download organizations (all orgs with importance > 1)
curl "https://kg.diffbot.com/kg/v3/dql?type=query&token=<DIFFBOT_TOKEN>&query=type%3AOrganization+importance%3E1&size=-1" > ./data/organizations.json

# Download persons (optional, similar query)
curl "https://kg.diffbot.com/kg/v3/dql?type=query&token=<DIFFBOT_TOKEN>&query=type%3APerson+importance%3E1&size=-1" > ./data/persons.json
```

Replace `<DIFFBOT_TOKEN>` with your actual Diffbot API token.

**Note:** These files can be very large (several GB), so ensure you have sufficient disk space. You can configure a different data directory by setting `DATA_DIR` in your `.env` file.
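The curl commands above percent-encode the DQL query by hand. As a convenience, the same URLs can be built programmatically; this is a small sketch (the `build_dql_url` helper name is ours, not part of the project), with the base URL and query strings taken from the commands above:

```python
from urllib.parse import urlencode

BASE_URL = "https://kg.diffbot.com/kg/v3/dql"

def build_dql_url(token: str, query: str, size: int = -1) -> str:
    """Return a Diffbot DQL export URL with the query percent-encoded."""
    # urlencode handles the ':' -> %3A, '>' -> %3E and space -> '+' escaping
    # that the curl examples spell out by hand.
    params = {"type": "query", "token": token, "query": query, "size": size}
    return f"{BASE_URL}?{urlencode(params)}"

url = build_dql_url("your_token", "type:Organization importance>1")
```

`size=-1` requests the full export, matching the curl examples.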
## Usage

### Importing entities

Import organizations or persons from Diffbot JSON exports:

```bash
uv run python diffbot_import.py
```

Configuration in file:

- Edit line 483 to switch between organizations and persons
- `SKIP_ENTITIES` - number of entities to skip (line 432)
- `BATCH_SIZE` - batch size for imports (line 433)

Requirements:

- Large JSON files at `$DATA_DIR/organizations.json` or `$DATA_DIR/persons.json` (default: `./data/`)
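The skip/batch pattern those constants control can be sketched as follows. This is an illustration only: it assumes the export is one JSON object per line, and the function names are hypothetical stand-ins for what `diffbot_import.py` actually does:

```python
import json
from itertools import islice
from typing import Iterable, Iterator

SKIP_ENTITIES = 0   # mirrors the constant on line 432 of diffbot_import.py
BATCH_SIZE = 1000   # mirrors the constant on line 433

def read_entities(path: str) -> Iterator[dict]:
    """Stream entities one line at a time so multi-GB files never load fully."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def batched(items: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield fixed-size lists of entities; the final batch may be shorter."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

def import_entities(path: str) -> None:
    entities = islice(read_entities(path), SKIP_ENTITIES, None)
    for batch in batched(entities, BATCH_SIZE):
        # The real script would send each batch to Neo4j (e.g. via a single
        # parameterized UNWIND query); here we only report progress.
        print(f"imported {len(batch)} entities")
```

Batching keeps each Neo4j transaction bounded, and `SKIP_ENTITIES` makes restarts after a failure cheap.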
### Fetching articles

Fetch recent articles for the top organizations and generate embeddings:

```bash
uv run python diffbot_articles.py
```

Configuration in file:

- Line 195: `update_articles(10000, 3)` - top 10K organizations, 3-day lookback
- `CONCURRENCY_LIMIT` - concurrent API requests (line 14)
- `MAX_RETRIES` - retry attempts for failed requests (line 15)
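The concurrency-limit and retry settings above follow a common asyncio pattern, sketched here under our own names (`fetch_all` and the `fetch` callable are illustrative, not the script's actual API):

```python
import asyncio

CONCURRENCY_LIMIT = 5  # mirrors the constant on line 14 of diffbot_articles.py
MAX_RETRIES = 3        # mirrors the constant on line 15

async def fetch_all(fetch, urls: list[str]) -> list:
    """Fetch every URL with at most CONCURRENCY_LIMIT in flight, retrying failures."""
    sem = asyncio.Semaphore(CONCURRENCY_LIMIT)

    async def fetch_one(url: str):
        async with sem:  # cap concurrent requests to the API
            for attempt in range(1, MAX_RETRIES + 1):
                try:
                    return await fetch(url)
                except Exception:
                    if attempt == MAX_RETRIES:
                        raise
                    await asyncio.sleep(2 ** attempt)  # exponential backoff

    return await asyncio.gather(*(fetch_one(u) for u in urls))
```

The semaphore bounds in-flight requests without serializing them, and backoff gives transient API errors time to clear.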
### Exporting for embeddings

Export entity data to JSONL format for batch processing:

```bash
uv run python export_openai_json.py
```

Configuration in file:

- `TYPE` - entity type: "Person" or "Organization" (line 6)
- `SKIP` - number of records to skip (line 7)
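Each line of the exported JSONL is one Batch API request. A sketch of what such a line looks like for the `/v1/embeddings` endpoint (the `custom_id` scheme and model name here are illustrative assumptions, not necessarily what `export_openai_json.py` uses):

```python
import json

def embedding_request(entity_id: str, text: str,
                      model: str = "text-embedding-3-small") -> str:
    """One JSONL line in the OpenAI Batch API request format."""
    request = {
        "custom_id": entity_id,  # echoed back in the batch output for matching
        "method": "POST",
        "url": "/v1/embeddings",
        "body": {"model": model, "input": text},
    }
    return json.dumps(request)

def write_jsonl(path: str, rows: list[tuple[str, str]]) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for entity_id, text in rows:
            f.write(embedding_request(entity_id, text) + "\n")
```

Using the entity's Neo4j id as `custom_id` makes it straightforward to write the returned embeddings back to the right nodes.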
### Processing embeddings

Upload, monitor, and update embeddings using OpenAI's Batch API:

```bash
# Upload a JSONL file
uv run python upload_openai.py upload /path/to/file.jsonl

# List recent batches
uv run python upload_openai.py list 10

# Check batch status
uv run python upload_openai.py status <batch_id>

# Download completed batch
uv run python upload_openai.py download <batch_id> output.jsonl

# Export entities to JSONL
uv run python upload_openai.py export Person 0 50000

# Upload and export all in one command
uv run python upload_openai.py all Person 0 100000

# Update Neo4j with completed embeddings
uv run python upload_openai.py update <batch_id> Person

# Update by skip value
uv run python upload_openai.py update_by_skip 0 Person

# Cancel a batch
uv run python upload_openai.py cancel <batch_id>
```

## Environment Variables

| Variable | Description | Default |
|---|---|---|
| `DIFFBOT_TOKEN` | Diffbot API token | Required |
| `OPENAI_API_TOKEN` | OpenAI API token | Required |
| `NEO4J_URI` | Neo4j connection URI | `neo4j+s://diffbot.neo4jlabs.com:7687` |
| `NEO4J_USERNAME` | Neo4j username | `neo4j` |
| `NEO4J_PASSWORD` | Neo4j password | Required |
| `DATA_DIR` | Directory for data files | `./data` |
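The `download`/`update` steps hinge on parsing the batch output file back into per-entity embeddings. A hedged sketch, assuming the documented Batch API output shape (one JSON object per line, with the `/v1/embeddings` response under `response.body`); the function name is ours:

```python
import json

def load_embeddings(lines) -> dict[str, list[float]]:
    """Map custom_id -> embedding from OpenAI Batch API output lines."""
    embeddings: dict[str, list[float]] = {}
    for line in lines:
        if not line.strip():
            continue
        row = json.loads(line)
        response = row.get("response") or {}
        if response.get("status_code") != 200:
            continue  # skip failed or errored requests
        embeddings[row["custom_id"]] = response["body"]["data"][0]["embedding"]
    return embeddings

# Usage: load_embeddings(open("output.jsonl", encoding="utf-8"))
```

Since `custom_id` carries the entity id from the export step, the resulting dict can be written straight back to Neo4j.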
## Database Schema

The scripts automatically create constraints and indexes for:
- Organizations, Persons, Classifications
- Places, Countries, Technographics
- Articles, Tags, Categories, Chunks
- Investment Series, Revenue Years, SEC Forms
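For illustration, the idempotent setup the scripts perform amounts to one statement of this kind per label (a sketch only: the label subset, the `id` key, and the helper are our assumptions; the exact statements live in the scripts themselves):

```python
# Hypothetical subset of the labels listed above.
LABELS = ["Organization", "Person", "Article", "Chunk", "Tag"]

def constraint_statements(labels: list[str]) -> list[str]:
    """One uniqueness constraint per label; IF NOT EXISTS makes re-runs safe."""
    return [
        f"CREATE CONSTRAINT IF NOT EXISTS FOR (n:{label}) REQUIRE n.id IS UNIQUE"
        for label in labels
    ]
```

Uniqueness constraints also back each label with an index, which keeps the `MERGE`-heavy import queries fast.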
## Workflow

1. Import base data: `diffbot_import.py` loads Organizations and Persons
2. Fetch articles: `diffbot_articles.py` retrieves recent articles for entities
3. Export for embeddings: `export_openai_json.py` prepares data for OpenAI
4. Process embeddings: `upload_openai.py` handles batch processing and storage
## Notes

- Article embeddings are processed using Neo4j's GenAI integration
- Large JSON files should be placed in the data directory (configured via `DATA_DIR`)
- Batch processing is more cost-effective for large embedding operations
- The project uses async/await for efficient API interactions
## License

See the LICENSE file for details.