# diffbot-neo4j-import

Import Diffbot knowledge graph data into Neo4j and process it with OpenAI embeddings.

## Features

This project provides tools to:
- Import Diffbot knowledge graph data (Organizations and Persons) into Neo4j
- Fetch and import articles from Diffbot API based on entity tags
- Generate and store OpenAI embeddings for entities and article chunks
- Process embeddings via OpenAI Batch API for cost efficiency
## Prerequisites

- Python 3.13+
- uv package manager
- Neo4j database instance
- Diffbot API token
- OpenAI API token
## Installation

1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd diffbot-neo4j-import
   ```

2. Install dependencies using uv:

   ```bash
   uv sync
   ```

3. Create a `.env` file from the example:

   ```bash
   cp .env.example .env
   ```

4. Edit `.env` and add your credentials:

   ```
   DIFFBOT_TOKEN=your_diffbot_token_here
   OPENAI_API_TOKEN=your_openai_api_token_here
   NEO4J_URI=neo4j+s://your-neo4j-instance.com:7687
   NEO4J_USERNAME=neo4j
   NEO4J_PASSWORD=your_neo4j_password_here
   ```

## Scripts

- `diffbot_import.py` - Import Organizations and Persons from Diffbot JSON exports to Neo4j
- `diffbot_articles.py` - Fetch articles from the Diffbot API and import them with embeddings
- `export_openai_json.py` - Export entity data to JSONL format for the OpenAI Batch API
- `upload_openai.py` - Upload and process embeddings via the OpenAI Batch API
- `main.py` - Simple hello-world entry point
## Downloading Data

Before importing data, download the Diffbot knowledge graph exports. Create a data directory and fetch the data:

```bash
# Create the data directory (default: ./data)
mkdir -p ./data

# Download organizations (all orgs with importance > 1)
curl "https://kg.diffbot.com/kg/v3/dql?type=query&token=<DIFFBOT_TOKEN>&query=type%3AOrganization+importance%3E1&size=-1" > ./data/organizations.json

# Download persons (optional, similar query)
curl "https://kg.diffbot.com/kg/v3/dql?type=query&token=<DIFFBOT_TOKEN>&query=type%3APerson+importance%3E1&size=-1" > ./data/persons.json
```

Replace `<DIFFBOT_TOKEN>` with your actual Diffbot API token.

**Note:** These files can be very large (several GB), so ensure you have sufficient disk space. You can configure a different data directory by setting `DATA_DIR` in your `.env` file.
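The curl commands above percent-encode the DQL query by hand. As a convenience, the same URLs can be built programmatically; this is a small sketch (the `build_dql_url` helper name is ours, not part of the project), with the base URL and query strings taken from the commands above:

```python
from urllib.parse import urlencode

BASE_URL = "https://kg.diffbot.com/kg/v3/dql"

def build_dql_url(token: str, query: str, size: int = -1) -> str:
    """Return a Diffbot DQL export URL with the query percent-encoded."""
    # urlencode handles the ':' -> %3A, '>' -> %3E and space -> '+' escaping
    # that the curl examples spell out by hand.
    params = {"type": "query", "token": token, "query": query, "size": size}
    return f"{BASE_URL}?{urlencode(params)}"

url = build_dql_url("your_token", "type:Organization importance>1")
```

`size=-1` requests the full export, matching the curl examples.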
## Usage

### Importing entities

Import organizations or persons from Diffbot JSON exports:

```bash
uv run python diffbot_import.py
```

Configuration in file:

- Edit line 483 to switch between organizations and persons
- `SKIP_ENTITIES` - number of entities to skip (line 432)
- `BATCH_SIZE` - batch size for imports (line 433)

Requirements:

- Large JSON files at `$DATA_DIR/organizations.json` or `$DATA_DIR/persons.json` (default: `./data/`)
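The skip/batch pattern those constants control can be sketched as follows. This is an illustration only: it assumes the export is one JSON object per line, and the function names are hypothetical stand-ins for what `diffbot_import.py` actually does:

```python
import json
from itertools import islice
from typing import Iterable, Iterator

SKIP_ENTITIES = 0   # mirrors the constant on line 432 of diffbot_import.py
BATCH_SIZE = 1000   # mirrors the constant on line 433

def read_entities(path: str) -> Iterator[dict]:
    """Stream entities one line at a time so multi-GB files never load fully."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def batched(items: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield fixed-size lists of entities; the final batch may be shorter."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

def import_entities(path: str) -> None:
    entities = islice(read_entities(path), SKIP_ENTITIES, None)
    for batch in batched(entities, BATCH_SIZE):
        # The real script would send each batch to Neo4j (e.g. via a single
        # parameterized UNWIND query); here we only report progress.
        print(f"imported {len(batch)} entities")
```

Batching keeps each Neo4j transaction bounded, and `SKIP_ENTITIES` makes restarts after a failure cheap.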
### Fetching articles

Fetch recent articles for the top organizations and generate embeddings:

```bash
uv run python diffbot_articles.py
```

Configuration in file:

- Line 195: `update_articles(10000, 3)` - top 10K organizations, 3-day lookback
- `CONCURRENCY_LIMIT` - concurrent API requests (line 14)
- `MAX_RETRIES` - retry attempts for failed requests (line 15)
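The concurrency-limit and retry settings above follow a common asyncio pattern, sketched here under our own names (`fetch_all` and the `fetch` callable are illustrative, not the script's actual API):

```python
import asyncio

CONCURRENCY_LIMIT = 5  # mirrors the constant on line 14 of diffbot_articles.py
MAX_RETRIES = 3        # mirrors the constant on line 15

async def fetch_all(fetch, urls: list[str]) -> list:
    """Fetch every URL with at most CONCURRENCY_LIMIT in flight, retrying failures."""
    sem = asyncio.Semaphore(CONCURRENCY_LIMIT)

    async def fetch_one(url: str):
        async with sem:  # cap concurrent requests to the API
            for attempt in range(1, MAX_RETRIES + 1):
                try:
                    return await fetch(url)
                except Exception:
                    if attempt == MAX_RETRIES:
                        raise
                    await asyncio.sleep(2 ** attempt)  # exponential backoff

    return await asyncio.gather(*(fetch_one(u) for u in urls))
```

The semaphore bounds in-flight requests without serializing them, and backoff gives transient API errors time to clear.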
### Exporting for embeddings

Export entity data to JSONL format for batch processing:

```bash
uv run python export_openai_json.py
```

Configuration in file:

- `TYPE` - entity type: "Person" or "Organization" (line 6)
- `SKIP` - number of records to skip (line 7)
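Each line of the exported JSONL is one Batch API request. A sketch of what such a line looks like for the `/v1/embeddings` endpoint (the `custom_id` scheme and model name here are illustrative assumptions, not necessarily what `export_openai_json.py` uses):

```python
import json

def embedding_request(entity_id: str, text: str,
                      model: str = "text-embedding-3-small") -> str:
    """One JSONL line in the OpenAI Batch API request format."""
    request = {
        "custom_id": entity_id,  # echoed back in the batch output for matching
        "method": "POST",
        "url": "/v1/embeddings",
        "body": {"model": model, "input": text},
    }
    return json.dumps(request)

def write_jsonl(path: str, rows: list[tuple[str, str]]) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for entity_id, text in rows:
            f.write(embedding_request(entity_id, text) + "\n")
```

Using the entity's Neo4j id as `custom_id` makes it straightforward to write the returned embeddings back to the right nodes.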
### Processing embeddings

Upload, monitor, and update embeddings using OpenAI's Batch API:

```bash
# Upload a JSONL file
uv run python upload_openai.py upload /path/to/file.jsonl

# List recent batches
uv run python upload_openai.py list 10

# Check batch status
uv run python upload_openai.py status <batch_id>

# Download completed batch
uv run python upload_openai.py download <batch_id> output.jsonl

# Export entities to JSONL
uv run python upload_openai.py export Person 0 50000

# Upload and export all in one command
uv run python upload_openai.py all Person 0 100000

# Update Neo4j with completed embeddings
uv run python upload_openai.py update <batch_id> Person

# Update by skip value
uv run python upload_openai.py update_by_skip 0 Person

# Cancel a batch
uv run python upload_openai.py cancel <batch_id>
```

## Environment Variables

| Variable | Description | Default |
|---|---|---|
| `DIFFBOT_TOKEN` | Diffbot API token | Required |
| `OPENAI_API_TOKEN` | OpenAI API token | Required |
| `NEO4J_URI` | Neo4j connection URI | `neo4j+s://diffbot.neo4jlabs.com:7687` |
| `NEO4J_USERNAME` | Neo4j username | `neo4j` |
| `NEO4J_PASSWORD` | Neo4j password | Required |
| `DATA_DIR` | Directory for data files | `./data` |
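The `download`/`update` steps hinge on parsing the batch output file back into per-entity embeddings. A hedged sketch, assuming the documented Batch API output shape (one JSON object per line, with the `/v1/embeddings` response under `response.body`); the function name is ours:

```python
import json

def load_embeddings(lines) -> dict[str, list[float]]:
    """Map custom_id -> embedding from OpenAI Batch API output lines."""
    embeddings: dict[str, list[float]] = {}
    for line in lines:
        if not line.strip():
            continue
        row = json.loads(line)
        response = row.get("response") or {}
        if response.get("status_code") != 200:
            continue  # skip failed or errored requests
        embeddings[row["custom_id"]] = response["body"]["data"][0]["embedding"]
    return embeddings

# Usage: load_embeddings(open("output.jsonl", encoding="utf-8"))
```

Since `custom_id` carries the entity id from the export step, the resulting dict can be written straight back to Neo4j.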
## Database Schema

The scripts automatically create constraints and indexes for:
- Organizations, Persons, Classifications
- Places, Countries, Technographics
- Articles, Tags, Categories, Chunks
- Investment Series, Revenue Years, SEC Forms
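For illustration, the idempotent setup the scripts perform amounts to one statement of this kind per label (a sketch only: the label subset, the `id` key, and the helper are our assumptions; the exact statements live in the scripts themselves):

```python
# Hypothetical subset of the labels listed above.
LABELS = ["Organization", "Person", "Article", "Chunk", "Tag"]

def constraint_statements(labels: list[str]) -> list[str]:
    """One uniqueness constraint per label; IF NOT EXISTS makes re-runs safe."""
    return [
        f"CREATE CONSTRAINT IF NOT EXISTS FOR (n:{label}) REQUIRE n.id IS UNIQUE"
        for label in labels
    ]
```

Uniqueness constraints also back each label with an index, which keeps the `MERGE`-heavy import queries fast.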
## Workflow

1. Import base data: `diffbot_import.py` loads Organizations and Persons
2. Fetch articles: `diffbot_articles.py` retrieves recent articles for entities
3. Export for embeddings: `export_openai_json.py` prepares data for OpenAI
4. Process embeddings: `upload_openai.py` handles batch processing and storage
## Notes

- Article embeddings are processed using Neo4j's GenAI integration
- Large JSON files should be placed in the data directory (configured via `DATA_DIR`)
- Batch processing is more cost-effective for large embedding operations
- The project uses async/await for efficient API interactions
## License

See the LICENSE file for details.