This project aims to provide a comprehensive framework for semantic entity resolution, enabling the identification and disambiguation of entities across various data sources. It is based on the blog post "The Rise of Semantic Entity Resolution".
Phase 1 - Semantic Blocking
- Semantic Clustering - Clusters records using sentence embeddings, grouping them into blocks so that the quadratic cost of pairwise comparison is paid only within each block rather than across the whole dataset.
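To make the effect of blocking concrete, here is a minimal sketch in plain Python. It is not SERF's implementation: the `greedy_blocks` and `candidate_pairs` helpers, the 0.9 threshold, and the 2-d toy vectors are all assumptions made for illustration, and the greedy pass stands in for whatever clustering algorithm the project actually uses.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def greedy_blocks(embeddings, threshold=0.9):
    """Put each record in the first block whose seed vector is similar enough.

    Hypothetical helper: a simple stand-in for a real clustering algorithm.
    """
    blocks = []  # list of (seed_vector, [record indices])
    for idx, vec in enumerate(embeddings):
        for seed, members in blocks:
            if cosine(seed, vec) >= threshold:
                members.append(idx)
                break
        else:
            # No existing block was close enough, so start a new one.
            blocks.append((vec, [idx]))
    return [members for _, members in blocks]


def candidate_pairs(blocks):
    """Only records that share a block are compared with each other."""
    return [pair for members in blocks for pair in combinations(members, 2)]


# Toy 2-d "embeddings"; real sentence embeddings have hundreds of dimensions.
toy_embeddings = [
    (1.0, 0.0),
    (0.99, 0.05),
    (0.0, 1.0),
    (0.05, 0.99),
    (0.98, 0.1),
]
blocks = greedy_blocks(toy_embeddings)
pairs = candidate_pairs(blocks)
all_pairs = len(toy_embeddings) * (len(toy_embeddings) - 1) // 2
print(f"{len(pairs)} within-block pairs instead of {all_pairs}")
```

With five records the saving is small, but because the number of pairs grows quadratically with block size, keeping blocks small matters far more on real datasets.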
Phase 2 - Schema Alignment, Matching and Merging with Large Language Models
- Schema Alignment - Aligns schemas of common entities with different formats
- Entity Matching - Matches entire blocks of records
- Entity Merging - Merges matched entities in entire blocks of records
- Match Evaluation - Evaluates the quality of matches using various metrics
Schema alignment, matching, and merging all occur in a single prompt guided by metadata from DSPy signatures, in BAML format, using Google Gemini models.
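The README does not say which metrics the match evaluation step uses. As one common possibility, the sketch below computes pairwise precision, recall, and F1 by comparing predicted clusters of record IDs against ground-truth clusters. The function names and the toy record IDs are hypothetical and not part of SERF.

```python
from itertools import combinations


def cluster_pairs(clusters):
    """Expand clusters of record IDs into the set of unordered matched pairs."""
    pairs = set()
    for cluster in clusters:
        # Sorting makes (a, b) and (b, a) collapse to the same tuple.
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_scores(predicted, truth):
    """Pairwise precision, recall, and F1 of predicted vs. true clusters."""
    pred_pairs = cluster_pairs(predicted)
    true_pairs = cluster_pairs(truth)
    true_positives = len(pred_pairs & true_pairs)
    precision = true_positives / len(pred_pairs) if pred_pairs else 0.0
    recall = true_positives / len(true_pairs) if true_pairs else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example: the matcher wrongly merged "c" into the {"a", "b"} entity.
predicted = [{"a", "b", "c"}, {"d"}]
truth = [{"a", "b"}, {"c", "d"}]
scores = pairwise_scores(predicted, truth)
print(scores)
```

Pairwise metrics penalize over-merging (lower precision) and under-merging (lower recall) separately, which helps when tuning how aggressively blocks are matched.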
Phase 3 - Edge Resolution
- Edge Resolution - Merging nodes results in duplicate edges, which this phase resolves.
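To show why merging nodes produces duplicate edges, here is a small sketch under stated assumptions: edges are plain `(source, relation, target)` tuples and `merged_into` maps each merged-away node ID to its surviving node. Both the data model and the `resolve_edges` helper are hypothetical. SERF's actual edge resolution may do more, such as reconciling edge properties.

```python
def resolve_edges(edges, merged_into):
    """Rewrite edge endpoints to surviving node IDs and drop exact duplicates.

    edges: iterable of (source, relation, target) tuples.
    merged_into: dict mapping a merged-away node ID to its surviving node ID.
    """
    seen = set()
    resolved = []
    for src, rel, dst in edges:
        # Nodes that were not merged keep their own ID.
        edge = (merged_into.get(src, src), rel, merged_into.get(dst, dst))
        if edge not in seen:
            seen.add(edge)
            resolved.append(edge)
    return resolved


# Two records for the same company were merged into "acme_inc".
edges = [
    ("acme_inc", "EMPLOYS", "alice"),
    ("acme_incorporated", "EMPLOYS", "alice"),
    ("acme_incorporated", "LOCATED_IN", "nyc"),
]
merged_into = {"acme_incorporated": "acme_inc"}
resolved = resolve_edges(edges, merged_into)
print(resolved)
```

After remapping, the two `EMPLOYS` edges become identical and collapse into one, while the `LOCATED_IN` edge is simply re-pointed at the surviving node.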
- Python 3.12
- Poetry for dependency management - see POETRY.md for installation instructions
- Java 11/17 (for Apache Spark)
- Apache Spark 3.5.5+
- 4GB+ RAM recommended (for Spark processing)
- Clone the repository:
git clone https://github.com/Graphlet-AI/serf.git
cd serf
- Create a conda / virtual environment:
In conda:
conda create -n serf python=3.12
conda activate serf
With venv:
python -m venv venv
source venv/bin/activate
- Install dependencies:
poetry install
- Install pre-commit checks:
pre-commit install
The SERF CLI provides commands for running the entity resolution pipeline:
$ serf --help
Usage: serf [OPTIONS] COMMAND [ARGS]...
SERF: Semantic Entity Resolution Framework CLI.
Options:
--version Show the version and exit.
--help Show this message and exit.
Commands:
block Perform semantic blocking on input data.
edges Resolve edges after node merging.
match Align schemas, match entities, and merge within blocks.
The easiest way to get started with SERF is using Docker and docker compose. This ensures a consistent development environment.