Entity Resolution POC

Research project testing whether fine-tuned Matryoshka embeddings beat BM25 for structured people-record matching at 500M scale.

New here? Read docs/UNDERSTANDING.md — it's the single-source-of-truth covering everything about this project.

Documentation

Doc	Purpose
`docs/UNDERSTANDING.md`	Complete project overview — read this first
`docs/research-design.md`	Hypotheses, architecture, ablation plan, timeline
`docs/dataset-design.md`	Full data spec — schema, corruptions, triplets, eval sets
`docs/evaluation-protocol.md`	Metric definitions, latency methodology, result format
`docs/decisions.md`	5 Architecture Decision Records (ADRs)
`docs/experiment-log.md`	Per-experiment tracking template

The Problem

The production system uses BM25 to match people across 500M contact records. It fails on dirty data. Lexical token overlap breaks on abbreviated names, typos, missing fields, and swapped email domains.

This project tests whether a dense retriever, fine-tuned on our corruption distribution, improves recall without exceeding production memory constraints.
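
To make the failure mode concrete, here is a small sketch of how token overlap, the signal lexical scoring builds on, shrinks when a record is corrupted. The two records, the `tokenize` helper, and the use of Jaccard overlap are illustrative assumptions; this is not the production BM25 scoring function.

```python
import re

def tokenize(record: str) -> set[str]:
    """Lowercase the record and split on any non-alphanumeric character."""
    return {tok for tok in re.split(r"[^a-z0-9]+", record.lower()) if tok}

def jaccard(a: set[str], b: set[str]) -> float:
    """Shared tokens divided by all distinct tokens across both records."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical clean record and a corrupted variant of the same person:
# abbreviated first name, transposed letters, swapped email domain, truncated company.
clean = "Jonathan Smith | jonathan.smith@acme.com | Acme Corp"
dirty = "Jon Smtih | jon.smith@gmail.com | Acme"

clean_tokens = tokenize(clean)
dirty_tokens = tokenize(dirty)
shared = clean_tokens & dirty_tokens
overlap = jaccard(clean_tokens, dirty_tokens)
print(sorted(shared), overlap)
```

Only three of the eight distinct tokens survive the corruption, and none of them is the first name, which is the kind of gap a dense retriever is meant to close.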

Approach

Pure embedding retrieval vs BM25: No hybrid approach for baseline comparisons.
Models: Six models evaluated: one lexical baseline and five embedding models ranging from 22M to 600M parameters.
Training: Fine-tuning uses Matryoshka Representation Learning (MRL) and Multiple Negatives Ranking Loss (MNRL) on synthetic triplets. Corruptions mirror production errors.
Inference: Two-stage retrieval. Binary 64-dim HNSW for ANN (4GB for 500M records), followed by full FP32 768-dim re-rank on top-100 candidates.
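
The two-stage inference path can be sketched in a few lines. This is a toy illustration under stated assumptions: a brute-force Hamming scan stands in for the HNSW index, binary codes are taken as the signs of the first 64 dimensions, and the function names (`binarize`, `two_stage_search`) are hypothetical rather than taken from this repository. The footprint line counts raw codes only and ignores graph overhead.

```python
import random

def binarize(vec, dim=64):
    """Keep the first `dim` values (Matryoshka prefix) and pack their signs into one int."""
    code = 0
    for i, x in enumerate(vec[:dim]):
        if x > 0:
            code |= 1 << i
    return code

def hamming(a, b):
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def two_stage_search(query, corpus, codes, shortlist=100, top_k=10):
    """Stage 1: shortlist by Hamming distance on binary codes (stand-in for ANN).
    Stage 2: re-rank the shortlist by full-precision dot product."""
    q_code = binarize(query)
    candidates = sorted(range(len(corpus)), key=lambda i: hamming(q_code, codes[i]))[:shortlist]
    reranked = sorted(candidates, key=lambda i: dot(query, corpus[i]), reverse=True)
    return reranked[:top_k]

# Raw code storage: 64 bits = 8 bytes per record.
index_bytes = 500_000_000 * 64 // 8

# Toy corpus of random 768-dim vectors; the query is a noisy copy of record 42.
random.seed(0)
corpus = [[random.gauss(0, 1) for _ in range(768)] for _ in range(500)]
codes = [binarize(v) for v in corpus]
query = [x + random.gauss(0, 0.1) for x in corpus[42]]
hits = two_stage_search(query, corpus, codes)
print(index_bytes, hits[0])
```

The cheap binary pass only has to place the true match somewhere in the top 100; the full-precision re-rank is what decides the final order.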

Quickstart

Prerequisites: Python 3.12+, uv, ~50GB disk.

1. Setup

git clone https://github.com/jayshah5696/entity-resolution-poc
cd entity-resolution-poc
uv sync

2. Data Generation

Generate profiles, training triplets, and evaluation queries.

# Generate 1.2M base profiles and split into index / triplets / eval (~20 min on M3)
uv run python src/data/generate.py --config configs/dataset.yaml --output-dir data/

# Build training triplets from the 200K triplet_source split (~10 min)
uv run python src/data/triplets.py \
    --config configs/dataset.yaml \
    --profiles data/processed/triplet_source.parquet \
    --output-dir data/triplets/

# Build eval query set (10K queries across 6 buckets, ~2 min)
uv run python src/data/eval_set.py \
    --config configs/dataset.yaml \
    --eval-profiles data/eval/eval_profiles.parquet \
    --output-dir data/eval/
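
As a rough picture of what a training triplet looks like, the sketch below corrupts a clean profile and pairs it with its source and an unrelated record. Everything here is an assumption for illustration: the field names, the four corruption types (taken from the failure modes listed under The Problem), the anchor/positive roles, and random negative sampling. The actual schema and corruption engine are specified in docs/dataset-design.md.

```python
import random

def corrupt(profile: dict, rng: random.Random) -> dict:
    """Apply one randomly chosen corruption and return a new profile."""
    out = dict(profile)
    kind = rng.choice(["abbreviate", "typo", "drop_field", "swap_domain"])
    if kind == "abbreviate" and out["first_name"]:
        out["first_name"] = out["first_name"][0] + "."
    elif kind == "typo" and len(out["last_name"]) >= 2:
        name = out["last_name"]
        i = rng.randrange(len(name) - 1)
        # Transpose two adjacent characters.
        out["last_name"] = name[:i] + name[i + 1] + name[i] + name[i + 2:]
    elif kind == "drop_field":
        out["company"] = ""
    elif kind == "swap_domain":
        local = out["email"].split("@")[0]
        out["email"] = local + "@gmail.com"
    return out

def make_triplet(profiles: list, idx: int, rng: random.Random) -> dict:
    """Anchor = corrupted record, positive = its clean source, negative = another profile."""
    neg_idx = rng.choice([j for j in range(len(profiles)) if j != idx])
    return {
        "anchor": corrupt(profiles[idx], rng),
        "positive": profiles[idx],
        "negative": profiles[neg_idx],
    }

# Hypothetical profiles for demonstration only.
profiles = [
    {"first_name": "Jonathan", "last_name": "Smith", "email": "jonathan.smith@acme.com", "company": "Acme Corp"},
    {"first_name": "Maria", "last_name": "Lopez", "email": "maria.lopez@globex.com", "company": "Globex"},
]
triplet = make_triplet(profiles, 0, random.Random(0))
print(triplet["anchor"])
```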

3. Evaluation

Quantization & Index Derivation

When building a dense index, use the --quantization flag to control memory usage. To test multiple Matryoshka (MRL) dimensions and quantization levels without re-running the expensive model encoding, you can derive a new index directly from an existing one:

# 1. Build the base index (heavy ML encode - run once)
uv run python src/eval/build_index.py --model gte_modernbert_base --serialization pipe --quantization fp32 --index-profiles data/processed/index.parquet --eval-profiles data/eval/eval_profiles.parquet --output-dir results/indexes/gte_modernbert_base_pipe_fp32 --device mps

# 2. Derive a 64-dim int8 index (instant CPU-bound slice and quantize)
uv run python src/eval/build_index.py --source-index results/indexes/gte_modernbert_base_pipe_fp32 --output-dir results/indexes/gte_64_int8 --truncate-dim 64 --quantization int8

# 3. Evaluate the derived index
uv run python src/eval/run_eval.py --model gte_modernbert_base --index-dir results/indexes/gte_64_int8 --eval-queries data/eval/eval_queries.parquet --output results/gte_64_int8.json --serialization pipe
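
The derive step in command 2 is cheap because it only slices and rescales vectors that already exist. The sketch below shows one plausible way to do that, assuming Matryoshka-ordered embeddings, re-normalization after truncation, and a fixed symmetric INT8 scale of 127. These helper names and the scaling scheme are assumptions, not the logic in src/eval/build_index.py, which may calibrate ranges differently.

```python
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` values and rescale the prefix to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

def quantize_int8(vec):
    """Map unit-norm floats in [-1, 1] to integers in [-127, 127]."""
    return [max(-127, min(127, round(x * 127))) for x in vec]

def derive(embeddings, dim=64):
    """Derive a truncated, INT8-quantized index from full-precision embeddings."""
    return [quantize_int8(truncate_and_normalize(v, dim)) for v in embeddings]

# Tiny 4-dim stand-ins for the FP32 base index, truncated to 2 dims.
base = [[0.6, 0.8, 0.0, 0.0], [0.0, 3.0, 4.0, 1.0]]
derived = derive(base, dim=2)
print(derived)
```

No model forward pass is involved, which is why many dimension and quantization combinations can be evaluated from a single FP32 encode.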

# Build BM25 index and run evaluation
uv run python src/eval/build_index.py --model bm25_baseline --serialization pipe --index-profiles data/processed/index.parquet --eval-profiles data/eval/eval_profiles.parquet --output-dir results/indexes/bm25_pipe --models-config configs/models.yaml
uv run python src/eval/run_bm25.py --index-dir results/indexes/bm25_pipe --eval-queries data/eval/eval_queries.parquet --output results/001_bm25_pipe.json --serialization pipe --experiment-id 001

# Build Dense embedding index and run evaluation (FP32)
uv run python src/eval/build_index.py --model gte_modernbert_base --serialization pipe --quantization fp32 --index-profiles data/processed/index.parquet --eval-profiles data/eval/eval_profiles.parquet --output-dir results/indexes/gte_modernbert_base_pipe_fp32 --device mps
uv run python src/eval/run_eval.py --model gte_modernbert_base --index-dir results/indexes/gte_modernbert_base_pipe_fp32 --eval-queries data/eval/eval_queries.parquet --output results/004_gte_modernbert_pipe_fp32.json --serialization pipe --experiment-id 004

# Evaluate Fine-Tuned pplx-embed model (background)
nohup bash -c 'uv run python src/eval/build_index.py --model pplx_embed_v1_06b --model-path jayshah5696/er-pplx-embed-v1-06b-pipe-ft --serialization pipe --quantization fp32 --index-profiles data/processed/index.parquet --eval-profiles data/eval/eval_profiles.parquet --output-dir results/indexes/pplx_embed_v1_06b_ft_pipe --device mps && uv run python src/eval/run_eval.py --model pplx_embed_v1_06b --model-path jayshah5696/er-pplx-embed-v1-06b-pipe-ft --index-dir results/indexes/pplx_embed_v1_06b_ft_pipe --eval-queries data/eval/eval_queries.parquet --output results/pplx_embed_v1_06b_ft_pipe.json --serialization pipe --experiment-id pplx_embed_v1_06b_ft' > eval_pplx_embed_ft.log 2>&1 &

# Aggregate results
uv run python src/eval/aggregate.py --results-dir results/ --output-csv results/master_results.csv --output-report results/report.md
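
The results in this README are reported as R@10 and MRR@10. The sketch below shows how these are commonly computed, assuming exactly one true match per query. The function names and the input shape are illustrative; the authoritative definitions are in docs/evaluation-protocol.md.

```python
def recall_at_k(ranked_ids, true_id, k=10):
    """1.0 if the true match appears in the top k results, else 0.0."""
    return 1.0 if true_id in ranked_ids[:k] else 0.0

def mrr_at_k(ranked_ids, true_id, k=10):
    """Reciprocal of the true match's rank within the top k, or 0.0 if absent."""
    for rank, rid in enumerate(ranked_ids[:k], start=1):
        if rid == true_id:
            return 1.0 / rank
    return 0.0

def evaluate(results, k=10):
    """results: list of (ranked_ids, true_id) pairs. Returns metrics averaged over queries."""
    n = len(results)
    return {
        f"recall@{k}": sum(recall_at_k(r, t, k) for r, t in results) / n,
        f"mrr@{k}": sum(mrr_at_k(r, t, k) for r, t in results) / n,
    }

# Three toy queries: hit at rank 1, hit at rank 2, and a miss.
results = [
    (["a", "b", "c"], "a"),
    (["x", "y", "z"], "y"),
    (["p", "q"], "missing"),
]
scores = evaluate(results)
print(scores)
```

Recall only asks whether the match was retrieved at all, while MRR also rewards placing it near the top, so the two can diverge on the same run.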

4. Fine-Tuning (Modal)

Fine-tune all 5 models in parallel on Modal A10G.

# Push training data to HuggingFace Hub (one-time local setup)
export HF_HUB_DISABLE_XET=1
hf auth login
hf upload jayshah5696/entity-resolution-triplets data/triplets/triplets.parquet triplets.parquet --repo-type dataset

# Run the training
modal run src/models/finetune_modal.py::run_all
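
For intuition about the training objective, here is a dependency-free sketch of Multiple Negatives Ranking Loss wrapped in a Matryoshka-style sum over prefix lengths. It is a conceptual illustration only: the `scale` value, the prefix lengths, and the toy vectors are assumptions, explicit negatives from the triplets are omitted (they would be appended as extra candidates), and the real training code in src/models/finetune_modal.py is not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector has zero length."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)

def mnrl_loss(anchors, positives, scale=20.0):
    """In-batch negatives: each anchor should score its own positive above
    every other positive in the batch (softmax cross-entropy over similarities)."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [scale * cosine(a, p) for p in positives]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(anchors)

def matryoshka_mnrl_loss(anchors, positives, dims=(2, 4)):
    """Sum the loss over several prefix lengths so truncated embeddings stay useful."""
    return sum(
        mnrl_loss([a[:d] for a in anchors], [p[:d] for p in positives])
        for d in dims
    )

# Toy 4-dim embeddings: positives[i] is close to anchors[i].
anchors = [[1.0, 0.1, 0.5, 0.0], [0.1, 1.0, 0.0, 0.5]]
positives = [[0.9, 0.2, 0.4, 0.1], [0.2, 0.9, 0.1, 0.4]]
aligned = matryoshka_mnrl_loss(anchors, positives)
mismatched = matryoshka_mnrl_loss(anchors, positives[::-1])
print(aligned, mismatched)
```

The loss is near zero when each anchor sits next to its own positive at every prefix length, and grows when the pairs are mismatched.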

5. Tests

uv run pytest tests/ -v

Models

#	Model	Params	Dims	MRL	License	Role
1	BM25 (rank_bm25)	--	--	--	Apache	Lexical baseline
2	all-MiniLM-L6-v2	22M	384	No	Apache	Absolute floor
3	bge-small-en-v1.5	33M	384	Yes	MIT	Efficiency baseline
4	gte-modernbert-base	149M	768	Yes	Apache	Primary candidate
5	nomic-embed-text-v1.5	137M	768	Yes	Apache	MRL reference
6	pplx-embed-v1-0.6b	600M	1536	Yes	Apache	Zero-shot ceiling

Note: nomic requires search_query: / search_document: prefixes. pplx uses separate system prompts. See docs/UNDERSTANDING.md.
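
A minimal sketch of applying the nomic prefixes before encoding. The `search_query: ` and `search_document: ` strings come from the note above; the field names, the ` | ` layout assumed for the pipe serialization, and the helper names are hypothetical. pplx system prompts are not covered here.

```python
NOMIC_PREFIXES = {"query": "search_query: ", "document": "search_document: "}

def serialize_pipe(profile: dict, fields=("first_name", "last_name", "email", "company")) -> str:
    """Join field values with ' | ' (assumed layout for the 'pipe' serialization)."""
    return " | ".join(str(profile.get(f, "")) for f in fields)

def prepare_text(profile: dict, model_name: str, role: str) -> str:
    """Serialize a profile and add the role prefix for models that need one."""
    text = serialize_pipe(profile)
    if model_name.startswith("nomic"):
        return NOMIC_PREFIXES[role] + text
    return text

# Hypothetical profile for demonstration only.
profile = {"first_name": "Jon", "last_name": "Smith", "email": "jon@acme.com", "company": "Acme"}
query_text = prepare_text(profile, "nomic-embed-text-v1.5", "query")
doc_text = prepare_text(profile, "nomic-embed-text-v1.5", "document")
plain_text = prepare_text(profile, "gte-modernbert-base", "query")
print(query_text)
```

Queries and indexed records get different prefixes, so the same profile produces different input text depending on which side of the retrieval it is on.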

Repository

configs/        model registry, dataset config, finetune hyperparams, eval settings
docs/           research design, dataset spec, evaluation protocol, model notes
src/
  data/         profile generation, corruption engine, serialization, triplet building
  models/       embedding wrappers and BM25
  eval/         retrieval evaluation harness
  utils/        nicknames, config loading
experiments/    per-experiment tracking
data/           raw profiles, processed records, triplets, eval sets
models/         fine-tuned checkpoints
results/        JSON per experiment + master CSV
tests/          pytest suite

Key Findings

Our 53-experiment ablation study, covering dimensionality reduction, model capacity, and quantization, produced three core findings:

GTE-ModernBERT (149M) achieves the highest recall on heavily degraded data: 0.966 overall R@10, compared with 0.958 for the BM25 baseline.
Nomic-embed-text-v1.5 suffered catastrophic forgetting when exposed to structured text without instruction prefixes: retrieval performance on missing-field queries dropped from 48.8% to 15.2%.
The Pareto winner: MiniLM-L6 at 128 dimensions with INT8 quantization, which combines Matryoshka truncation with integer quantization, serves results in under 7 ms while keeping the index footprint under 700 MB.

You can read the entire published writeup in BLOG_POST.md.

Experiment Log

ID	Name	Status	Key Result
001	BM25 baseline (pipe, 1M index)	✅ done	`0.917` overall MRR@10
002	Nomic v1.5 zero-shot (pipe)	✅ done	Performance broke without prefixes
003	GTE-ModernBERT finetuned	✅ done	Highest Recall: R@10 jumps to 0.798 on corruptions
004	BGE-small finetuned	✅ done	Stable performance across all query buckets
005	MiniLM-L6 finetuned	✅ done	The Pareto optimal tradeoff model
006	Quantization ablation via LanceDB	✅ done	Binary quantization collapses below 128 dims; INT8 is stable

Detailed breakdowns are available in the generated plots under results/plots/.
