A convenient way to link, deduplicate, aggregate and cluster data(frames) in Python using deep learning

dell-research-harvard/linktransformer

LinkTransformer

LinkTransformer is a Python package for semantic record linkage, candidate retrieval, row transformation, clustering, and text classification over tabular data.

Tutorials

More tutorials are coming soon.

Installation

pip install linktransformer

Quick Start

import os
import pandas as pd
import linktransformer as lt

left_df = pd.DataFrame({"CompanyName": ["Tech Corporation"], "Country": ["USA"]})
right_df = pd.DataFrame({"CompanyName": ["Tech Corp"], "Country": ["USA"]})

out = lt.merge(
    left_df,
    right_df,
    on=["CompanyName", "Country"],
    model="sentence-transformers/all-MiniLM-L6-v2",
)
print(out[["CompanyName_x", "CompanyName_y", "score"]])

New release: merge_k_judge (End-to-End Record Linkage)

merge_k_judge is the recommended end-to-end linkage API when you want both embedding-based retrieval and LLM adjudication with confidence scores.

  1. Retrieve top-k candidates with embeddings (merge_knn)
  2. Judge each candidate pair with an LLM
  3. Return match decisions and confidence scores

judged = lt.merge_k_judge(
    df1=left_df,
    df2=right_df,
    on=["CompanyName", "Country"],
    k=5,
    knn_sbert_model="sentence-transformers/all-MiniLM-L6-v2",
    judge_llm_model="gpt-4o-mini",
    llm_provider="openai",
    openai_key=os.getenv("OPENAI_API_KEY"),
)

# key output columns:
# - score (retrieval similarity)
# - is_match (bool)
# - confidence (float in [0, 1] when available)
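
The retrieve, judge, and return steps can be pictured with a small, library-free sketch. This is not LinkTransformer's implementation: character-trigram counts stand in for sentence embeddings, `stub_judge` is a hypothetical placeholder for the LLM call, and every function name below is an illustrative assumption.

```python
import math
from collections import Counter


def trigram_vector(text):
    """Toy stand-in for an embedding: counts of character trigrams."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query, candidates, k):
    """Step 1: rank candidates by similarity and keep the best k."""
    query_vec = trigram_vector(query)
    scored = [(cand, cosine(query_vec, trigram_vector(cand))) for cand in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def stub_judge(left, right, score):
    """Step 2 (hypothetical): a real pipeline would ask an LLM about the pair."""
    is_match = score >= 0.5
    confidence = score if is_match else 1.0 - score
    return is_match, confidence


def retrieve_then_judge(query, candidates, k=2, judge=stub_judge):
    """Step 3: one record per retrieved pair, with decision and confidence."""
    rows = []
    for cand, score in retrieve_top_k(query, candidates, k):
        is_match, confidence = judge(query, cand, score)
        rows.append({"left": query, "right": cand, "score": score,
                     "is_match": is_match, "confidence": confidence})
    return rows


results = retrieve_then_judge(
    "Tech Corporation", ["Tech Corp", "Acme Ltd", "Technology Corp"]
)
for row in results:
    print(row)
```

Keeping retrieval and judging as separate functions mirrors the split between the cheap embedding step and the more expensive adjudication step: only the k retrieved candidates per row ever reach the judge.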

You can also combine providers (for example OpenAI embeddings retrieval + Gemini judge) by setting knn_api_model, judge_llm_model, and llm_provider explicitly.

Core APIs

1) Link two dataframes

  • lt.merge(...): semantic 1:1 / 1:m / m:1 linkage.
  • lt.merge_knn(...): top-k candidate retrieval.
  • lt.merge_blocking(...): run the merge within blocks, so that fuzzy matching happens only among rows that match exactly on the blocking columns.
  • lt.aggregate_rows(...): map fine-grained rows to coarser labels.

matches = lt.merge_knn(
    left_df,
    right_df,
    on=["CompanyName", "Country"],
    model="sentence-transformers/all-MiniLM-L6-v2",
    k=3,
)
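
To illustrate the blocking idea behind lt.merge_blocking, here is a minimal sketch in plain Python. It assumes nothing about the library's internals: `difflib.SequenceMatcher` stands in for embedding similarity, and `blocked_merge` and its parameters are hypothetical names chosen for this example.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in for embedding similarity: a character-level ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def blocked_merge(left_rows, right_rows, block_key, match_key):
    """For each left row, pick the most similar right row within the same block."""
    blocks = defaultdict(list)
    for row in right_rows:
        blocks[row[block_key]].append(row)

    matches = []
    for left in left_rows:
        candidates = blocks.get(left[block_key], [])
        if not candidates:
            continue  # no right row shares this exact block value
        scored = [(similarity(left[match_key], right[match_key]), right)
                  for right in candidates]
        best_score, best = max(scored, key=lambda item: item[0])
        matches.append((left[match_key], best[match_key], best_score))
    return matches


left_rows = [
    {"CompanyName": "Tech Corporation", "Country": "USA"},
    {"CompanyName": "Tech Corporation", "Country": "Canada"},
]
right_rows = [
    {"CompanyName": "Tech Corp", "Country": "USA"},
    {"CompanyName": "Tech Corporation Ltd", "Country": "UK"},
]

blocked = blocked_merge(left_rows, right_rows, block_key="Country", match_key="CompanyName")
print(blocked)
```

In this sketch the Canadian row finds no partner, because no right-hand row shares its Country value, even though a similar name exists in another block.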

2) Transform rows with LLM prompts

Use lt.transform_rows(...) to normalize, rewrite, or standardize values in one or more columns, for example to fix OCR errors in a column or to standardize names.

cleaned = lt.transform_rows(
    left_df,
    on=["CompanyName", "Country"],
    model="gpt-4o-mini",
    openai_key=os.getenv("OPENAI_API_KEY"),
    openai_prompt=(
        "Standardize organization names and country strings for record linkage. "
        "Return a JSON list in the same order."
    ),
)
# adds: transformed_CompanyName-Country

3) Cluster and deduplicate

  • lt.cluster_rows(...): cluster semantically similar rows.
  • lt.dedup_rows(...): cluster rows, then keep one representative row per cluster.

deduped = lt.dedup_rows(
    left_df,
    on="CompanyName",
    model="sentence-transformers/all-MiniLM-L6-v2",
    cluster_type="agglomerative",
    cluster_params={"threshold": 0.7},
)
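
The relationship between clustering and deduplication can be sketched without the library. The example below is a simplified single-linkage variant, not the clustering LinkTransformer actually runs: `difflib` distance replaces embedding distance, `max_distance` is a hypothetical parameter (it is not claimed to behave like the `threshold` shown above), and the function names are illustrative.

```python
from difflib import SequenceMatcher


def distance(a, b):
    """Stand-in for embedding distance: 1 minus a character-level ratio."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_labels(values, max_distance):
    """Single-linkage clustering: link any two values closer than max_distance."""
    parent = list(range(len(values)))

    def find(i):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if distance(values[i], values[j]) <= max_distance:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(values))]


def keep_representatives(values, max_distance):
    """Deduplicate by keeping the first value seen in each cluster."""
    seen, kept = set(), []
    for value, label in zip(values, cluster_labels(values, max_distance)):
        if label not in seen:
            seen.add(label)
            kept.append(value)
    return kept


names = ["Tech Corporation", "Tech Corp", "Acme Ltd", "ACME Limited"]
labels = cluster_labels(names, max_distance=0.4)
kept = keep_representatives(names, max_distance=0.4)
print(labels)
print(kept)
```

Choosing the first row of each cluster as the representative is only one possible policy; any rule that picks exactly one row per cluster label yields a deduplicated table.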

4) Evaluate matched pairs

  • lt.evaluate_pairs(...): similarity over known pairs.
  • lt.all_pair_combos_evaluate(...): dense pairwise scoring.
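
The difference between scoring known pairs and dense pairwise scoring is easiest to see side by side. This sketch makes no claim about the signatures of the two functions above: `score_known_pairs` and `score_all_combos` are hypothetical helpers, and `difflib` similarity stands in for embedding similarity.

```python
from difflib import SequenceMatcher
from itertools import product


def similarity(a, b):
    """Stand-in for embedding similarity: a character-level ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score_known_pairs(pairs):
    """Score only the pairs that are already aligned: one score per pair."""
    return [(left, right, similarity(left, right)) for left, right in pairs]


def score_all_combos(left_values, right_values):
    """Score every left/right combination: len(left) * len(right) scores."""
    return {(left, right): similarity(left, right)
            for left, right in product(left_values, right_values)}


known_pairs = [("Tech Corporation", "Tech Corp"), ("Acme Ltd", "ACME Limited")]
known_scores = score_known_pairs(known_pairs)

lefts = [left for left, _ in known_pairs]
rights = [right for _, right in known_pairs]
dense_scores = score_all_combos(lefts, rights)

print(known_scores)
print(dense_scores)
```

Scoring known pairs grows linearly with the number of pairs, while dense scoring grows with the product of the two table sizes, so the dense form is best reserved for small inputs or diagnostic checks.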

5) Classification

  • lt.classify_rows(...): classify rows with HF or OpenAI chat models.
  • lt.train_clf_model(...): train a custom row classifier.

6) Train linkage models

  • lt.train_model(...): train a linkage model from paired or clustered data.

Provider Notes

  • OpenAI key: set OPENAI_API_KEY or pass openai_key.
  • Gemini key: set GEMINI_API_KEY or pass gemini_key.
  • API embedding models and local SBERT models are both supported.
  • For multi-column API retrieval, LinkTransformer serializes columns safely using <SEP>.
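
As a rough illustration of the two conventions in these notes, the sketch below resolves an API key from either an explicit argument or an environment variable, and joins several column values into one string with a `<SEP>` separator. Both helpers are hypothetical, and the exact serialized format LinkTransformer produces (spacing, escaping) is an assumption rather than something stated here.

```python
import os


def resolve_api_key(explicit_key=None, env_var="OPENAI_API_KEY"):
    """Prefer a key passed directly; otherwise fall back to the environment."""
    if explicit_key:
        return explicit_key
    return os.environ.get(env_var)  # None when the variable is not set


def serialize_row(row, columns, sep="<SEP>"):
    """Join the selected column values into one string for an embedding API."""
    return sep.join(str(row[col]) for col in columns)


row = {"CompanyName": "Tech Corp", "Country": "USA"}
serialized = serialize_row(row, ["CompanyName", "Country"])
print(serialized)
```

Using a separator that is unlikely to occur in the data keeps column boundaries recoverable, which is the usual reason to prefer a dedicated token over a plain space or comma.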

Test Naming Convention

Tests use test_lt_* naming to mirror the package API surface and make workflows discoverable.

Contributing

Issues and pull requests are welcome.

License

This project is licensed under the MIT License. See LICENSE.

Maintainers

  • Sam Jones (samuelcaronnajones)
  • Abhishek Arora (econabhishek)
  • Yiyang Chen (oooyiyangc)
