LinkTransformer is a Python package for semantic record linkage, candidate retrieval, row transformation, clustering, and text classification over tabular data.
- Paper: https://arxiv.org/abs/2309.00789
- Website: https://linktransformer.github.io/
- Demo video: https://www.youtube.com/watch?v=Sn47nmCvV9M
- Link records with LinkTransformer: https://colab.research.google.com/drive/1OqUB8sqpUvrnC8oa_1RoOUzV6DaAKL4N?usp=sharing
- Train your own LinkTransformer model: https://colab.research.google.com/drive/1tHitPGjMMI2Nvh4wwA8rdcbYfbLaJDvg?usp=sharing
- Classify text with LinkTransformer: https://colab.research.google.com/drive/1hSh_p8j7LP2RfdtxrPslOfnogC_CbYw5?usp=sharing
- Demo app (Hugging Face Space): https://huggingface.co/spaces/96abhishekarora/linktransformer_merge
- Feature deck: https://www.dropbox.com/scl/fi/dquxru8bndlyf9na14cw6/A-python-package-to-do-easy-record-linkage-using-Transformer-models.pdf?rlkey=fiv7j6c0vgl901y940054eptk&dl=0
More tutorials are coming soon.
pip install linktransformer

import os
import pandas as pd
import linktransformer as lt
left_df = pd.DataFrame({"CompanyName": ["Tech Corporation"], "Country": ["USA"]})
right_df = pd.DataFrame({"CompanyName": ["Tech Corp"], "Country": ["USA"]})
out = lt.merge(
    left_df,
    right_df,
    on=["CompanyName", "Country"],
    model="sentence-transformers/all-MiniLM-L6-v2",
)
print(out[["CompanyName_x", "CompanyName_y", "score"]])

merge_k_judge is the recommended end-to-end linkage API when you want both retrieval and LLM adjudication with confidence.
- Retrieve top-k candidates with embeddings (merge_knn)
- Judge each candidate pair with an LLM
- Return match decisions and confidence scores
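
The three steps above can be sketched without any model at all. The snippet below is only an illustration of the control flow, not LinkTransformer's implementation: `similarity` uses `difflib` as a stand-in for embedding similarity, `stub_judge` is a hypothetical placeholder for the LLM call, and the 0.7 cut-off is an arbitrary assumption.

```python
import heapq
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in for embedding similarity (assumption: NOT what the package uses)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def retrieve_top_k(query: str, candidates: list[str], k: int) -> list[tuple[float, str]]:
    """Step 1: keep the k most similar candidates for one query, best first."""
    scored = [(similarity(query, c), c) for c in candidates]
    return heapq.nlargest(k, scored)


def stub_judge(left: str, right: str, score: float) -> tuple[bool, float]:
    """Step 2: hypothetical placeholder for the LLM judge.

    A real judge would read both records; this one only thresholds the score.
    """
    is_match = score >= 0.7  # arbitrary illustrative threshold
    confidence = score if is_match else 1.0 - score
    return is_match, round(confidence, 2)


def retrieve_then_judge(queries: list[str], candidates: list[str], k: int = 2) -> list[dict]:
    """Step 3: one output row per retrieved pair, with decision and confidence."""
    rows = []
    for q in queries:
        for score, cand in retrieve_top_k(q, candidates, k):
            is_match, confidence = stub_judge(q, cand, score)
            rows.append({
                "left": q,
                "right": cand,
                "score": round(score, 2),
                "is_match": is_match,
                "confidence": confidence,
            })
    return rows


rows = retrieve_then_judge(["Tech Corporation"], ["Tech Corp", "Acme Ltd", "Techno Corp"])
for row in rows:
    print(row)
```

Retrieval narrows the search to k pairs per left row, so the comparatively expensive judge runs k times per row rather than once per possible pair.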
judged = lt.merge_k_judge(
    df1=left_df,
    df2=right_df,
    on=["CompanyName", "Country"],
    k=5,
    knn_sbert_model="sentence-transformers/all-MiniLM-L6-v2",
    judge_llm_model="gpt-4o-mini",
    llm_provider="openai",
    openai_key=os.getenv("OPENAI_API_KEY"),
)
# key output columns:
# - score (retrieval similarity)
# - is_match (bool)
# - confidence (float in [0, 1] when available)

You can also combine providers (for example, OpenAI embeddings for retrieval plus a Gemini judge) by setting knn_api_model, judge_llm_model, and llm_provider explicitly.
- lt.merge(...): semantic 1:1 / 1:m / m:1 linkage.
- lt.merge_knn(...): top-k candidate retrieval.
- lt.merge_blocking(...): run the merge within blocks, so fuzzy matching happens only among rows that agree exactly on the blocking columns.
- lt.aggregate_rows(...): map fine-grained rows to coarser labels.
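
To make the blocking idea concrete, here is a minimal sketch in plain Python. It does not call LinkTransformer: `merge_within_blocks` and `token_jaccard` are hypothetical helpers, and word-overlap similarity is used only as a stand-in for semantic similarity.

```python
from collections import defaultdict


def token_jaccard(a: str, b: str) -> float:
    """Word-overlap similarity; a stand-in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def merge_within_blocks(left, right, block_key, fuzzy_key):
    """Fuzzy-match each left record only against right records in the same block."""
    # Group the right-hand records by their exact blocking value.
    blocks = defaultdict(list)
    for rec in right:
        blocks[rec[block_key]].append(rec)

    matches = []
    for rec in left:
        candidates = blocks.get(rec[block_key], [])
        if not candidates:
            continue  # no right-hand record shares this blocking value
        best = max(candidates, key=lambda c: token_jaccard(rec[fuzzy_key], c[fuzzy_key]))
        matches.append((rec[fuzzy_key], best[fuzzy_key], rec[block_key]))
    return matches


left = [
    {"CompanyName": "Tech Corporation", "Country": "USA"},
    {"CompanyName": "Maple Foods", "Country": "Canada"},
]
right = [
    {"CompanyName": "Tech Corp", "Country": "USA"},
    {"CompanyName": "Tech Corp", "Country": "Canada"},
    {"CompanyName": "Maple Food Inc", "Country": "Canada"},
]

matches = merge_within_blocks(left, right, block_key="Country", fuzzy_key="CompanyName")
print(matches)
```

Because candidates are restricted to the same Country, the Canadian "Tech Corp" is never compared with the US "Tech Corporation".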
matches = lt.merge_knn(
    left_df,
    right_df,
    on=["CompanyName", "Country"],
    model="sentence-transformers/all-MiniLM-L6-v2",
    k=3,
)

Use lt.transform_rows(...) to normalize, rewrite, or standardize values in one or more columns, e.g. to fix OCR errors in a column or to standardize names.
cleaned = lt.transform_rows(
    left_df,
    on=["CompanyName", "Country"],
    model="gpt-4o-mini",
    openai_key=os.getenv("OPENAI_API_KEY"),
    openai_prompt=(
        "Standardize organization names and country strings for record linkage. "
        "Return a JSON list in the same order."
    ),
)
# adds: transformed_CompanyName-Country

- lt.cluster_rows(...): cluster semantically similar rows.
- lt.dedup_rows(...): cluster + keep representative rows.
deduped = lt.dedup_rows(
    left_df,
    on="CompanyName",
    model="sentence-transformers/all-MiniLM-L6-v2",
    cluster_type="agglomerative",
    cluster_params={"threshold": 0.7},
)

- lt.evaluate_pairs(...): similarity over known pairs.
- lt.all_pair_combos_evaluate(...): dense pairwise scoring.
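
The difference between scoring known pairs and dense pairwise scoring can be illustrated with toy vectors. This is a sketch of the concept only: the three-dimensional "embeddings" below are made up, and nothing here reflects the actual signatures or outputs of lt.evaluate_pairs or lt.all_pair_combos_evaluate.

```python
import math
from itertools import product


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Made-up vectors standing in for model embeddings.
left_vecs = {
    "Tech Corporation": [1.0, 0.0, 1.0],
    "Maple Foods": [0.0, 1.0, 0.0],
}
right_vecs = {
    "Tech Corp": [1.0, 0.0, 0.9],
    "Maple Food Inc": [0.1, 1.0, 0.0],
}

# Known pairs: score only the pairs you already believe correspond.
known_pairs = [("Tech Corporation", "Tech Corp"), ("Maple Foods", "Maple Food Inc")]
pair_scores = {(l, r): cosine(left_vecs[l], right_vecs[r]) for l, r in known_pairs}

# Dense scoring: score every left/right combination (n_left * n_right pairs).
dense_scores = {
    (l, r): cosine(left_vecs[l], right_vecs[r])
    for l, r in product(left_vecs, right_vecs)
}

print(len(pair_scores), "known-pair scores;", len(dense_scores), "dense scores")
```

Dense scoring grows with the product of the two table sizes, so it is best suited to small tables or to inspecting how a model separates true pairs from everything else.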
- lt.classify_rows(...): classify rows with HF or OpenAI chat models.
- lt.train_clf_model(...): train a custom row classifier.

- lt.train_model(...): train a linkage model from paired or clustered data.
- OpenAI key: set OPENAI_API_KEY or pass openai_key.
- Gemini key: set GEMINI_API_KEY or pass gemini_key.
- API embedding models and local SBERT models are both supported.
- For multi-column API retrieval, LinkTransformer serializes columns safely using <SEP>.
Tests use test_lt_* naming to mirror the package API surface and make workflows discoverable.
Issues and pull requests are welcome.
This project is licensed under the MIT License. See LICENSE.
- Sam Jones (samuelcaronnajones)
- Abhishek Arora (econabhishek)
- Yiyang Chen (oooyiyangc)
