ECMClassifier returns almost all candidate pairs #193

@Evnsn

import recordlinkage
from recordlinkage.index import Block
from recordlinkage.compare import String
from recordlinkage.datasets import load_febrl3

df, true_links = load_febrl3(return_links=True)

# Generate candidate pairs
indexer = recordlinkage.Index([
    Block("date_of_birth")
])

candidate_pairs = indexer.index(df)

print(len(candidate_pairs))  # Prints 5966

# Generate comparison vectors
comparer = recordlinkage.Compare([
    String("given_name", "given_name", method="jarowinkler", label="given_name"),
    String("surname", "surname", method="jarowinkler", label="surname"),
    String("soc_sec_id", "soc_sec_id", method="jarowinkler", label="soc_sec_id"),
    String("address_1", "address_1", method="jarowinkler", label="address_1"),
])

comparison_vector = comparer.compute(candidate_pairs, df)

# Match entities
ecm = recordlinkage.ECMClassifier(binarize=0.1)

pred_links = ecm.fit_predict(comparison_vector)

print(len(pred_links))  # Prints 5836

The snippet above reproduces my problem on the FEBRL3 dataset: blocking on date_of_birth yields 5966 candidate pairs, and ECMClassifier returns 5836 of them as matches.

Problem: I want to use ECMClassifier for entity matching. However, when I apply it to my own dataset, all of the candidate pairs are classified as matches.

Is there a parameter I can set to adjust the threshold between match and non-match, or am I missing something else?
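
One thing that may be worth checking is the binarize=0.1 setting. The sketch below is not a confirmed diagnosis. It assumes that binarize maps every similarity above the threshold to 1 before the ECM model is fitted, which is how scikit-learn-style binarization works. The scores are made-up values and the binarize helper is a hypothetical stand-in, not the library's own function.

```python
# Hypothetical Jaro-Winkler scores for five candidate pairs (made-up values).
scores = [0.42, 0.55, 0.61, 0.93, 1.00]


def binarize(values, threshold):
    """Map each similarity to 1 if it is strictly above the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]


low = binarize(scores, 0.1)    # threshold used in the snippet above
high = binarize(scores, 0.85)  # stricter, purely illustrative threshold

print(low)   # [1, 1, 1, 1, 1]
print(high)  # [0, 0, 0, 1, 1]
```

If that assumption holds, a threshold of 0.1 would turn nearly every Jaro-Winkler score into an "agree" feature, leaving the classifier little signal to separate matches from non-matches. A higher threshold might be worth trying.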
