Add active learning API for predicate-based blocking and matching #54

@jstammers

The active learning method that dedupe implements to learn a minimal set of blocking predicates can be very useful, particularly when the user has little prior knowledge of an appropriate set of blocking rules. In my own usage, I've found it tricky to balance recall and precision when blocking on a single feature, e.g. zip code. More complex predicates would most likely help reduce the number of candidate pairs, so having a semi-supervised method to learn them could reduce the manual work required.

In practice, I have found that dedupe's implementation scales quite poorly: I frequently hit memory bottlenecks when blocking on datasets of more than 10k rows. I'm hoping that by using duckdb via ibis we may be able to do something more performant.

At a high-level, the active learner works as follows:

  • define a schema for the dataset, mapping columns to variable types (String, ShortString, Datetime, Integer, Text, LatLong etc.)
  • define a BlockLearner that learns to classify the predicates that result in known matches
  • define a MatchLearner that learns to classify matches based on features generated from record pairs
  • create a set of candidate predicates for blocking using the schema
  • generate initial candidates using the BlockLearner
  • seed the training data by marking a random pair of records as distinct and a record paired with itself as a match, then fit the learners
  • score the candidate pairs using the two learners
  • present a new record pair to the user:
    • if there are records that the MatchLearner predicts are a match, but are not covered by a blocking rule, choose one of these, weighted by the match likelihood
    • otherwise, if any matches are predicted, sample from the predicted matches, weighted by the match likelihood
    • otherwise, if no matches are predicted, sample from the pairs, weighted by the disagreement between the classifiers
  • the user labels the pair as match/distinct/not sure and the classifiers are re-fit using the additional labelled example
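
The pair-selection step above could be sketched as follows. This is only an illustration, not dedupe's or mismo's code: `choose_pair`, its arguments, the 0.5 threshold, and measuring disagreement as the absolute difference between the two learners' probabilities are all assumptions made for the example.

```python
import random


def choose_pair(pairs, match_prob, block_prob, covered, threshold=0.5, rng=random):
    """Pick the next record pair to show the user (hypothetical helper).

    pairs      -- list of hashable pair identifiers
    match_prob -- dict: pair -> probability from the MatchLearner
    block_prob -- dict: pair -> probability from the BlockLearner
    covered    -- set of pairs already covered by some blocking rule
    """
    if not pairs:
        raise ValueError("no candidate pairs to sample from")

    predicted = [p for p in pairs if match_prob[p] >= threshold]
    uncovered = [p for p in predicted if p not in covered]

    if uncovered:
        # Predicted matches that no blocking rule covers yet.
        pool, weights = uncovered, [match_prob[p] for p in uncovered]
    elif predicted:
        # Any predicted match, weighted by match likelihood.
        pool, weights = predicted, [match_prob[p] for p in predicted]
    else:
        # No predicted matches: weight by disagreement between the learners.
        pool = list(pairs)
        weights = [abs(match_prob[p] - block_prob[p]) for p in pool]
        if sum(weights) == 0:
            weights = [1.0] * len(pool)  # fall back to uniform sampling

    return rng.choices(pool, weights=weights, k=1)[0]
```

The labelled pair would then be appended to the training data and both learners re-fit before the next call.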

If we were to implement something like this, I think it's worth ensuring this can be done using the current mismo API.

For example, I can see how the candidate pairs could be blocked using a UnionBlocker and appropriately defined ConditionBlockers. Similarly, the features for the MatchLearner could be generated using a set of LevelComparers, and the MatchLearner would then learn to predict p(match | comparisons).
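
To make the shape of that idea concrete, here is a library-free sketch in plain Python. It deliberately does not use mismo's classes, since I'm not certain of their exact signatures; `block_union`, `name_level`, the sample records and the condition names are all made up for illustration. The nested loop is quadratic and only stands in for what would really be a set of joins.

```python
from itertools import combinations

# Toy records (invented for the example).
records = [
    {"id": 1, "name": "Jon Smith", "zip": "90210"},
    {"id": 2, "name": "John Smith", "zip": "90210"},
    {"id": 3, "name": "Jane Doe", "zip": "10001"},
    {"id": 4, "name": "Jane Doe", "zip": "10002"},
]

# Each blocking condition is a predicate over a pair of records.
conditions = {
    "same_zip": lambda l, r: l["zip"] == r["zip"],
    "same_surname": lambda l, r: l["name"].split()[-1] == r["name"].split()[-1],
}


def block_union(records, conditions):
    """Union of all conditions: pair -> names of the conditions that fired."""
    blocked = {}
    for left, right in combinations(records, 2):
        fired = [name for name, cond in conditions.items() if cond(left, right)]
        if fired:
            blocked[(left["id"], right["id"])] = fired
    return blocked


def name_level(left, right):
    """Assign the first comparison level that applies, most specific first."""
    if left["name"] == right["name"]:
        return "exact"
    if left["name"].split()[-1] == right["name"].split()[-1]:
        return "same_surname"
    return "else"


by_id = {r["id"]: r for r in records}
candidates = block_union(records, conditions)
# One categorical feature per pair; a match classifier would be fit on these.
features = {
    pair: {"name": name_level(by_id[pair[0]], by_id[pair[1]])}
    for pair in candidates
}
```

Recording which conditions fired for each pair is what would let a learner later attribute labelled matches to individual blocking rules.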

It's not immediately clear to me how best to choose the blocking predicates based on labelled examples, but I expect that could be done by keeping track of which blocking rules cover each labelled pair. How efficient this is when we have many blockers remains to be seen.
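
One possible way to turn that bookkeeping into a choice of predicates is a greedy set cover over the labelled matches. The sketch below is an assumption about how this might work, not something taken from dedupe or mismo; `learn_blocking_rules`, the rule names and the labelled pairs are hypothetical.

```python
def learn_blocking_rules(coverage, matches, distinct):
    """Greedily pick rules until every labelled match is covered.

    coverage -- dict: rule name -> set of pairs that rule blocks together
    matches  -- set of pairs labelled as matches
    distinct -- set of pairs labelled as distinct
    Returns (chosen rules in order, labelled matches left uncovered).
    """
    remaining = set(matches)
    chosen = []

    def score(rule):
        pairs = coverage[rule]
        # Prefer more newly covered matches, then fewer known-distinct pairs.
        return (len(pairs & remaining), -len(pairs & distinct))

    while remaining:
        unused = [rule for rule in coverage if rule not in chosen]
        if not unused:
            break
        best = max(unused, key=score)
        if not coverage[best] & remaining:
            break  # no unused rule covers any remaining match
        chosen.append(best)
        remaining -= coverage[best]
    return chosen, remaining


coverage = {
    "same_zip": {(1, 2), (3, 4), (5, 6)},
    "same_surname": {(1, 2), (7, 8)},
    "same_first_initial": {(3, 4), (7, 8), (5, 6), (9, 10)},
}
labelled_matches = {(1, 2), (3, 4), (7, 8)}
labelled_distinct = {(5, 6), (9, 10)}

rules, uncovered = learn_blocking_rules(coverage, labelled_matches, labelled_distinct)
```

Each round rescans every unused rule, so the cost grows with the number of rules times the number of labelled pairs, which is where the efficiency question with many blockers would show up.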

I like the idea of leaving the predicates and features open to the user, as dedupe's approach of statically defining the predicates based on the data type makes it hard for me to understand where the performance bottlenecks are for a given matching/linking task.
