Add active learning API for predicate-based blocking and matching

The active learning method that `dedupe` implements to learn a minimal set of predicates for blocking records can be very useful, particularly when the user has little prior knowledge of an appropriate set of blocking rules. In my own usage, I've found that it can be tricky to balance recall and precision when blocking on a single feature, e.g. zip code. More complex predicates would most likely help reduce the number of candidate pairs, so having a semi-supervised method to learn these could reduce the manual work required.

In practice, I have found that `dedupe`'s implementation scales quite poorly: I frequently hit memory bottlenecks when blocking on datasets of more than 10k rows. I'm hoping that by using `duckdb` via `ibis` we may be able to do something more performant.

At a high level, the active learner works as follows:

- define a schema for the dataset, mapping columns to variable types (String, ShortString, Datetime, Integer, Text, LatLong etc.)
- define a BlockLearner that learns to classify the predicates that result in known matches
- define a MatchLearner that learns to classify matches based on features generated from record pairs
- create a set of candidate predicates for blocking using the schema
- generate initial candidates using the BlockLearner
- seed the labels by marking a random pair of records as distinct and a record paired with itself as a match, then fit the learners
- score the candidate pairs using the two learners
- present a new record pair to the user:
   - if there are pairs that the MatchLearner predicts are a match, but that are not covered by a blocking rule, choose one of these, weighted by the match likelihood
   - otherwise, if any matches are predicted, sample from the predicted matches, weighted by the match likelihood
   - otherwise, if there are no predicted matches, sample from the pairs, weighted by the disagreement between the classifiers
- the user labels the pair as match/distinct/not sure and the classifiers are re-fit using the additional labelled example
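
As a rough sketch of the pair-selection step above, something like the following could work. Everything here is hypothetical and not part of `dedupe` or `mismo`: the `choose_pair` function, the `match_prob` and `block_prob` score mappings (standing in for the MatchLearner and BlockLearner outputs), the `covered` set, and the 0.5 threshold. I've also assumed that "disagreement" means the absolute difference between the two scores, which may not be what `dedupe` does.

```python
import random
from typing import Hashable, Mapping, Sequence

Pair = tuple[Hashable, Hashable]


def choose_pair(
    pairs: Sequence[Pair],
    match_prob: Mapping[Pair, float],
    block_prob: Mapping[Pair, float],
    covered: set[Pair],
    rng: random.Random,
    threshold: float = 0.5,
) -> Pair:
    """Pick the next pair to show the user, following the three-tier rule."""
    predicted = [p for p in pairs if match_prob[p] >= threshold]
    uncovered = [p for p in predicted if p not in covered]

    if uncovered:
        # Tier 1: predicted matches that no blocking rule covers yet.
        pool = uncovered
        weights = [match_prob[p] for p in pool]
    elif predicted:
        # Tier 2: any predicted match, weighted by match likelihood.
        pool = predicted
        weights = [match_prob[p] for p in pool]
    else:
        # Tier 3: no predicted matches, so weight by learner disagreement
        # (assumed here to be the absolute difference between the scores).
        pool = list(pairs)
        weights = [abs(match_prob[p] - block_prob[p]) for p in pool]

    if sum(weights) == 0:
        # random.choices rejects all-zero weights, so fall back to uniform.
        weights = [1.0] * len(pool)

    return rng.choices(pool, weights=weights, k=1)[0]
```

Passing in the `random.Random` instance keeps the sampling reproducible, which should make labelling sessions easier to debug.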

If we were to implement something like this, I think it's worth ensuring this can be done using the current `mismo` API.

For example, I can see how the candidate pairs could be blocked using a `UnionBlocker` and appropriately defined `ConditionBlockers`. Similarly, the features for the MatchLearner could be generated using a set of `LevelComparers`; the MatchLearner would then learn to predict `p(match | comparisons)`.

It's not immediately clear to me how best to decide the predicates for blocking based on labelled examples, but I expect that could be done by keeping track of which blocking rules cover each labelled pair. How efficient this is when we have many blockers remains to be seen.
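
One possible way to do that bookkeeping is a greedy set cover over the labelled matches: record which predicates put each matched pair in the same block, then repeatedly pick the predicate that covers the most still-uncovered matches. The sketch below is plain Python and only illustrates the idea; `covers`, `learn_predicates`, and the example predicates are hypothetical names, not `mismo` or `dedupe` APIs. It also ignores the cost side (how many distinct pairs each predicate lets through), which a real implementation would need to account for.

```python
from typing import Callable, Mapping, Optional

Record = Mapping[str, str]
# A predicate maps a record to a blocking key, or None if it does not apply.
Predicate = Callable[[Record], Optional[str]]


def covers(pred: Predicate, left: Record, right: Record) -> bool:
    """True if the predicate puts both records in the same block."""
    key = pred(left)
    return key is not None and key == pred(right)


def learn_predicates(
    predicates: Mapping[str, Predicate],
    matches: list[tuple[Record, Record]],
) -> list[str]:
    """Greedily choose predicate names until every coverable match is covered."""
    # coverage[name] = indices of labelled matches that the predicate covers.
    coverage = {
        name: {i for i, (l, r) in enumerate(matches) if covers(pred, l, r)}
        for name, pred in predicates.items()
    }
    remaining = set(range(len(matches)))
    chosen: list[str] = []
    while remaining:
        best = max(coverage, key=lambda n: len(coverage[n] & remaining))
        gained = coverage[best] & remaining
        if not gained:
            break  # no predicate covers the matches that are left
        chosen.append(best)
        remaining -= gained
    return chosen


# Hypothetical user-defined predicates over a toy schema.
example_predicates = {
    "same_zip": lambda r: r.get("zip"),
    "name_prefix": lambda r: r["name"][:3].lower(),
}
example_matches = [
    ({"name": "Alice Smith", "zip": "10001"}, {"name": "Alicia Smith", "zip": "10001"}),
    ({"name": "Bob Jones", "zip": "20002"}, {"name": "Bob Jones", "zip": "20003"}),
    ({"name": "Carol Diaz", "zip": "30003"}, {"name": "Karol Diaz", "zip": "30003"}),
]
selected = learn_predicates(example_predicates, example_matches)
```

Here neither predicate covers all three matches on its own, so both end up selected. With many blockers the coverage table grows with predicates × labelled pairs, which is where the efficiency question would show up.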

I like the idea of leaving the predicates and features open to the user, as `dedupe`'s approach of statically defining the predicates based on the data type makes it hard for me to understand where the performance bottlenecks are for a given matching/linking task.



