Description
The Fellegi-Sunter (FS) model calculates weights by comparing the odds of a variable taking a given value among known matching pairs with the odds among randomly sampled pairs.
When this model is used to evaluate how likely candidate pairs are to match after blocking, the estimates can be biased, particularly if the variables are more similar between blocked pairs than between two records chosen at random.
For example, when blocking on postcode, two addresses are quite likely to be similar even if they are distinct (e.g. same street name, different street number). If this is not accounted for, the weights of an FS model could overestimate the importance of the street name being the same and lead to inaccurate matching odds.
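
To make the postcode example concrete, here is a minimal sketch on made-up data. It assumes the usual agreement weight `log2(m / u)`, where `m` is the street-agreement rate among matches and `u` the rate among non-matches. The records, the tuple layout and the helper `street_agreement_rates` are hypothetical, and all pairs are enumerated instead of sampled so the numbers are reproducible.

```python
from itertools import combinations
from math import log2

# Hypothetical toy records: (record_id, postcode, street).
# Rows sharing a record_id are assumed to be the same entity.
records = [
    (1, "AB1", "High St"),
    (1, "AB1", "High St"),
    (2, "AB1", "High St"),
    (3, "AB1", "Mill Ln"),
    (4, "CD2", "Park Rd"),
    (4, "CD2", "Park Rd"),
    (5, "CD2", "Park Rd"),
    (6, "CD2", "Oak Ave"),
]


def street_agreement_rates(pairs):
    """Return (m, u): street-agreement rate among matches and among non-matches."""
    match_agree = match_total = non_agree = non_total = 0
    for a, b in pairs:
        agree = a[2] == b[2]
        if a[0] == b[0]:  # same record_id: treated as a match
            match_total += 1
            match_agree += agree
        else:  # different record_id: treated as a non-match
            non_total += 1
            non_agree += agree
    return match_agree / match_total, non_agree / non_total


# Every possible pair stands in for "randomly sampled pairs".
all_pairs = list(combinations(records, 2))
# Blocked pairs: only pairs that share a postcode.
blocked_pairs = [(a, b) for a, b in all_pairs if a[1] == b[1]]

m_all, u_all = street_agreement_rates(all_pairs)
m_blocked, u_blocked = street_agreement_rates(blocked_pairs)

weight_all = log2(m_all / u_all)
weight_blocked = log2(m_blocked / u_blocked)

print(f"u from all pairs:     {u_all:.3f}, weight {weight_all:.2f}")
print(f"u from blocked pairs: {u_blocked:.3f}, weight {weight_blocked:.2f}")
```

In this toy data, non-matching pairs within a postcode agree on street name more often than non-matching pairs overall, so the weight estimated from all pairs is larger than the one estimated from blocked pairs.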
It would be useful to have a mechanism to sample only from blocked pairs when training an FS model, so that the sampled pairs have distributions closer to what would be expected of negative matches when the model is used to infer matches after blocking. With labelled known matches, if we sample from blocked pairs and assume that pairs that don't share a record_id are negative matches, the same dataset could also be used to train supervised classification models such as SVMs or boosted decision trees.
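
One possible shape for such a mechanism, as a rough sketch only: the function name `sample_blocked_pairs`, its parameters and the dict-based record layout are assumptions for illustration, not an existing API. It groups records by a blocking key, samples pairs within blocks, and labels a pair as a positive only when both records share a `record_id`.

```python
import random
from collections import defaultdict
from itertools import combinations


def sample_blocked_pairs(records, block_key, n_pairs, seed=0):
    """Sample up to n_pairs pairs sharing a blocking key; label 1 if record_ids match, else 0."""
    # Group records by the value of the blocking key.
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[block_key]].append(rec)

    # Enumerate candidate pairs within each block (quadratic per block,
    # so very large blocks would need a smarter sampler).
    candidates = [
        pair for members in blocks.values() for pair in combinations(members, 2)
    ]

    rng = random.Random(seed)
    sampled = rng.sample(candidates, min(n_pairs, len(candidates)))

    # Assumption: pairs that do not share a record_id are negative matches.
    return [(a, b, int(a["record_id"] == b["record_id"])) for a, b in sampled]


# Hypothetical toy records for illustration.
records = [
    {"record_id": 1, "postcode": "AB1", "street": "High St"},
    {"record_id": 1, "postcode": "AB1", "street": "High St"},
    {"record_id": 2, "postcode": "AB1", "street": "High St"},
    {"record_id": 3, "postcode": "CD2", "street": "Park Rd"},
    {"record_id": 3, "postcode": "CD2", "street": "Park Rd"},
    {"record_id": 4, "postcode": "CD2", "street": "Oak Ave"},
]

labelled = sample_blocked_pairs(records, "postcode", n_pairs=5)
for a, b, label in labelled:
    print(a["street"], "|", b["street"], "|", label)
```

The labelled pairs could then be turned into comparison features, either to estimate the non-match distributions for an FS model or as training rows for a supervised classifier.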