Add ability to sample from blocked pairs when training an FS model

The Fellegi-Sunter (FS) model calculates weights by comparing the odds of a variable taking a given value among known matching pairs with the odds among randomly sampled pairs.
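As a rough sketch of that comparison, the standard FS weights for a single variable can be written in terms of two probabilities: `m`, the chance the variable agrees among matching pairs, and `u`, the chance it agrees among sampled (assumed non-matching) pairs. The function name `fs_weights` is hypothetical and not part of this project's API.

```python
import math


def fs_weights(m: float, u: float) -> tuple[float, float]:
    """Return (agreement_weight, disagreement_weight) as log2 likelihood ratios.

    m: P(variable agrees | pair is a match)
    u: P(variable agrees | pair is not a match), usually estimated
       from sampled pairs
    """
    if not (0.0 < m < 1.0 and 0.0 < u < 1.0):
        raise ValueError("m and u must lie strictly between 0 and 1")
    agreement_weight = math.log2(m / u)
    disagreement_weight = math.log2((1.0 - m) / (1.0 - u))
    return agreement_weight, disagreement_weight
```

Because `u` sits in the denominator, the set of pairs it is estimated from directly controls how much weight an agreement receives.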

When this model is used to evaluate the likelihood that candidate pairs are matches after blocking, the estimates can be biased, particularly if the variables are more similar between blocked pairs than between two records chosen at random.

For example, if blocking on a postcode, it is quite likely that two addresses will be fairly similar, even if they are distinct (e.g. same street name, different street number). Without properly considering this, the weights of an FS model could over-estimate the importance of the street name being the same and lead to inaccurate matching odds.
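The size of this effect can be illustrated with a toy calculation. The pairs and the value of `m` below are invented purely for illustration, and `agreement_rate` is a hypothetical helper rather than anything in this codebase.

```python
import math


def agreement_rate(pairs, field):
    """Fraction of pairs whose two records agree exactly on `field`."""
    return sum(a[field] == b[field] for a, b in pairs) / len(pairs)


# Made-up non-matching pairs drawn at random from the whole dataset.
random_pairs = [
    ({"street": "High St"}, {"street": "Mill Ln"}),
    ({"street": "High St"}, {"street": "Oak Rd"}),
    ({"street": "Mill Ln"}, {"street": "Oak Rd"}),
    ({"street": "Oak Rd"}, {"street": "Oak Rd"}),
]
# Made-up non-matching pairs that share a postcode (i.e. blocked pairs).
blocked_pairs = [
    ({"street": "High St"}, {"street": "High St"}),
    ({"street": "High St"}, {"street": "High St"}),
    ({"street": "Mill Ln"}, {"street": "Mill Ln"}),
    ({"street": "Oak Rd"}, {"street": "Elm Cl"}),
]

m = 0.95  # assumed P(street agrees | match), chosen for illustration
u_random = agreement_rate(random_pairs, "street")
u_blocked = agreement_rate(blocked_pairs, "street")

# Agreement weights (log2 likelihood ratios) under each estimate of u.
weight_random = math.log2(m / u_random)
weight_blocked = math.log2(m / u_blocked)
```

In this toy data the street name agrees far more often among blocked pairs, so the weight computed from random pairs is larger than the one computed from blocked pairs, which is the over-estimation described above.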

It would be useful to have a mechanism to sample only from blocked pairs when training an FS model, so that the sampled pairs have distributions closer to those expected of negative matches when the model is used to infer matches after blocking.

When labelled known matches are available, sampling from blocked pairs and assuming that pairs that don't share a `record_id` are negative matches would also make it possible to use this dataset to train supervised classification models, such as SVMs or boosted decision trees.
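A minimal sketch of what such a mechanism might look like is given below, assuming records are plain dictionaries and blocking is an exact match on one key. The names `sample_blocked_pairs` and `label_pairs` are hypothetical, and enumerating every within-block pair is only practical for small blocks.

```python
import itertools
import random
from collections import defaultdict


def sample_blocked_pairs(records, block_key, n_pairs, seed=0):
    """Sample up to n_pairs record pairs that share the same blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record[block_key]].append(record)
    # Enumerate every within-block pair; fine for a sketch, costly at scale.
    candidates = [
        pair
        for members in blocks.values()
        for pair in itertools.combinations(members, 2)
    ]
    rng = random.Random(seed)
    return rng.sample(candidates, min(n_pairs, len(candidates)))


def label_pairs(pairs, id_key="record_id"):
    """Label a pair as a match only if both records share the same id.

    Assumption: pairs with different ids are treated as negative matches.
    """
    return [(a, b, a[id_key] == b[id_key]) for a, b in pairs]
```

The unlabelled output of `sample_blocked_pairs` could feed the FS sampling step, while the labelled output of `label_pairs` could serve as a training set for a supervised classifier.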
