-
Notifications
You must be signed in to change notification settings - Fork 4
Description
This is my braindump for figuring out how to solve what I am currently calling "PreLinking": Sometimes, your records already have a ground truth label, such as a social security number when working with people. We should try to leverage this info to reduce the amount of blocking/comparing we need to do.
Related to this feature request I made in splink 2 years ago.
This is a complex feature, because PreLinking touches many stages of the overall linking process:
- affects which records are even included in blocking
- affects the rules that go into blocking
- after the uncertain records go through the regular blocking and comparing process, those results need to be merged back into the "shortcircuited" PreLink results to get our final Linkage:
flowchart LR
all[all records] --> definite
all[all records] --> undefinite --> blocked --> compared
definite & compared --> result
Link
The "dirty-dirty" ER task, where each record in left can link to 0-N records in right, and each record in right can link to 0-M records in left.
- We know immediately (shown with thick lines) that any records where
left.label == right.labelis a definite match. We don't need to do any more blocking/comparing/etc on these links. - Any record with a label of NULL needs to be compared to every record in the other table, even to those that HAVE labels.
- As a consequence of the above, we cannot drop any of the records from left or right for the blocking/comparing/etc phases
---
config:
look: classic
layout: elk
elk:
mergeEdges: true
---
%% nodePlacementStrategy: SIMPLE
%% nodePlacementStrategy: LINEAR_SEGMENTS
%% nodePlacementStrategy: NETWORK_SIMPLEX
%% nodePlacementStrategy: BRANDES_KOEPF
flowchart LR
subgraph h["Haystack"]
nullh0(("∅"))
nullh1(("∅"))
Ah0(("A"))
Bh0(("B"))
end
subgraph n["Needle"]
nulln0(("∅"))
nulln1(("∅"))
An0(("A"))
Bn0(("B"))
Bn1(("B"))
Cn0(("C"))
end
%% nullh0 ~~~ nullh1 ~~~ Ah0 ~~~ Bh0
%% nulln0 ~~~ nulln1 ~~~ An0 ~~~ Bn0 ~~~ Bn1 ~~~ Cn0
Ah0 === An0
Bh0 === Bn0 & Bn1
nullh0 -.- nulln1 & nulln0 & An0 & Bn0 & Bn1 & Cn0
nullh1 -.- nulln1 & nulln0 & An0 & Bn0 & Bn1 & Cn0
Ah0 -.- nulln1 & nulln0
Bh0 -.- nulln1 & nulln0
nullh0:::Ash
nullh1:::Ash
%% nullh2:::Ash
Ah0:::Peach
Bh0:::Aqua
nulln0:::Ash
nulln1:::Ash
An0:::Peach
Bn0:::Aqua
Bn1:::Aqua
Cn0:::Rose
classDef Aqua stroke-width:1px, stroke-dasharray:none, stroke:#46EDC8, fill:#DEFFF8, color:#378E7A
classDef Rose stroke-width:1px, stroke-dasharray:none, stroke:#FF5978, fill:#FFDFE5, color:#8E2236
classDef Peach stroke-width:1px, stroke-dasharray:none, stroke:#FBB35A, fill:#FFEFDB, color:#8F632D
classDef Ash stroke-width:1px, stroke-dasharray:none, stroke:#999999, fill:#EEEEEE, color:#000000
Lookup
AK "clean-dirty ER", where each record in haystack can match 0-N records in needle, but each record in needle can match only 0 or 1 records in haystack.
- We know immediately (shown with thick lines) that any records where
haystack.label == needle.labelis a definite match. We don't need to do any more blocking or comparing on them. This is similar to the Link task.
In addition to the Link task, we can add another constraint:
- As a result of the above, any records in needle where
needle.label.isin(haystack.label)we don't need to consider further during the blocking/comparing/etc process: we already found the matching record in haystack! An example of this is records A and B in the needle.
---
config:
look: classic
layout: elk
elk:
mergeEdges: true
---
%% nodePlacementStrategy: SIMPLE
%% nodePlacementStrategy: LINEAR_SEGMENTS
%% nodePlacementStrategy: NETWORK_SIMPLEX
%% nodePlacementStrategy: BRANDES_KOEPF
flowchart LR
subgraph h["Haystack"]
nullh0(("∅"))
nullh1(("∅"))
Ah0(("A"))
Bh0(("B"))
end
subgraph n["Needle"]
nulln0(("∅"))
nulln1(("∅"))
An0(("A"))
Bn0(("B"))
Bn1(("B"))
Cn0(("C"))
end
%% nullh0 ~~~ nullh1 ~~~ Ah0 ~~~ Bh0
%% nulln0 ~~~ nulln1 ~~~ An0 ~~~ Bn0 ~~~ Bn1 ~~~ Cn0
nullh0 -.- nulln1 & nulln0 & Cn0
nullh1 -.- nulln1 & nulln0 & Cn0
Ah0 === An0
Bh0 === Bn0 & Bn1
Ah0 -.- nulln1 & nulln0
Bh0 -.- nulln1 & nulln0
nullh0:::Ash
nullh1:::Ash
%% nullh2:::Ash
Ah0:::Peach
Bh0:::Aqua
nulln0:::Ash
nulln1:::Ash
An0:::Peach
Bn0:::Aqua
Bn1:::Aqua
Cn0:::Rose
classDef Aqua stroke-width:1px, stroke-dasharray:none, stroke:#46EDC8, fill:#DEFFF8, color:#378E7A
classDef Rose stroke-width:1px, stroke-dasharray:none, stroke:#FF5978, fill:#FFDFE5, color:#8E2236
classDef Peach stroke-width:1px, stroke-dasharray:none, stroke:#FBB35A, fill:#FFEFDB, color:#8F632D
classDef Ash stroke-width:1px, stroke-dasharray:none, stroke:#999999, fill:#EEEEEE, color:#000000
Dedupe
In a dedupe task, the links you are unsure about are ones where one or both of the labels are NULL. Where the labels are equal, that means its a definite link.
- We know immediately (shown with thick lines) that any records where
left.label == right.labelis a definite match. We don't need to do any more blocking/comparing/etc on these links. - Any record with a label of NULL needs to be compared to every record in the table, even to those that HAVE labels.
- As a consequence of the above, we cannot drop any of the records for the blocking/comparing/etc phases
---
config:
look: classic
---
flowchart TD
null0(("∅"))
null1(("∅"))
A0(("A"))
B0(("B"))
B1(("B"))
B2(("B"))
%% nulln0 ~~~ nulln1 ~~~ An0 ~~~ Bn0 ~~~ Bn1 ~~~ Cn0
%% null0 ~~~ null1 ~~~ A0 ~~~ B0 ~~~ B1 ~~~ B2
null0 -.- null1
null0 -.- A0 & B0 & B1 & B2
null1 -.- A0 & B0 & B1 & B2
B0 === B1
B0 === B2
B1 === B2
A0:::Peach
B0:::Aqua
null0:::Ash
null1:::Ash
A0:::Peach
B0:::Aqua
B1:::Aqua
B2:::Aqua
%% C0:::Rose
classDef Aqua stroke-width:1px, stroke-dasharray:none, stroke:#46EDC8, fill:#DEFFF8, color:#378E7A
classDef Rose stroke-width:1px, stroke-dasharray:none, stroke:#FF5978, fill:#FFDFE5, color:#8E2236
classDef Peach stroke-width:1px, stroke-dasharray:none, stroke:#FBB35A, fill:#FFEFDB, color:#8F632D
classDef Ash stroke-width:1px, stroke-dasharray:none, stroke:#999999, fill:#EEEEEE, color:#000000