Skip to content

PreLinker semantics #77

@NickCrews

Description

@NickCrews

This is my braindump for figuring out how to solve what I am currently calling "PreLinking": Sometimes, your records already have a ground truth label, such as a social security number when working with people. We should try to leverage this info to reduce the amount of blocking/comparing we need to do.

Related to this feature request I made in splink 2 years ago.

This is a complex feature, because PreLinking touches many stages of the overall linking process:

  • affects which records are even included in blocking
  • affects the rules that go into blocking
  • after the uncertain records go through the regular blocking and comparing process, those results need to be merged back into the "shortcircuited" PreLink results to get our final Linkage:
flowchart LR
    all[all records] --> definite
    all[all records] --> undefinite --> blocked --> compared
    definite & compared --> result
Loading

Link

The "dirty-dirty" ER task, where each record in left can link to 0-N records in right, and each record in right can link to 0-M records in left.

  • We know immediately (shown with thick lines) that any records where left.label == right.label is a definite match. We don't need to do any more blocking/comparing/etc on these links.
  • Any record with a label of NULL needs to be compared to every record in the other table, even to those that HAVE labels.
  • As a consequence of the above, we cannot drop any of the records from left or right for the blocking/comparing/etc phases
---
config:
  look: classic
  layout: elk
  elk:
    mergeEdges: true
---
    %% nodePlacementStrategy: SIMPLE
    %% nodePlacementStrategy: LINEAR_SEGMENTS
    %% nodePlacementStrategy: NETWORK_SIMPLEX
    %% nodePlacementStrategy: BRANDES_KOEPF
flowchart LR
 subgraph h["Haystack"]
        nullh0(("∅"))
        nullh1(("∅"))
        Ah0(("A"))
        Bh0(("B"))
  end
 subgraph n["Needle"]
        nulln0(("∅"))
        nulln1(("∅"))
        An0(("A"))
        Bn0(("B"))
        Bn1(("B"))
        Cn0(("C"))
  end
    %% nullh0 ~~~ nullh1 ~~~ Ah0 ~~~ Bh0
    %% nulln0 ~~~ nulln1 ~~~ An0 ~~~ Bn0 ~~~ Bn1 ~~~ Cn0
    Ah0 === An0
    Bh0 === Bn0 & Bn1
    nullh0 -.- nulln1 & nulln0 & An0 & Bn0 & Bn1 & Cn0
    nullh1 -.- nulln1 & nulln0 & An0 & Bn0 & Bn1 & Cn0
    Ah0 -.- nulln1 & nulln0
    Bh0 -.- nulln1 & nulln0
     nullh0:::Ash
     nullh1:::Ash
    %%  nullh2:::Ash
     Ah0:::Peach
     Bh0:::Aqua
     nulln0:::Ash
     nulln1:::Ash
    
     An0:::Peach
     Bn0:::Aqua
     Bn1:::Aqua
     Cn0:::Rose
    classDef Aqua stroke-width:1px, stroke-dasharray:none, stroke:#46EDC8, fill:#DEFFF8, color:#378E7A
    classDef Rose stroke-width:1px, stroke-dasharray:none, stroke:#FF5978, fill:#FFDFE5, color:#8E2236
    classDef Peach stroke-width:1px, stroke-dasharray:none, stroke:#FBB35A, fill:#FFEFDB, color:#8F632D
    classDef Ash stroke-width:1px, stroke-dasharray:none, stroke:#999999, fill:#EEEEEE, color:#000000
Loading

Lookup

AK "clean-dirty ER", where each record in haystack can match 0-N records in needle, but each record in needle can match only 0 or 1 records in haystack.

  • We know immediately (shown with thick lines) that any records where haystack.label == needle.label is a definite match. We don't need to do any more blocking or comparing on them. This is similar to the Link task.

In addition to the Link task, we can add another constraint:

  • As a result of the above, any records in needle where needle.label.isin(haystack.label) we don't need to consider further during the blocking/comparing/etc process: we already found the matching record in haystack! An example of this is records A and B in the needle.
---
config:
  look: classic
  layout: elk
  elk:
    mergeEdges: true
---
    %% nodePlacementStrategy: SIMPLE
    %% nodePlacementStrategy: LINEAR_SEGMENTS
    %% nodePlacementStrategy: NETWORK_SIMPLEX
    %% nodePlacementStrategy: BRANDES_KOEPF
flowchart LR
 subgraph h["Haystack"]
        nullh0(("∅"))
        nullh1(("∅"))
        Ah0(("A"))
        Bh0(("B"))
  end
 subgraph n["Needle"]
        nulln0(("∅"))
        nulln1(("∅"))
        An0(("A"))
        Bn0(("B"))
        Bn1(("B"))
        Cn0(("C"))
  end
    %% nullh0 ~~~ nullh1 ~~~ Ah0 ~~~ Bh0
    %% nulln0 ~~~ nulln1 ~~~ An0 ~~~ Bn0 ~~~ Bn1 ~~~ Cn0
    nullh0 -.- nulln1 & nulln0 & Cn0
    nullh1 -.- nulln1 & nulln0 & Cn0
    Ah0 === An0
    Bh0 === Bn0 & Bn1
    Ah0 -.- nulln1 & nulln0
    Bh0 -.- nulln1 & nulln0
     nullh0:::Ash
     nullh1:::Ash
    %%  nullh2:::Ash
     Ah0:::Peach
     Bh0:::Aqua
     nulln0:::Ash
     nulln1:::Ash
    
     An0:::Peach
     Bn0:::Aqua
     Bn1:::Aqua
     Cn0:::Rose
    classDef Aqua stroke-width:1px, stroke-dasharray:none, stroke:#46EDC8, fill:#DEFFF8, color:#378E7A
    classDef Rose stroke-width:1px, stroke-dasharray:none, stroke:#FF5978, fill:#FFDFE5, color:#8E2236
    classDef Peach stroke-width:1px, stroke-dasharray:none, stroke:#FBB35A, fill:#FFEFDB, color:#8F632D
    classDef Ash stroke-width:1px, stroke-dasharray:none, stroke:#999999, fill:#EEEEEE, color:#000000
Loading

Dedupe

In a dedupe task, the links you are unsure about are ones where one or both of the labels are NULL. Where the labels are equal, that means its a definite link.

  • We know immediately (shown with thick lines) that any records where left.label == right.label is a definite match. We don't need to do any more blocking/comparing/etc on these links.
  • Any record with a label of NULL needs to be compared to every record in the table, even to those that HAVE labels.
  • As a consequence of the above, we cannot drop any of the records for the blocking/comparing/etc phases
---
config:
  look: classic
---
flowchart TD
    null0(("∅"))
    null1(("∅"))
    A0(("A"))
    B0(("B"))
    B1(("B"))
    B2(("B"))
    %% nulln0 ~~~ nulln1 ~~~ An0 ~~~ Bn0 ~~~ Bn1 ~~~ Cn0
    %% null0 ~~~ null1 ~~~ A0 ~~~ B0 ~~~ B1 ~~~ B2
    null0 -.- null1
    null0 -.- A0 & B0 & B1 & B2
    null1 -.- A0 & B0 & B1 & B2
    B0 === B1
    B0 === B2
    B1 === B2
     A0:::Peach
     B0:::Aqua
     null0:::Ash
     null1:::Ash
     A0:::Peach
     B0:::Aqua
     B1:::Aqua
     B2:::Aqua
    %%  C0:::Rose
    classDef Aqua stroke-width:1px, stroke-dasharray:none, stroke:#46EDC8, fill:#DEFFF8, color:#378E7A
    classDef Rose stroke-width:1px, stroke-dasharray:none, stroke:#FF5978, fill:#FFDFE5, color:#8E2236
    classDef Peach stroke-width:1px, stroke-dasharray:none, stroke:#FBB35A, fill:#FFEFDB, color:#8F632D
    classDef Ash stroke-width:1px, stroke-dasharray:none, stroke:#999999, fill:#EEEEEE, color:#000000
Loading

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions