PreLinker semantics

This is my braindump for figuring out how to solve what I am currently calling "PreLinking": Sometimes, your records already have a ground truth label, such as a social security number when working with people. We should try to leverage this info to reduce the amount of blocking/comparing we need to do. 

Related to [this feature request I made in splink 2 years ago](https://github.com/moj-analytical-services/splink/issues/1231).

This is a complex feature, because PreLinking touches many stages of the overall linking process:
- affects which records are even included in blocking
- affects the rules that go into blocking
- after the uncertain records go through the regular blocking and comparing process, those results need to be merged back into the "shortcircuited" PreLink results to get our final Linkage:

```mermaid
flowchart LR
    all[all records] --> definite
    all[all records] --> undefinite --> blocked --> compared
    definite & compared --> result
```

## Link

The "dirty-dirty" ER task, where each record in left can link to 0-N records in right, and each record in right can link to 0-M records in left.

- We know immediately (shown with thick lines) that any records where `left.label == right.label` is a definite match. We don't need to do any more blocking/comparing/etc on these links.
- Any record with a label of NULL needs to be compared to every record in the other table, even to those that HAVE labels. 
- As a consequence of the above, we cannot drop any of the records from left or right for the blocking/comparing/etc phases

```mermaid
---
config:
  look: classic
  layout: elk
  elk:
    mergeEdges: true
---
    %% nodePlacementStrategy: SIMPLE
    %% nodePlacementStrategy: LINEAR_SEGMENTS
    %% nodePlacementStrategy: NETWORK_SIMPLEX
    %% nodePlacementStrategy: BRANDES_KOEPF
flowchart LR
 subgraph h["Haystack"]
        nullh0(("∅"))
        nullh1(("∅"))
        Ah0(("A"))
        Bh0(("B"))
  end
 subgraph n["Needle"]
        nulln0(("∅"))
        nulln1(("∅"))
        An0(("A"))
        Bn0(("B"))
        Bn1(("B"))
        Cn0(("C"))
  end
    %% nullh0 ~~~ nullh1 ~~~ Ah0 ~~~ Bh0
    %% nulln0 ~~~ nulln1 ~~~ An0 ~~~ Bn0 ~~~ Bn1 ~~~ Cn0
    Ah0 === An0
    Bh0 === Bn0 & Bn1
    nullh0 -.- nulln1 & nulln0 & An0 & Bn0 & Bn1 & Cn0
    nullh1 -.- nulln1 & nulln0 & An0 & Bn0 & Bn1 & Cn0
    Ah0 -.- nulln1 & nulln0
    Bh0 -.- nulln1 & nulln0
     nullh0:::Ash
     nullh1:::Ash
    %%  nullh2:::Ash
     Ah0:::Peach
     Bh0:::Aqua
     nulln0:::Ash
     nulln1:::Ash
    
     An0:::Peach
     Bn0:::Aqua
     Bn1:::Aqua
     Cn0:::Rose
    classDef Aqua stroke-width:1px, stroke-dasharray:none, stroke:#46EDC8, fill:#DEFFF8, color:#378E7A
    classDef Rose stroke-width:1px, stroke-dasharray:none, stroke:#FF5978, fill:#FFDFE5, color:#8E2236
    classDef Peach stroke-width:1px, stroke-dasharray:none, stroke:#FBB35A, fill:#FFEFDB, color:#8F632D
    classDef Ash stroke-width:1px, stroke-dasharray:none, stroke:#999999, fill:#EEEEEE, color:#000000
```

## Lookup

AK "clean-dirty ER", where each record in haystack can match 0-N records in needle, but each record in needle can match only 0 or 1 records in haystack.

- We know immediately (shown with thick lines) that any records where `haystack.label == needle.label` is a definite match. We don't need to do any more blocking or comparing on them. This is similar to the Link task.

In addition to the Link task, we can add another constraint:
 
- As a result of the above, any records in needle where `needle.label.isin(haystack.label)` we don't need to consider further during the blocking/comparing/etc process: we already found the matching record in haystack! An example of this is records A and B in the needle.

```mermaid
---
config:
  look: classic
  layout: elk
  elk:
    mergeEdges: true
---
    %% nodePlacementStrategy: SIMPLE
    %% nodePlacementStrategy: LINEAR_SEGMENTS
    %% nodePlacementStrategy: NETWORK_SIMPLEX
    %% nodePlacementStrategy: BRANDES_KOEPF
flowchart LR
 subgraph h["Haystack"]
        nullh0(("∅"))
        nullh1(("∅"))
        Ah0(("A"))
        Bh0(("B"))
  end
 subgraph n["Needle"]
        nulln0(("∅"))
        nulln1(("∅"))
        An0(("A"))
        Bn0(("B"))
        Bn1(("B"))
        Cn0(("C"))
  end
    %% nullh0 ~~~ nullh1 ~~~ Ah0 ~~~ Bh0
    %% nulln0 ~~~ nulln1 ~~~ An0 ~~~ Bn0 ~~~ Bn1 ~~~ Cn0
    nullh0 -.- nulln1 & nulln0 & Cn0
    nullh1 -.- nulln1 & nulln0 & Cn0
    Ah0 === An0
    Bh0 === Bn0 & Bn1
    Ah0 -.- nulln1 & nulln0
    Bh0 -.- nulln1 & nulln0
     nullh0:::Ash
     nullh1:::Ash
    %%  nullh2:::Ash
     Ah0:::Peach
     Bh0:::Aqua
     nulln0:::Ash
     nulln1:::Ash
    
     An0:::Peach
     Bn0:::Aqua
     Bn1:::Aqua
     Cn0:::Rose
    classDef Aqua stroke-width:1px, stroke-dasharray:none, stroke:#46EDC8, fill:#DEFFF8, color:#378E7A
    classDef Rose stroke-width:1px, stroke-dasharray:none, stroke:#FF5978, fill:#FFDFE5, color:#8E2236
    classDef Peach stroke-width:1px, stroke-dasharray:none, stroke:#FBB35A, fill:#FFEFDB, color:#8F632D
    classDef Ash stroke-width:1px, stroke-dasharray:none, stroke:#999999, fill:#EEEEEE, color:#000000
```

## Dedupe

In a dedupe task, the links you are unsure about are ones where one or both of the labels are NULL. Where the labels are equal, that means its a definite link.

- We know immediately (shown with thick lines) that any records where `left.label == right.label` is a definite match. We don't need to do any more blocking/comparing/etc on these links.
- Any record with a label of NULL needs to be compared to every record in the table, even to those that HAVE labels. 
- As a consequence of the above, we cannot drop any of the records for the blocking/comparing/etc phases

```mermaid
---
config:
  look: classic
---
flowchart TD
    null0(("∅"))
    null1(("∅"))
    A0(("A"))
    B0(("B"))
    B1(("B"))
    B2(("B"))
    %% nulln0 ~~~ nulln1 ~~~ An0 ~~~ Bn0 ~~~ Bn1 ~~~ Cn0
    %% null0 ~~~ null1 ~~~ A0 ~~~ B0 ~~~ B1 ~~~ B2
    null0 -.- null1
    null0 -.- A0 & B0 & B1 & B2
    null1 -.- A0 & B0 & B1 & B2
    B0 === B1
    B0 === B2
    B1 === B2
     A0:::Peach
     B0:::Aqua
     null0:::Ash
     null1:::Ash
     A0:::Peach
     B0:::Aqua
     B1:::Aqua
     B2:::Aqua
    %%  C0:::Rose
    classDef Aqua stroke-width:1px, stroke-dasharray:none, stroke:#46EDC8, fill:#DEFFF8, color:#378E7A
    classDef Rose stroke-width:1px, stroke-dasharray:none, stroke:#FF5978, fill:#FFDFE5, color:#8E2236
    classDef Peach stroke-width:1px, stroke-dasharray:none, stroke:#FBB35A, fill:#FFEFDB, color:#8F632D
    classDef Ash stroke-width:1px, stroke-dasharray:none, stroke:#999999, fill:#EEEEEE, color:#000000
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PreLinker semantics #77

Link

Lookup

Dedupe

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

PreLinker semantics #77

Description

Link

Lookup

Dedupe

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions