High odds in Fellegi-Sunter model after expectation maximization

I am playing with mismo to deduplicate postal addresses in a set of about 10k entries.
After the expectation-maximization step, half of the record pairs have odds equal to `10_000_000_000`, which makes it quite hard to choose a threshold that separates true matches from false ones.

As I cannot share the underlying dataset, I will try to completely avoid sharing my (messy) code, and rather describe what I am doing. Hopefully it will be enough to get some hint from you.

**Deduplication Steps**

  0) Each record has the following fields:
    - `record_id`
    - `recipients`: a string with the name of the recipient
    - `recipients_metaphone`: the result of `double_metaphone()` on the field `recipients`
    - `recipients_tokens`: a sequence of tokens obtained by splitting the `recipients` field using white spaces.
    - `address_lines`: a string
    - `address_lines_tokens`: a sequence of tokens obtained by splitting the `address_lines` field using white spaces and discarding terms that appear in more than 5% of the dataset (using `mismo.arrays.array_filter_isin_other()` and `mismo.sets.rare_terms()`).
    - `full_address`: the whole address (not including the recipient's name)
    - `libpostal_address`: the address parsed by pypostal using `mismo.lib.geo.postal_parse_address()`
    - `libpostal_fingerprint`: the list of fingerprints returned by pypostal using `mismo.lib.geo.postal_fingerprint_address()`
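
To make the token filtering for `address_lines_tokens` concrete, here is a rough pure-Python sketch of the idea. It is not my actual code (which uses the mismo helpers mentioned above); `filter_common_tokens` is a made-up name, and it assumes that "5% of the dataset" means 5% of the records.

```python
from collections import Counter


def filter_common_tokens(token_lists, max_doc_fraction=0.05):
    """Drop tokens that occur in more than `max_doc_fraction` of the records."""
    doc_freq = Counter()
    for tokens in token_lists:
        # Count each token at most once per record (document frequency).
        doc_freq.update(set(tokens))
    max_count = max_doc_fraction * len(token_lists)
    return [
        [tok for tok in tokens if doc_freq[tok] <= max_count]
        for tokens in token_lists
    ]


# Toy data: "street" appears in every record, the other tokens only once.
records = [f"{i} street name{i}".split() for i in range(40)]
filtered = filter_common_tokens(records)
```

With the toy data above, the ubiquitous token `street` is removed and the record-specific tokens are kept.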
  
  1) Blocking using the following rules:
```python
[
    mismo.block.KeyBlocker("recipients", name="Recipients Exact"),
    mismo.block.KeyBlocker("libpostal_fingerprint", name="Fingerprint"),
    mismo.block.KeyBlocker("recipients_metaphone", name="Recipients phonetic"),
]
```

The corresponding UpSet chart is as follows:

![blockers-upset](https://github.com/user-attachments/assets/6996017b-e6a4-402e-aade-d5db770c183a)

  2) Use the following comparators:
    - on `postal_code` field with levels: EXACT, MAJOR (same first two digits), ELSE 
    - on `recipients_tokens` with the `jaccard` function and levels: JACCARD_50 (>=0.5), JACCARD_25 (>=0.25), JACCARD_10 (>=0.1), JACCARD_02 (>=0.02), ELSE
    - on `recipients_metaphone` with the `jaccard` function and levels: EXACT (jaccard >= 0.50), MAYBE (jaccard >= 0.2), ELSE
    - on `address_lines_tokens` with the `jaccard` function and levels: JACCARD_50 (>=0.5), JACCARD_25 (>=0.25), JACCARD_10 (>=0.1), JACCARD_02 (>=0.02), ELSE
    - on `libpostal_fingerprint` with levels: AT_LEAST_ONE, ELSE
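
In plain Python, the level assignment for the Jaccard-based comparators on the token fields would look roughly like the sketch below. This only illustrates the thresholds listed above and is not the mismo code I am running; `jaccard` and `jaccard_level` are hypothetical helper names, and treating two empty token lists as similarity 0 is an assumption.

```python
def jaccard(a, b):
    """Jaccard similarity of two token collections: |A & B| / |A | B|."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty token lists do not overlap
    return len(set_a & set_b) / len(union)


# Thresholds are checked from the strictest level down to the loosest.
LEVELS = [
    ("JACCARD_50", 0.5),
    ("JACCARD_25", 0.25),
    ("JACCARD_10", 0.1),
    ("JACCARD_02", 0.02),
]


def jaccard_level(a, b):
    """Return the name of the first level whose threshold is met."""
    score = jaccard(a, b)
    for name, threshold in LEVELS:
        if score >= threshold:
            return name
    return "ELSE"
```

For example, `["john", "smith"]` against `["john", "smith", "jr"]` shares 2 of 3 distinct tokens and lands in JACCARD_50, while two records with no token in common fall through to ELSE.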

The weights after running `mismo.fs.train_using_em(comparers, t, t, max_pairs=1_000_000)` are as follows:

![comparers](https://github.com/user-attachments/assets/980bea4f-7f14-4a2e-966e-b6d95e915391)

I am not an expert on the Fellegi-Sunter model, but I suspect that there shouldn't be levels where both proportions of pairs are high (e.g. the EXACT level for Postal Code and the ELSE level for 'Recipients Metaphone').
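
For reference, this is how I understand the scoring in the Fellegi-Sunter model: every level has an m probability (how often it occurs among true matches) and a u probability (how often it occurs among non-matches), and the odds of a pair are the prior odds multiplied by the m/u ratio of each observed level. The sketch below uses made-up numbers, not the values from my model, and `pair_odds` is a hypothetical helper, not a mismo function.

```python
def pair_odds(level_probs, prior_odds=1.0):
    """Multiply the prior odds by the m/u ratio of each observed level."""
    odds = prior_odds
    for m, u in level_probs:
        odds *= m / u
    return odds


# Made-up (m, u) values for three comparers; not taken from my model.
informative = [(0.90, 0.01), (0.80, 0.02), (0.95, 0.001)]
print(pair_odds(informative))  # roughly 3.4 million

# A level that is frequent among matches AND non-matches has m/u close to 1,
# so it barely moves the odds in either direction.
uninformative = [(0.90, 0.85)]
print(pair_odds(uninformative))  # close to 1
```

If this reading is right, a level where both proportions are high contributes almost nothing by itself, and very large odds come from levels whose u probability is estimated close to zero.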

  3) Scoring the pairs leads to:

![match-levels](https://github.com/user-attachments/assets/f337266f-9a5a-4195-b781-dda5bf48d5a6)
![match-weights2](https://github.com/user-attachments/assets/7769cfa4-7fc7-488c-9da3-9a80e7da1dc3)

As you can see, the pairs in the left half of the chart all have odds equal to `10_000_000_000`.
Also, the smallest value is `1819`, which is way above the "expected" (?) range between `0.01` and `100`.

Am I doing something obviously wrong?

P.S.: it seems that darker cells in the match levels chart correspond to the highest match levels. Isn't it a bit counterintuitive?

