High odds in Fellegi-Sunter model after expectation maximization #57

@lmores

I am playing with mismo to deduplicate postal addresses in a set of about 10k entries.
After the expectation-maximization step, the odds of half of the record pairs are equal to 10_000_000_000, which makes it hard to choose a threshold that separates true matches from false ones.

As I cannot share the underlying dataset, I will avoid sharing my (messy) code and instead describe what I am doing. Hopefully that will be enough to get some hints from you.

Deduplication Steps

  1. Each record has the following fields:
    - record_id
    - recipients: a string with the name of the recipient
    - recipients_metaphone: the result of double_metaphone() on the field recipients
    - recipients_tokens: a sequence of tokens obtained by splitting the recipients field using white spaces.
    - address_lines: a string
    - address_lines_tokens: a sequence of tokens obtained by splitting the address_lines field using white spaces and discarding terms that appear in more than 5% of the dataset (using mismo.arrays.array_filter_isin_other() and mismo.sets.rare_terms()).
    - full_address: the whole address (not including the recipient's name)
    - libpostal_address: the address parsed by pypostal using mismo.lib.geo.postal_parse_address()
    - libpostal_fingerprint: the list of fingerprints returned by pypostal using mismo.lib.geo.postal_fingerprint_address()
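
For readers who want to reproduce the token filtering without mismo, here is a minimal plain-Python sketch of the idea behind address_lines_tokens. The helper names `tokenize` and `drop_common_terms` are hypothetical, and counting each term once per record is an assumption; it is not necessarily how mismo.sets.rare_terms() counts.

```python
from collections import Counter


def tokenize(text):
    """Split a string on whitespace, as described for the *_tokens fields."""
    return text.split()


def drop_common_terms(token_lists, max_fraction=0.05):
    """Remove tokens that occur in more than max_fraction of the records."""
    n_records = len(token_lists)
    # Count each token once per record (document frequency); this is an assumption.
    doc_freq = Counter(tok for tokens in token_lists for tok in set(tokens))
    limit = max_fraction * n_records
    return [[t for t in tokens if doc_freq[t] <= limit] for tokens in token_lists]


# Toy data: "street" appears in every record, so it is discarded.
addresses = [f"{i} street name{i}" for i in range(40)]
filtered = drop_common_terms([tokenize(a) for a in addresses])
print(filtered[0])
```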

  2. Blocking using the following rules:

[
    mismo.block.KeyBlocker("recipients", name="Recipients Exact"),
    mismo.block.KeyBlocker("libpostal_fingerprint", name="Fingerprint"),
    mismo.block.KeyBlocker("recipients_metaphone", name="Recipients phonetic"),
]
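
To make the effect of these rules concrete, here is a rough plain-Python sketch of key blocking: two records become a candidate pair when they share a key value, and the final candidate set is the union over all rules. The function `key_block` is hypothetical, and treating a list-valued field such as libpostal_fingerprint as "share at least one element" is an assumption, not confirmed mismo behaviour.

```python
from collections import defaultdict
from itertools import combinations


def key_block(records, key):
    """Return the set of id pairs whose `key` values share at least one element."""
    buckets = defaultdict(set)
    for rec in records:
        values = rec[key]
        # A plain string is one key; a list contributes each of its elements.
        if isinstance(values, str):
            values = [values]
        for value in values:
            buckets[value].add(rec["record_id"])
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Made-up records, only for illustration.
records = [
    {"record_id": 1, "recipients": "ACME SRL", "libpostal_fingerprint": ["a", "b"]},
    {"record_id": 2, "recipients": "ACME SRL", "libpostal_fingerprint": ["c"]},
    {"record_id": 3, "recipients": "ACME S.R.L.", "libpostal_fingerprint": ["b"]},
]
candidates = key_block(records, "recipients") | key_block(records, "libpostal_fingerprint")
print(sorted(candidates))
```

Because the rules are combined by union, every pair that reaches the comparison step already agrees on at least one strong key, which is worth keeping in mind when reading the trained weights.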

The corresponding UpSet chart is as follows:

[image: blockers-upset]

  3. Use the following comparators:
    - on the postal_code field with levels: EXACT, MAJOR (same first two digits), ELSE
    - on recipients_tokens with the jaccard function and levels: JACCARD_50 (>=0.5), JACCARD_25 (>=0.25), JACCARD_10 (>=0.1), JACCARD_02 (>=0.02), ELSE
    - on recipients_metaphone with the jaccard function and levels: EXACT (jaccard >= 0.50), MAYBE (jaccard >= 0.2), ELSE
    - on address_lines_tokens with the jaccard function and levels: JACCARD_50 (>=0.5), JACCARD_25 (>=0.25), JACCARD_10 (>=0.1), JACCARD_02 (>=0.02), ELSE
    - on libpostal_fingerprint with levels: AT_LEAST_ONE, ELSE
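
The Jaccard-based levels listed above can be sketched as follows. This is a plain-Python illustration, not the mismo comparer code; `jaccard_level` is a hypothetical helper, and returning 0.0 for two empty token lists is an assumption.

```python
def jaccard(a, b):
    """Jaccard similarity of two token collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # assumption: two empty sets give no evidence of a match
    return len(a & b) / len(a | b)


# Thresholds from the list above, checked from the strictest to the loosest.
LEVELS = [
    ("JACCARD_50", 0.5),
    ("JACCARD_25", 0.25),
    ("JACCARD_10", 0.1),
    ("JACCARD_02", 0.02),
]


def jaccard_level(a, b):
    """Return the name of the first level whose threshold is met, else ELSE."""
    score = jaccard(a, b)
    for name, threshold in LEVELS:
        if score >= threshold:
            return name
    return "ELSE"


print(jaccard_level(["mario", "rossi"], ["rossi", "mario", "srl"]))
print(jaccard_level(["mario", "rossi"], ["luigi", "verdi"]))
```

One consequence of the definition: a nonzero Jaccard value below 0.1 needs a union of more than 10 distinct tokens, so for short token lists the JACCARD_02 level may contain very few pairs.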

The weights after running mismo.fs.train_using_em(comparers, t, t, max_pairs=1_000_000) are as follows:

[image: comparers]

I am not an expert on the Fellegi-Sunter model, but I suspect that there shouldn't be levels where both proportions of pairs are high (e.g. the EXACT level for 'Postal Code' and the ELSE level for 'Recipients Metaphone').
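
For context on that suspicion: in the standard Fellegi-Sunter formulation each level contributes the ratio m/u, where m is the proportion of true matches in that level and u the proportion of non-matches, and the odds of a pair are the prior odds times the product of these ratios. A level where both proportions are high therefore contributes odds close to 1. A minimal sketch with made-up numbers (they are not read off the charts in this issue, and the function names are hypothetical):

```python
from math import prod


def level_odds(m, u):
    """Odds contributed by one level: P(level | match) / P(level | non-match)."""
    return m / u


def pair_odds(prior_odds, levels):
    """Combine the prior odds with the odds of every observed level."""
    return prior_odds * prod(level_odds(m, u) for m, u in levels)


# Both proportions high: the level barely moves the odds.
uninformative = level_odds(m=0.90, u=0.85)
# High m, low u: the level is strong evidence for a match.
informative = level_odds(m=0.90, u=0.01)

print(uninformative, informative)
print(pair_odds(1.0, [(0.90, 0.85), (0.90, 0.01)]))
```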

  4. Scoring the pairs leads to:

[image: match-levels]
[image: match-weights2]

As you can see, the pairs in the left half of the chart all have odds equal to 10_000_000_000.
Also, the smallest value is 1819, which is way above the "expected" (?) range between 0.01 and 100.
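
To put these numbers on a more readable scale, odds can be converted to a log10 match weight. The sketch below is only arithmetic on the values quoted above; `match_weight` is a hypothetical helper, and the last line is an illustration of how per-comparer odds multiply, not a claim about what mismo computes internally.

```python
import math


def match_weight(odds):
    """Express odds on a log10 scale, where 0 means no evidence either way."""
    return math.log10(odds)


largest = match_weight(10_000_000_000)
smallest = match_weight(1819)
print(largest, smallest)

# Five comparers that each contribute odds of 100 already multiply to 10**10.
combined = math.prod([100] * 5)
print(combined)
```

So a per-level range of 0.01 to 100 and pair odds of 10_000_000_000 are not contradictory in themselves when five comparers are multiplied together, although the fact that half of the pairs share exactly the same value still looks suspicious.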

Am I doing something obviously wrong?

P.S.: It seems that darker cells in the match levels chart correspond to the highest match levels. Isn't that a bit counterintuitive?
