Description
I am playing with mismo to deduplicate postal addresses in a set of about 10k entries.
After the expectation-maximization step, half of the record pairs have odds equal to 10_000_000_000, which makes it quite hard to choose a threshold that separates true matches from false ones.
As I cannot share the underlying dataset, I will try to completely avoid sharing my (messy) code, and rather describe what I am doing. Hopefully it will be enough to get some hint from you.
Deduplication Steps
- Each record has the following fields:
  - `record_id`
  - `recipients`: a string with the name of the recipient
  - `recipients_metaphone`: the result of `double_metaphone()` on the field `recipients`
  - `recipients_tokens`: a sequence of tokens obtained by splitting the `recipients` field on whitespace
  - `address_lines`: a string
  - `address_lines_tokens`: a sequence of tokens obtained by splitting the `address_lines` field on whitespace and discarding terms that appear in more than 5% of the dataset (using `mismo.arrays.array_filter_isin_other()` and `mismo.sets.rare_terms()`)
  - `full_address`: the whole address (not including the recipient's name)
  - `libpostal_address`: the address parsed by pypostal using `mismo.lib.geo.postal_parse_address()`
  - `libpostal_fingerprint`: the list of fingerprints returned by pypostal using `mismo.lib.geo.postal_fingerprint_address()`
- Blocking using the following rules:
[
    mismo.block.KeyBlocker("recipients", name="Recipients Exact"),
    mismo.block.KeyBlocker("libpostal_fingerprint", name="Fingerprint"),
    mismo.block.KeyBlocker("recipients_metaphone", name="Recipients phonetic"),
]
The corresponding upset chart is as follows:
- Use the following comparators:
  - on the `postal_code` field with levels: EXACT, MAJOR (same first two digits), ELSE
  - on `recipients_tokens` with the `jaccard` function and levels: JACCARD_50 (>=0.5), JACCARD_25 (>=0.25), JACCARD_10 (>=0.1), JACCARD_02 (>=0.02), ELSE
  - on `recipients_metaphone` with the `jaccard` function and levels: EXACT (jaccard >= 0.50), MAYBE (jaccard >= 0.2), ELSE
  - on `address_lines_tokens` with the `jaccard` function and levels: JACCARD_50 (>=0.5), JACCARD_25 (>=0.25), JACCARD_10 (>=0.1), JACCARD_02 (>=0.02), ELSE
  - on `libpostal_fingerprint` with levels: AT_LEAST_ONE, ELSE
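To make the token preprocessing and the Jaccard levels concrete without sharing the real code, here is a minimal plain-Python sketch of the logic. It does not use the mismo API: the helpers `frequent_terms`, `jaccard` and `jaccard_level` are hypothetical stand-ins, and the toy addresses are made up. Only the thresholds and the 5% cutoff come from the description above.

```python
from collections import Counter

# Thresholds taken from the comparator description, highest level first.
LEVELS = [
    ("JACCARD_50", 0.5),
    ("JACCARD_25", 0.25),
    ("JACCARD_10", 0.1),
    ("JACCARD_02", 0.02),
]


def frequent_terms(token_lists, max_share=0.05):
    """Return the terms that occur in more than `max_share` of the records."""
    n_records = len(token_lists)
    # Count each term once per record (document frequency).
    doc_freq = Counter(tok for toks in token_lists for tok in set(toks))
    return {tok for tok, count in doc_freq.items() if count / n_records > max_share}


def jaccard(a, b):
    """Jaccard similarity of two token sequences, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # assumption: two empty token lists are treated as dissimilar
    return len(a & b) / len(a | b)


def jaccard_level(a, b):
    """Map a pair of token sequences to the first level whose threshold is met."""
    score = jaccard(a, b)
    for name, threshold in LEVELS:
        if score >= threshold:
            return name
    return "ELSE"


# Toy data (made up): split on whitespace, then drop overly common terms.
addresses = ["12 rue de la paix", "14 rue de la gare", "3 avenue foch"]
tokens = [addr.split() for addr in addresses]
common = frequent_terms(tokens, max_share=0.5)  # high cutoff because the toy set is tiny
filtered = [[t for t in toks if t not in common] for toks in tokens]
level = jaccard_level(filtered[0], filtered[1])
```

With these levels, any pair sharing at least one token out of fifty distinct ones already reaches JACCARD_02, so ELSE is reserved for pairs with almost no token overlap.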
The weights after running `mismo.fs.train_using_em(comparers, t, t, max_pairs=1_000_000)` are as follows:
I am not an expert on the Fellegi-Sunter model, but I suspect that there shouldn't be levels where both proportions of pairs are high (e.g. the EXACT level for Postal Code and the ELSE level in 'Recipients Metaphone').
- Scoring the pairs leads to
As you can see, the pairs in the left half of the chart all have odds equal to 10_000_000_000.
Also, the smallest value is 1819, which is way above the "expected" (?) range between 0.01 and 100.
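For context, my understanding of the Fellegi-Sunter model is that the odds of a pair are the prior odds multiplied, for each comparer, by the ratio m/u of the level that the pair falls into (m being the share of true matches at that level, u the share of non-matches). The sketch below is plain Python, not mismo's implementation; `match_odds` and `odds_to_probability` are hypothetical helpers and the m/u values are made up, not the trained weights. It only illustrates how quickly the product can grow when several comparers each have a small u.

```python
def match_odds(prior_odds, level_probs):
    """Multiply the prior odds by m/u for the observed level of each comparer.

    `level_probs` is a sequence of (m, u) tuples, one per comparer.
    """
    odds = prior_odds
    for m, u in level_probs:
        odds *= m / u
    return odds


def odds_to_probability(odds):
    """Convert odds to a match probability."""
    return odds / (1.0 + odds)


# Made-up example: five comparers, each with m = 0.9 and u = 0.01 at the observed level.
example_odds = match_odds(1.0, [(0.9, 0.01)] * 5)  # 90 ** 5, about 5.9e9
example_probability = odds_to_probability(example_odds)
```

If this reading is right, five factors of 90 are already enough to approach the 10_000_000_000 value seen in the chart, so a u estimate that collapses toward zero for a common level would push many pairs to the same extreme odds.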
Am I doing something obviously wrong?
P.S.: it seems that darker cells in the match levels chart correspond to the highest match levels. Isn't it a bit counterintuitive?



