Incorrect coordinate blocking with missing values #76

@jstammers

I've discovered that mismo.lib.geo.CoordinateBlocker doesn't handle missing values as I'd expect.

If a record is missing a coordinate value, I would not expect it to be blocked with any other record, because the computed distance would be NaN.

The following example shows that records with a null coordinate value are nevertheless blocked together:

from mismo.lib.geo import CoordinateBlocker
import ibis
ibis.options.interactive = True

con = ibis.get_backend()
data = [
    {"record_id": 1, "lat": 1, "lon": 1},
    {"record_id": 2, "lat": 2, "lon": None},
    {"record_id": 3, "lat": 3, "lon": None},
]
table = con.create_table("test", ibis.memtable(data), overwrite=True)

blocker = CoordinateBlocker(lat="lat", lon="lon", distance_km=1000)

blocker(table, table)
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ record_id_l ┃ record_id_r ┃ lat_l ┃ lat_r ┃ lon_l   ┃ lon_r   ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ int64       │ int64       │ int64 │ int64 │ float64 │ float64 │
├─────────────┼─────────────┼───────┼───────┼─────────┼─────────┤
│           2 │           3 │     2 │     3 │    NULL │    NULL │
└─────────────┴─────────────┴───────┴───────┴─────────┴─────────┘

In this case, I can see that mismo.lib.geo.distance_km evaluates to NULL, yet the pair is still returned.

I think this can be resolved by modifying the logic here so that it returns null if either lat or lon is null.
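To make the behaviour I'd expect concrete, here is a minimal plain-Python sketch. It is not mismo's or ibis's actual implementation: `distance_km_or_none` and `should_block` are hypothetical names used only for illustration, and the haversine formula is an assumption about how the distance is computed. A missing coordinate yields no distance, and a pair with no distance is never blocked.

```python
import math
from typing import Optional

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, used by the haversine formula


def distance_km_or_none(lat1, lon1, lat2, lon2) -> Optional[float]:
    """Haversine distance in km, or None if any coordinate is missing.

    Hypothetical helper: stands in for a null-propagating distance.
    """
    coords = (lat1, lon1, lat2, lon2)
    if any(c is None or math.isnan(c) for c in coords):
        return None
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def should_block(left: dict, right: dict, max_km: float) -> bool:
    """Block a pair only when the distance is known and within max_km."""
    d = distance_km_or_none(left["lat"], left["lon"], right["lat"], right["lon"])
    return d is not None and d <= max_km


# Same records as in the example above.
records = [
    {"record_id": 1, "lat": 1, "lon": 1},
    {"record_id": 2, "lat": 2, "lon": None},
    {"record_id": 3, "lat": 3, "lon": None},
]

# Compare each unordered pair once.
pairs = [
    (l["record_id"], r["record_id"])
    for l in records
    for r in records
    if l["record_id"] < r["record_id"] and should_block(l, r, max_km=1000)
]
print(pairs)
```

With these three records no pair should be produced, because records 2 and 3 have no longitude and record 1 has nothing else with complete coordinates to match against.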
