I ran a very rough benchmark on a dataset of 2_192_071 Italian street addresses (which, unfortunately, I am not allowed to share).
I replaced the current implementation of postal_parse_address() with:
import ibis
from postal.parser import parse_address as _parse_address


@ibis.udf.scalar.python
def postal_parse_address(address_string: str) -> ADDRESS_SCHEMA:
    # Initially, the keys match the names of pypostal fields we need.
    # Later, this dict is modified to match the shape of an `ADDRESS_SCHEMA`.
    result: dict[str, str | None] = {
        "house_number": None, "road": None, "unit": None,
        "city": None, "state": None, "postcode": None, "country": None
    }

    parsed_fields = _parse_address(address_string)
    for value, label in parsed_fields:
        # Pypostal returns more fields than the ones we actually need.
        # Here `False` is used as a placeholder under the assumption that
        # such a value is never returned by pypostal as a field value.
        current = result.get(label, False)

        # Keep only the fields declared when `result` is initialized.
        # Pypostal fields can be repeated; in that case we concatenate their values.
        if current is not False:
            result[label] = value if current is None else f"{current} {value}"

    # Hack to prepend "house_number" to "road"
    house_number = result.pop("house_number")
    if house_number is not None:
        road = result["road"]
        if road is None:
            result["road"] = house_number
        else:
            result["road"] = f"{house_number} {road}"

    # Modify `result` in-place to match the shape of an `ADDRESS_SCHEMA`.
    result["street1"] = result.pop("road")
    result["street2"] = result.pop("unit")
    result["postal_code"] = result.pop("postcode")

    return result

I am starting from an ibis table t backed by duckdb with the following columns: ['record_id', 'administrative_area', 'locality', 'postal_code', 'address_lines', 'recipients', 'region_code', 'language_code', 'sorting_code', 'sublocality', 'organization']
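To make the merging rules in the function above concrete, here is a minimal sketch that applies the same steps to a hand-written list of (value, label) pairs, so it runs without pypostal or ibis. The helper name `reshape_parsed_fields` and the sample pairs are made up for illustration and are not real libpostal output.

```python
# Hypothetical helper: applies the same merging rules as the UDF above,
# but takes the (value, label) pairs directly instead of calling pypostal.
def reshape_parsed_fields(parsed_fields):
    result = {
        "house_number": None, "road": None, "unit": None,
        "city": None, "state": None, "postcode": None, "country": None,
    }
    for value, label in parsed_fields:
        # Skip labels we did not declare above.
        if label not in result:
            continue
        current = result[label]
        # Repeated labels are joined with a single space.
        result[label] = value if current is None else f"{current} {value}"

    # Prepend the house number to the road, if there is one.
    house_number = result.pop("house_number")
    if house_number is not None:
        road = result["road"]
        result["road"] = house_number if road is None else f"{house_number} {road}"

    # Rename keys to the target shape.
    result["street1"] = result.pop("road")
    result["street2"] = result.pop("unit")
    result["postal_code"] = result.pop("postcode")
    return result


# Made-up sample pairs: "road" is repeated and "suburb" is not a declared key.
sample = [
    ("via", "road"), ("roma", "road"), ("12", "house_number"),
    ("monti", "suburb"), ("00184", "postcode"), ("roma", "city"), ("it", "country"),
]
reshaped = reshape_parsed_fields(sample)
print(reshaped)
```

The sketch tests membership with `label not in result` instead of the `False` placeholder; both keep only the declared keys, and the placeholder variant saves one dict lookup per field.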
I add a new column containing the whole address:
t = t.mutate(
    full_address=_.address_lines + ", " + _.postal_code + ", " + _.locality + ", " + _.administrative_area + ", " + _.region_code
)

and then run the following snippet for both versions of postal_parse_address():
from time import time

parse_start = time()
t = t.mutate(
    libpostal_address=postal_parse_address(_.full_address)
)
print(t)
print(time() - parse_start)

The current implementation takes 4.291484832763672 seconds; the one above takes just 0.206512451171875 seconds, roughly 20 times faster.
My laptop specs:
OS: Ubuntu 21.04 x86_64
Host: Latitude 5500
Kernel: 5.11.0-49-generic
CPU: Intel i7-8665U (8) @ 4.800GHz
GPU: Intel WhiskeyLake-U GT2 [UHD Graphics 620]
Memory: 8067MiB / 15813MiB
- can you replicate this behaviour on your data sets?
- do you have a public data set for official benchmarks?
- how would you like to implement the official benchmarks?
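As a starting point for the last question, here is a minimal sketch of a repeatable timing harness. It is only an assumption about what an official benchmark could look like: the address generator, the `benchmark` function, and `stand_in_parser` are all hypothetical, and the stand-in would need to be swapped for the real parsing function. It times plain Python calls only, not the ibis/duckdb pipeline.

```python
import random
import time
from typing import Callable


def make_synthetic_addresses(n: int, seed: int = 0) -> list[str]:
    """Generate made-up Italian-looking addresses, reproducible via `seed`."""
    rng = random.Random(seed)
    streets = ["Via Roma", "Corso Italia", "Piazza Garibaldi", "Viale Europa"]
    cities = [("Roma", "RM", "00184"), ("Milano", "MI", "20121"), ("Torino", "TO", "10121")]
    addresses = []
    for _ in range(n):
        street = rng.choice(streets)
        city, province, postcode = rng.choice(cities)
        number = rng.randint(1, 200)
        addresses.append(f"{street} {number}, {postcode}, {city}, {province}, IT")
    return addresses


def benchmark(parse: Callable[[str], object], addresses: list[str], repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` full passes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for address in addresses:
            parse(address)
        best = min(best, time.perf_counter() - start)
    return best


def stand_in_parser(address: str) -> list[str]:
    # Placeholder for the real parser: just splits on commas.
    return [part.strip() for part in address.split(",")]


addresses = make_synthetic_addresses(10_000)
elapsed = benchmark(stand_in_parser, addresses)
print(f"{len(addresses)} addresses in {elapsed:.4f} s")
```

Taking the minimum over several passes reduces noise from other processes, and a fixed seed means both implementations see exactly the same input.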