
Address parsing performance #47

@lmores

Description

I ran a very rough benchmark on a dataset of 2,192,071 Italian street addresses (which, unfortunately, I am not allowed to share).

I replaced the current implementation of postal_parse_address() with:

import ibis

# `ADDRESS_SCHEMA` is the project's struct type for parsed addresses,
# defined elsewhere.
from postal.parser import parse_address as _parse_address

@ibis.udf.scalar.python
def postal_parse_address(address_string: str) -> ADDRESS_SCHEMA:
    # Initially, the keys match the names of pypostal fields we need.
    # Later, this dict is modified to match the shape of an `ADDRESS_SCHEMA`.
    result: dict[str, str | None] = {
        "house_number": None, "road": None, "unit": None,
        "city": None, "state": None, "postcode": None, "country": None
    }

    parsed_fields = _parse_address(address_string)
    for value, label in parsed_fields:
        # Pypostal returns more fields than the ones we actually need.
        # Here `False` is used as a sentinel, under the assumption that
        # pypostal never returns it as a field value.
        current = result.get(label, False)

        # Keep only the fields declared when `result` is initialized.
        # Pypostal fields can be repeated; in that case we concatenate their values.
        if current is not False:
            result[label] = value if current is None else f"{current} {value}"

    # Hack to prepend "house_number" to "road"
    house_number = result.pop("house_number")
    if house_number is not None:
        road = result["road"]
        if road is None:
            result["road"] = house_number
        else:
            result["road"] = f"{house_number} {road}"

    # Modify `result` in-place to match the shape of an `ADDRESS_SCHEMA`.
    result["street1"] = result.pop("road")
    result["street2"] = result.pop("unit")
    result["postal_code"] = result.pop("postcode")

    return result
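For reference, the label-merging step can be exercised in isolation with mocked parser output; the sample (value, label) tuples below are hypothetical, not real pypostal output:

```python
# Standalone sketch of the label-merging step above, with mocked
# pypostal output. Labels not in WANTED are dropped; repeated labels
# have their values concatenated with a space.
WANTED = {"house_number", "road", "unit", "city", "state", "postcode", "country"}

def merge_fields(parsed_fields):
    result = {label: None for label in WANTED}
    for value, label in parsed_fields:
        current = result.get(label, False)  # False = label we do not keep
        if current is not False:
            result[label] = value if current is None else f"{current} {value}"
    return result

merged = merge_fields([
    ("via", "road"), ("roma", "road"),  # repeated label, values concatenated
    ("10", "house_number"),
    ("rione x", "suburb"),              # label we do not keep, dropped
])
print(merged["road"])          # "via roma"
print(merged["house_number"])  # "10"
print("suburb" in merged)      # False
```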

I am starting from an ibis table t backed by DuckDB with the following columns: ['record_id', 'administrative_area', 'locality', 'postal_code', 'address_lines', 'recipients', 'region_code', 'language_code', 'sorting_code', 'sublocality', 'organization'].

I add a new column containing the whole address:

t = t.mutate(
    full_address=_.address_lines + ", " + _.postal_code + ", " + _.locality
    + ", " + _.administrative_area + ", " + _.region_code
)

and then run the following snippet for both versions of postal_parse_address():

from time import time

parse_start = time()
t = t.mutate(
    libpostal_address=postal_parse_address(_.full_address)
)
print(t)
print(time() - parse_start)
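One caveat on this measurement: with a lazy backend, print(t) may only compute a small preview of the table, so the timed span may not cover parsing every row. A small helper like the one below (hypothetical, not part of the snippet above) times a callable that forces full execution, e.g. lambda: t.execute(), which makes the comparison less ambiguous:

```python
from time import perf_counter

def timed(fn):
    """Run `fn` once and return (result, elapsed seconds).

    `perf_counter` is monotonic and higher-resolution than `time()`,
    so it is better suited for benchmarking.
    """
    start = perf_counter()
    result = fn()
    return result, perf_counter() - start

# Hypothetical usage with ibis: force full materialization so every row
# is actually parsed, not just the preview that printing may show:
#   df, elapsed = timed(lambda: t.execute())
result, elapsed = timed(lambda: sum(range(1_000_000)))
print(f"{elapsed:.4f} s")
```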

The current implementation takes ~4.29 seconds; the one above takes just ~0.21 seconds.

My laptop specs:

OS: Ubuntu 21.04 x86_64 
Host: Latitude 5500 
Kernel: 5.11.0-49-generic 
CPU: Intel i7-8665U (8) @ 4.800GHz 
GPU: Intel WhiskeyLake-U GT2 [UHD Graphics 620] 
Memory: 8067MiB / 15813MiB 

@NickCrews @jstammers:

  • can you replicate this behaviour on your data sets?
  • do you have a public data set for official benchmarks?
  • how would you like to implement the official benchmarks?
