
Address parsing performance #47

@lmores

Description

I ran a very rough benchmark on a dataset of 2,192,071 Italian street addresses (which, unfortunately, I am not allowed to share).

I replaced the current implementation of postal_parse_address() with:

import ibis

# `ADDRESS_SCHEMA` is the project's struct type for parsed addresses,
# defined elsewhere.
from postal.parser import parse_address as _parse_address

@ibis.udf.scalar.python
def postal_parse_address(address_string: str) -> ADDRESS_SCHEMA:
    # Initially, the keys match the names of pypostal fields we need.
    # Later, this dict is modified to match the shape of an `ADDRESS_SCHEMA`.
    result: dict[str, str | None] = {
        "house_number": None, "road": None, "unit": None,
        "city": None, "state": None, "postcode": None, "country": None
    }

    parsed_fields = _parse_address(address_string)
    for value, label in parsed_fields:
        # Pypostal returns more fields than the ones we actually need.
        # Here `False` is used as a sentinel, under the assumption that
        # pypostal never returns it as a field value.
        current = result.get(label, False)

        # Keep only the fields declared when `result` is initialized.
        # Pypostal fields can be repeated; in that case we concatenate their values.
        if current is not False:
            result[label] = value if current is None else f"{current} {value}"

    # Hack to prepend "house_number" to "road"
    house_number = result.pop("house_number")
    if house_number is not None:
        road = result["road"]
        if road is None:
            result["road"] = house_number
        else:
            result["road"] = f"{house_number} {road}"

    # Modify `result` in-place to match the shape of an `ADDRESS_SCHEMA`.
    result["street1"] = result.pop("road")
    result["street2"] = result.pop("unit")
    result["postal_code"] = result.pop("postcode")

    return result
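For reference, the label-merging step can be exercised in isolation with mocked parser output; the sample (value, label) tuples below are hypothetical, not real pypostal output:

```python
# Standalone sketch of the label-merging step above, with mocked
# pypostal output. Labels not in WANTED are dropped; repeated labels
# have their values concatenated with a space.
WANTED = {"house_number", "road", "unit", "city", "state", "postcode", "country"}

def merge_fields(parsed_fields):
    result = {label: None for label in WANTED}
    for value, label in parsed_fields:
        current = result.get(label, False)  # False = label we do not keep
        if current is not False:
            result[label] = value if current is None else f"{current} {value}"
    return result

merged = merge_fields([
    ("via", "road"), ("roma", "road"),  # repeated label, values concatenated
    ("10", "house_number"),
    ("rione x", "suburb"),              # label we do not keep, dropped
])
print(merged["road"])          # "via roma"
print(merged["house_number"])  # "10"
print("suburb" in merged)      # False
```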

I am starting from an ibis table t backed by DuckDB with the following columns: ['record_id', 'administrative_area', 'locality', 'postal_code', 'address_lines', 'recipients', 'region_code', 'language_code', 'sorting_code', 'sublocality', 'organization'].

I add a new column containing the whole address:

t = t.mutate(
    full_address=_.address_lines + ", " + _.postal_code + ", " + _.locality
    + ", " + _.administrative_area + ", " + _.region_code
)

and then run the following snippet for both versions of postal_parse_address():

from time import time

parse_start = time()
t = t.mutate(
    libpostal_address=postal_parse_address(_.full_address)
)
print(t)
print(time() - parse_start)
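One caveat on this measurement: with a lazy backend, print(t) may only compute a small preview of the table, so the timed span may not cover parsing every row. A small helper like the one below (hypothetical, not part of the snippet above) times a callable that forces full execution, e.g. lambda: t.execute(), which makes the comparison less ambiguous:

```python
from time import perf_counter

def timed(fn):
    """Run `fn` once and return (result, elapsed seconds).

    `perf_counter` is monotonic and higher-resolution than `time()`,
    so it is better suited for benchmarking.
    """
    start = perf_counter()
    result = fn()
    return result, perf_counter() - start

# Hypothetical usage with ibis: force full materialization so every row
# is actually parsed, not just the preview that printing may show:
#   df, elapsed = timed(lambda: t.execute())
result, elapsed = timed(lambda: sum(range(1_000_000)))
print(f"{elapsed:.4f} s")
```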

The current implementation takes ~4.29 seconds; the one above takes just ~0.21 seconds.

My laptop specs:

OS: Ubuntu 21.04 x86_64 
Host: Latitude 5500 
Kernel: 5.11.0-49-generic 
CPU: Intel i7-8665U (8) @ 4.800GHz 
GPU: Intel WhiskeyLake-U GT2 [UHD Graphics 620] 
Memory: 8067MiB / 15813MiB 

@NickCrews @jstammers:

  • can you replicate this behaviour on your data sets?
  • do you have a public data set for official benchmarks?
  • how would you like to implement the official benchmarks?
