
scalability of geometry encoding #534

@rabernat

Description

I've been playing with encoding larger datasets using the cf-xarray geometry approach. Here's some code:

import geopandas as gp
import xvec
import xarray as xr

url = (
    "s3://overturemaps-us-west-2/release/2024-08-20.0/theme=buildings/type=building/"
    "part-00000-2ad9544f-1d68-4a5a-805c-7a5d020d084d-c000.zstd.parquet"
)

# ~ 30s
df = gp.read_parquet(url, columns=['id', 'geometry', 'level'])
# 11,447,790 rows

# all very fast
ds = xr.Dataset(df).set_coords("geometry").swap_dims({"dim_0": "geometry"}).drop_vars("dim_0")
ds = ds.xvec.set_geom_indexes('geometry', crs=df.crs)

# encode only 100_000 rows
%time ds_enc = ds.isel(geometry=slice(0, 100_000)).xvec.encode_cf()
# -> Wall time: 4.25 s

I confirmed that the scaling is roughly linear. At this rate (~4.25 s per 100,000 rows, and there are ~114 such chunks in the 11,447,790 rows), it will take on the order of 500 seconds to encode the whole dataset. Decoding is about 20x faster.
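For reference, a loop like this (sizes chosen arbitrarily; nothing xvec-specific) is enough to see the roughly linear scaling:

import time

for n in (25_000, 50_000, 100_000, 200_000):
    t0 = time.perf_counter()
    ds.isel(geometry=slice(0, n)).xvec.encode_cf()  # encode a growing prefix
    print(f"{n:>9,} rows: {time.perf_counter() - t0:.2f} s")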

I'm wondering whether there is some low-hanging fruit here that would speed up encoding.
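For example, profiling the 100,000-row encode with the standard-library profiler should show where the time actually goes (nothing xvec-specific, just cProfile):

import cProfile
import pstats

subset = ds.isel(geometry=slice(0, 100_000))
with cProfile.Profile() as prof:  # profile only the encode step
    subset.xvec.encode_cf()
pstats.Stats(prof).sort_stats("cumulative").print_stats(20)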

Alternatively, we could explore storing geometries as WKB, as GeoParquet does; a rough sketch of that conversion is below.
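The conversion itself is vectorized and cheap in shapely 2.x. A minimal sketch of what a WKB-backed variable could look like (this is plain shapely, not an existing xvec/cf-xarray API, and the geometry_wkb name is made up):

import shapely

geoms = ds.geometry.values   # object array of shapely geometries
wkb = shapely.to_wkb(geoms)  # one WKB bytes blob per geometry

# hypothetical layout: keep the blobs as a plain variable along the existing
# "geometry" dimension instead of exploding nodes into CF node/part arrays
ds_wkb = ds.drop_vars("geometry").assign(geometry_wkb=("geometry", wkb))

# round-trip back to shapely geometries
restored = shapely.from_wkb(ds_wkb.geometry_wkb.values)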
