
scalability of geometry encoding #534

@rabernat

Description

I've been playing with encoding larger datasets using the cf-xarray geometry approach. Here's some code:

import geopandas as gp
import xvec
import xarray as xr

url = (
    "s3://overturemaps-us-west-2/release/2024-08-20.0/theme=buildings/type=building/"
    "part-00000-2ad9544f-1d68-4a5a-805c-7a5d020d084d-c000.zstd.parquet"
)

# ~ 30s
df = gp.read_parquet(url, columns=['id', 'geometry', 'level'])
# 11,447,790 rows

# all very fast
ds = xr.Dataset(df).set_coords("geometry").swap_dims({"dim_0": "geometry"}).drop_vars("dim_0")
ds = ds.xvec.set_geom_indexes('geometry', crs=df.crs)

# encode only 100_000 rows
%time ds_enc = ds.isel(geometry=slice(0, 100_000)).xvec.encode_cf()
# -> Wall time: 4.25 s

I confirmed that the scaling is roughly linear. At this rate (~4.25 s per 100,000 rows, and there are ~114 such chunks in the 11,447,790 rows), it will take on the order of 500 seconds to encode the whole dataset. Decoding is about 20x faster.
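For reference, a loop like this (sizes chosen arbitrarily; nothing xvec-specific) is enough to see the roughly linear scaling:

import time

for n in (25_000, 50_000, 100_000, 200_000):
    t0 = time.perf_counter()
    ds.isel(geometry=slice(0, n)).xvec.encode_cf()  # encode a growing prefix
    print(f"{n:>9,} rows: {time.perf_counter() - t0:.2f} s")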

I'm wondering whether there is some low-hanging fruit here that would speed up encoding.
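For example, profiling the 100,000-row encode with the standard-library profiler should show where the time actually goes (nothing xvec-specific, just cProfile):

import cProfile
import pstats

subset = ds.isel(geometry=slice(0, 100_000))
with cProfile.Profile() as prof:  # profile only the encode step
    subset.xvec.encode_cf()
pstats.Stats(prof).sort_stats("cumulative").print_stats(20)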

Alternatively, we could explore storing geometries as WKB, as GeoParquet does; a rough sketch of that conversion is below.
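The conversion itself is vectorized and cheap in shapely 2.x. A minimal sketch of what a WKB-backed variable could look like (this is plain shapely, not an existing xvec/cf-xarray API, and the geometry_wkb name is made up):

import shapely

geoms = ds.geometry.values   # object array of shapely geometries
wkb = shapely.to_wkb(geoms)  # one WKB bytes blob per geometry

# hypothetical layout: keep the blobs as a plain variable along the existing
# "geometry" dimension instead of exploding nodes into CF node/part arrays
ds_wkb = ds.drop_vars("geometry").assign(geometry_wkb=("geometry", wkb))

# round-trip back to shapely geometries
restored = shapely.from_wkb(ds_wkb.geometry_wkb.values)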
