I've been playing with encoding larger datasets using the cf-xarray geometry approach. Here's some code:

```python
import geopandas as gp
import xarray as xr
import xvec

url = (
    "s3://overturemaps-us-west-2/release/2024-08-20.0/theme=buildings/type=building/"
    "part-00000-2ad9544f-1d68-4a5a-805c-7a5d020d084d-c000.zstd.parquet"
)

# read the GeoParquet file (~30 s); 11,447,790 rows
df = gp.read_parquet(url, columns=["id", "geometry", "level"])

# building the Dataset and setting the geometry index is all very fast
ds = (
    xr.Dataset(df)
    .set_coords("geometry")
    .swap_dims({"dim_0": "geometry"})
    .drop_vars("dim_0")
)
ds = ds.xvec.set_geom_indexes("geometry", crs=df.crs)

# encode only 100_000 rows
%time ds_enc = ds.isel(geometry=slice(0, 100_000)).xvec.encode_cf()
# -> Wall time: 4.25 s
```
I confirmed that the scaling is roughly linear. At this rate, it will take > 500 seconds to encode the whole dataset. Decoding is about 20x faster.
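A minimal sketch of how to check the scaling, reusing `ds` from above and simply timing `encode_cf()` on progressively larger slices (exact numbers will vary by machine):

```python
# Sketch: time encode_cf() on progressively larger slices to check that the
# cost grows roughly linearly with the number of geometries.
import time

for n in (25_000, 50_000, 100_000, 200_000):
    subset = ds.isel(geometry=slice(0, n))
    t0 = time.perf_counter()
    subset.xvec.encode_cf()
    print(f"{n:>9} geometries: {time.perf_counter() - t0:.2f} s")
```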
I'm wondering whether there is some low-hanging fruit here that could be picked to optimize this.
Alternatively, we could explore storing geometries as WKB, as GeoParquet does.
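To make the WKB idea concrete, here is a rough sketch of the round trip at the array level, assuming shapely >= 2.0; this is not existing xvec API, just an illustration:

```python
# Sketch only: vectorized WKB round trip with shapely 2.x. The toy points
# stand in for the building footprints; real code would operate on
# ds.geometry.values instead.
import numpy as np
import shapely

geoms = shapely.points(np.arange(5), np.arange(5))
wkb = shapely.to_wkb(geoms)     # object array of bytes, one blob per geometry
back = shapely.from_wkb(wkb)    # vectorized decode back to geometries
assert shapely.equals_exact(geoms, back).all()
```

If per-geometry work is what dominates `encode_cf`, a single vectorized `to_wkb`/`from_wkb` pass like this might sidestep most of it, with the resulting bytes stored as a variable-length array in the target format.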