
Conversation

@petern48 petern48 commented Nov 9, 2025

Use file metadata cache for geoparquet

closes #250

petern48 commented Nov 9, 2025

Well, that was suspiciously simple...

Comment on lines +187 to +188
let file_metadata_cache =
    state.runtime_env().cache_manager.get_file_metadata_cache();
petern48 (Collaborator, Author) commented:

with_file_metadata_cache() is called on each iteration of the loop (.map()), so we need a separate clone for each iteration. get_file_metadata_cache() already returns a cloned Arc, so there's no need to call .clone() again.

https://github.com/apache/datafusion/blob/28755b1d7eb5222a8f5fb5417134dd6865ac1311/datafusion/execution/src/cache/cache_manager.rs#L174-L176
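
For illustration, a minimal self-contained sketch of the Arc-per-iteration pattern described above (a plain Arc<String> stands in for the actual cache type; none of these names are SedonaDB's real code):

use std::sync::Arc;

fn main() {
    // Stand-in for the shared file metadata cache.
    let cache: Arc<String> = Arc::new("file metadata cache".to_string());

    // Each .map() iteration takes its own handle to the cache.
    let per_file_caches: Vec<Arc<String>> =
        (0..3).map(|_| Arc::clone(&cache)).collect();

    // All handles point at the same underlying value. A getter that
    // returns Arc<T> by value already hands back a fresh clone, so
    // calling .clone() on its result would be redundant.
    assert_eq!(Arc::strong_count(&cache), 4);
    println!("{} handles share one cache", per_file_caches.len());
}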

petern48 commented Nov 9, 2025

It would be nice to see how effective this is with some benchmarks involving GeoParquet reading. Those don't seem to exist yet, do they?

@petern48 petern48 marked this pull request as ready for review November 9, 2025 23:45

@paleolimbot paleolimbot (Member) left a comment

Thank you!

Can you also add the with_file_metadata_cache() line here?

let parquet_metadata =
    DFParquetMetadata::new(&self_clone.object_store, &file_meta.object_meta)
        .with_metadata_size_hint(self_clone.metadata_size_hint)
        .fetch_metadata()
        .await?;
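
For concreteness, a hedged sketch of what the suggested change might look like, assuming with_file_metadata_cache() accepts an Option wrapping the Arc returned by get_file_metadata_cache() (the exact signature is an assumption) and that state is in scope as in the earlier snippet:

// Hypothetical sketch, not the PR's actual diff.
let file_metadata_cache =
    state.runtime_env().cache_manager.get_file_metadata_cache();
let parquet_metadata =
    DFParquetMetadata::new(&self_clone.object_store, &file_meta.object_meta)
        .with_metadata_size_hint(self_clone.metadata_size_hint)
        .with_file_metadata_cache(Some(file_metadata_cache))
        .fetch_metadata()
        .await?;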

At the moment this seems a bit slower than both main and 0.1, and I don't see an impact on repeated calls to read_parquet(). Querying a big table with lots of Parquet files is probably a good way to check (though it might be better to list the files individually for this particular benchmark, since there's also a cost to querying S3 to list them).

import os
os.environ["AWS_SKIP_SIGNATURE"] = "true"
os.environ["AWS_DEFAULT_REGION"] = "us-west-2"
import sedona.db

sd = sedona.db.connect()

# 16s on main, 22s on this PR
sd.read_parquet(
    "s3://overturemaps-us-west-2/release/2025-10-22.0/theme=buildings/type=building/"
).to_view("buildings")

# Second time: 16s on main, 19s on this PR
sd.read_parquet(
    "s3://overturemaps-us-west-2/release/2025-10-22.0/theme=buildings/type=building/"
).to_view("buildings", overwrite=True)

From https://datafusion.apache.org/blog/2025/08/15/external-parquet-indexes/ , I wonder if there's something in the ListingTable we need to port over to SedonaContext::read_parquet() as well (or something we need to propagate in the FileConfig).

petern48 (Collaborator, Author) commented:

Seems like you're right. Here are the numbers I got (4 trials each, before adding the extra changes):

# On this branch (seconds):
20.807693004608154, 19.300298929214478, 30.610652208328247, 20.58704400062561

# On main (seconds):
19.879024744033813, 17.96437692642212, 21.73752498626709, 17.080259084701538

Need more time to investigate.

@petern48 petern48 marked this pull request as draft November 12, 2025 05:58