perf: Use file_metadata_cache in geoparquet #294
base: main
Conversation
Well that was suspiciously simple...
let file_metadata_cache =
    state.runtime_env().cache_manager.get_file_metadata_cache();
with_file_metadata_cache() is called for each iteration of the loop (.map()), so we need a clone for each separate iteration. get_file_metadata_cache() already returns a cloned Arc, so there's no need to call another .clone(). The sketch below illustrates the pattern.
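A minimal, self-contained sketch of that ownership pattern (the type names here are illustrative stand-ins, not the PR's actual code): each .map() iteration moves its own Arc handle into the value it produces, and cloning an Arc only bumps a reference count rather than copying the cache.

use std::sync::Arc;

// Hypothetical stand-ins for the cache and a per-file config.
struct FileMetadataCache;
struct FileConfig {
    cache: Arc<FileMetadataCache>,
}

fn main() {
    // Mirrors get_file_metadata_cache(): the caller receives a cloned
    // Arc, i.e. an independent handle it already owns.
    let cache: Arc<FileMetadataCache> = Arc::new(FileMetadataCache);

    let files = vec!["a.parquet", "b.parquet", "c.parquet"];

    // Each iteration clones the Arc for its own FileConfig; no extra
    // .clone() is needed on the handle returned above.
    let configs: Vec<FileConfig> = files
        .iter()
        .map(|_f| FileConfig {
            cache: Arc::clone(&cache),
        })
        .collect();

    // One handle per file plus the original.
    assert_eq!(Arc::strong_count(&cache), configs.len() + 1);
}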
It would be nice to see how effective this is, with some benchmarks involving geoparquet reading. That doesn't seem to exist yet, does it?
paleolimbot left a comment
Thank you!
Can you also add the with_file_metadata_cache() line here?
sedona-db/rust/sedona-geoparquet/src/file_opener.rs
Lines 116 to 120 in f49016d
let parquet_metadata =
    DFParquetMetadata::new(&self_clone.object_store, &file_meta.object_meta)
        .with_metadata_size_hint(self_clone.metadata_size_hint)
        .fetch_metadata()
        .await?;
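A sketch of what the requested change might look like, not the PR's final code: it assumes the opener carries a cache handle (a hypothetical self_clone.file_metadata_cache field, populated from state.runtime_env().cache_manager.get_file_metadata_cache()), and that with_file_metadata_cache() accepts an Option, which may differ by DataFusion version.

// Sketch only: thread the session's file metadata cache into the
// metadata fetch so repeated opens of the same file can skip the
// footer read.
let parquet_metadata =
    DFParquetMetadata::new(&self_clone.object_store, &file_meta.object_meta)
        .with_metadata_size_hint(self_clone.metadata_size_hint)
        // Hypothetical field; the real plumbing may differ.
        .with_file_metadata_cache(Some(self_clone.file_metadata_cache.clone()))
        .fetch_metadata()
        .await?;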
This seems a bit slower than on main and on 0.1 at the moment, and I don't see an impact on multiple calls to read_parquet(). Querying a big table with lots of Parquet files is probably a good way to check (though it might be better to list files individually for this particular benchmark, since there's also a cost to querying S3 to list the files).
import os
os.environ["AWS_SKIP_SIGNATURE"] = "true"
os.environ["AWS_DEFAULT_REGION"] = "us-west-2"
import sedona.db
sd = sedona.db.connect()
# 16s on main, 22s on this PR
sd.read_parquet(
"s3://overturemaps-us-west-2/release/2025-10-22.0/theme=buildings/type=building/"
).to_view("buildings")
# Second time: 16s on main, 19s on this PR
sd.read_parquet(
"s3://overturemaps-us-west-2/release/2025-10-22.0/theme=buildings/type=building/"
).to_view("buildings", overwrite=True)

From https://datafusion.apache.org/blog/2025/08/15/external-parquet-indexes/, I wonder if there's something in the ListingTable we need to port over to SedonaContext::read_parquet() as well (or something we need to propagate in the FileConfig).
Seems like you're right. Here's what I got in terms of numbers (4 trials each, before adding the extra changes). I need more time to investigate.
Use file metadata cache for geoparquet
closes #250