
Conversation

@petern48 petern48 commented Nov 9, 2025

Use file metadata cache for geoparquet

closes #250

petern48 commented Nov 9, 2025

Well, that was suspiciously simple...

Comment on lines +187 to +188
let file_metadata_cache =
    state.runtime_env().cache_manager.get_file_metadata_cache();
petern48 (Collaborator, Author) commented:

with_file_metadata_cache() is called on each iteration of the loop (.map()), so we need a separate clone for each iteration. get_file_metadata_cache() already returns a cloned Arc, so there's no need to call .clone() again.

https://github.com/apache/datafusion/blob/28755b1d7eb5222a8f5fb5417134dd6865ac1311/datafusion/execution/src/cache/cache_manager.rs#L174-L176
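
For illustration, a minimal self-contained sketch of the Arc-per-iteration pattern described above (a plain Arc<String> stands in for the actual cache type; none of these names are SedonaDB's real code):

use std::sync::Arc;

fn main() {
    // Stand-in for the shared file metadata cache.
    let cache: Arc<String> = Arc::new("file metadata cache".to_string());

    // Each .map() iteration takes its own handle to the cache.
    let per_file_caches: Vec<Arc<String>> =
        (0..3).map(|_| Arc::clone(&cache)).collect();

    // All handles point at the same underlying value. A getter that
    // returns Arc<T> by value already hands back a fresh clone, so
    // calling .clone() on its result would be redundant.
    assert_eq!(Arc::strong_count(&cache), 4);
    println!("{} handles share one cache", per_file_caches.len());
}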

petern48 commented Nov 9, 2025

It would be nice to see how effective this is with some benchmarks involving GeoParquet reading. Those don't seem to exist yet, do they?

@petern48 petern48 marked this pull request as ready for review November 9, 2025 23:45

@paleolimbot paleolimbot (Member) left a comment

Thank you!

Can you also add the with_file_metadata_cache() line here?

let parquet_metadata =
    DFParquetMetadata::new(&self_clone.object_store, &file_meta.object_meta)
        .with_metadata_size_hint(self_clone.metadata_size_hint)
        .fetch_metadata()
        .await?;
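
For concreteness, a hedged sketch of what the suggested change might look like, assuming with_file_metadata_cache() accepts an Option wrapping the Arc returned by get_file_metadata_cache() (the exact signature is an assumption) and that state is in scope as in the earlier snippet:

// Hypothetical sketch, not the PR's actual diff.
let file_metadata_cache =
    state.runtime_env().cache_manager.get_file_metadata_cache();
let parquet_metadata =
    DFParquetMetadata::new(&self_clone.object_store, &file_meta.object_meta)
        .with_metadata_size_hint(self_clone.metadata_size_hint)
        .with_file_metadata_cache(Some(file_metadata_cache))
        .fetch_metadata()
        .await?;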

At the moment this seems a bit slower than both main and 0.1, and I don't see an impact on repeated calls to read_parquet(). Querying a big table with lots of Parquet files is probably a good way to check (though it might be better to list the files individually for this particular benchmark, since there's also a cost to querying S3 to list them).

import os
os.environ["AWS_SKIP_SIGNATURE"] = "true"
os.environ["AWS_DEFAULT_REGION"] = "us-west-2"
import sedona.db

sd = sedona.db.connect()

# 16s on main, 22s on this PR
sd.read_parquet(
    "s3://overturemaps-us-west-2/release/2025-10-22.0/theme=buildings/type=building/"
).to_view("buildings")

# Second time: 16s on main, 19s on this PR
sd.read_parquet(
    "s3://overturemaps-us-west-2/release/2025-10-22.0/theme=buildings/type=building/"
).to_view("buildings", overwrite=True)

From https://datafusion.apache.org/blog/2025/08/15/external-parquet-indexes/ , I wonder if there's something in the ListingTable we need to port over to SedonaContext::read_parquet() as well (or something we need to propagate in the FileConfig).

petern48 (Collaborator, Author) commented:

Seems like you're right. Here are the numbers I got (4 trials each, before adding the extra changes):

# On this branch (seconds):
20.807693004608154, 19.300298929214478, 30.610652208328247, 20.58704400062561

# On main (seconds):
19.879024744033813, 17.96437692642212, 21.73752498626709, 17.080259084701538

Need more time to investigate.

@petern48 petern48 marked this pull request as draft November 12, 2025 05:58