@kylebarron kylebarron commented Aug 29, 2025

Currently, obstore only supports listing by path segments. So if you pass a prefix into list_with_delimiter or list, it is assumed to be a full path segment. This means that it's currently impossible to efficiently perform the desired query from #494:

I have tons of log files with data at the beginning for example 202506272215_blabla on S3 I can use prefix as substring, basically I can get all files for this day by my_folder/20250627* but it's not working in obstore.
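
To make the distinction concrete, here is a minimal sketch in plain Python of the two matching rules. The helper names `matches_segment_prefix` and `matches_substring_prefix` are hypothetical and are not part of obstore or object_store. They are a simplified model of the semantics described above and ignore edge cases such as a path that exactly equals the prefix.

```python
def matches_segment_prefix(path: str, prefix: str) -> bool:
    """Segment-based matching: the prefix must end on a path-segment boundary."""
    path_parts = path.split("/")
    prefix_parts = [part for part in prefix.split("/") if part]
    return path_parts[: len(prefix_parts)] == prefix_parts


def matches_substring_prefix(path: str, prefix: str) -> bool:
    """Substring matching: any leading run of characters counts."""
    return path.startswith(prefix)


# Example paths modeled on the log-file naming from the quoted issue.
paths = [
    "my_folder/202506272215_blabla",
    "my_folder/202506280001_other",
]
prefix = "my_folder/20250627"

# "20250627" is only part of a segment, so segment matching finds nothing.
segment_hits = [p for p in paths if matches_segment_prefix(p, prefix)]
# Substring matching finds the file from that day.
substring_hits = [p for p in paths if matches_substring_prefix(p, prefix)]
```

Under this model, `segment_hits` is empty while `substring_hits` contains only the `20250627` file, which is the behavior the issue asks for.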

object_store supports substring-based prefix listing in its PaginatedListStore API. So if I use that and provide my own pagination -> stream conversion, then I should be able to essentially match the current list API.
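
The pagination -> stream conversion could look roughly like the following Python sketch. It is only an illustration of the idea, not the Rust code in this PR: `paginate_to_stream`, `fake_fetch`, and the `(items, next_token)` page shape are all assumptions made for the example.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# Assumed page shape: (items, next_token); next_token is None on the last page.
Page = Tuple[List[str], Optional[str]]


def paginate_to_stream(
    fetch_page: Callable[[Optional[str]], Page],
) -> Iterator[List[str]]:
    """Lazily yield one batch per non-empty page.

    The next page is requested only when the consumer asks for more.
    """
    token: Optional[str] = None
    while True:
        items, token = fetch_page(token)
        if items:
            yield items
        if token is None:
            return


# Hypothetical in-memory stand-in for a paginated listing backend.
FAKE_PAGES = {
    None: (["logs/20250627_a", "logs/20250627_b"], "t1"),
    "t1": ([], "t2"),  # the sketch tolerates an empty page that still has a token
    "t2": (["logs/20250627_c"], None),
}

calls: List[Optional[str]] = []


def fake_fetch(token: Optional[str]) -> Page:
    calls.append(token)  # record each request to show the laziness
    return FAKE_PAGES[token]


stream = paginate_to_stream(fake_fetch)
first_batch = next(stream)
calls_after_first = list(calls)  # only the first page has been requested so far
remaining = list(stream)
```

Because the generator pauses at each `yield`, later pages are fetched only on demand, which is the property that lets the result stand in for a stream-based list API.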

However, this PaginatedListStore is only implemented for S3, Azure, and GCS. It's not implemented for HTTPStore or LocalStore, because those don't have a concept of pagination. See apache/arrow-rs-object-store#388.

This means that to support ...


... or, better idea, in obstore.list we:

  • Avoid type erasure, so instead of bringing in an Arc<dyn ObjectStore>, we have essentially an enum of the different stores
  • Implement S3/GCS/Azure via PaginatedListStore, to support efficient querying of substring prefix
  • Implement LocalStore/HTTPStore via a transform on the stream from ObjectStore::list, so that we never materialize the entire stream
    • That would mean removing this implementation of PaginatedListStore
    • Keep fetching the stream until a batch with valid responses exists, so that we don't return empty batches.
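
The LocalStore/HTTPStore path in the list above can be sketched as a lazy filter over batches. This is a plain-Python illustration under stated assumptions: `filter_batches` and `fake_list_stream` are hypothetical names, and the real implementation operates on a Rust stream from ObjectStore::list, not on Python generators.

```python
from typing import Iterable, Iterator, List


def filter_batches(batches: Iterable[List[str]], prefix: str) -> Iterator[List[str]]:
    """Yield only non-empty batches of paths that start with `prefix`.

    Batches are consumed one at a time, so the full listing is never
    materialized in memory.
    """
    for batch in batches:
        kept = [path for path in batch if path.startswith(prefix)]
        if kept:  # keep pulling from the source instead of yielding []
            yield kept


def fake_list_stream() -> Iterator[List[str]]:
    # Hypothetical stand-in for the batches a full listing would produce.
    yield ["my_folder/20250626_a", "my_folder/20250626_b"]  # nothing matches
    yield ["my_folder/20250627_a", "my_folder/20250628_a"]  # one match
    yield ["my_folder/20250627_b"]  # one match


filtered = list(filter_batches(fake_list_stream(), "my_folder/20250627"))
```

The first source batch contains no matches, so it is skipped entirely and the consumer never sees an empty batch.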

For now, as a first pass, we'll only use this to improve obstore.list, while not touching list_with_delimiter. Later we can explore making that return type a stream as well.

Closes #494

@kylebarron

The latest two commits added an implementation of substring-match list, partially written by Claude.

  • Fix clippy lints
  • Add test using minio? I need to have a python test that runs on the paginated implementation too
  • Propagate error through the stream
    Err(_e) => {
  • Make create_filtered_stream more concise. Did I write a helper for that initially?
  • Use set equality for paths in Python tests
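
For the last item, a minimal sketch of what set equality buys in a test, presumably because the order of listed paths may differ between store implementations. The path values and variable names below are made up for illustration and do not come from the obstore test suite.

```python
# Hypothetical expected paths for a substring-prefix query.
expected = {"my_folder/20250627_a", "my_folder/20250627_b"}

# Two hypothetical listings that return the same paths in different orders.
listed_by_store_a = ["my_folder/20250627_a", "my_folder/20250627_b"]
listed_by_store_b = ["my_folder/20250627_b", "my_folder/20250627_a"]

# Comparing as lists would make the test depend on listing order.
order_sensitive_equal = listed_by_store_a == listed_by_store_b

# Comparing as sets only checks which paths were returned.
assert set(listed_by_store_a) == expected
assert set(listed_by_store_b) == expected
```

One trade-off of this style is that a set comparison hides duplicate paths, so a separate length check may be worth adding if duplicates matter.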

