Problem Description
The current CloudPath implementation makes multiple redundant metadata API calls during common operations like open(), download_to(), and copy(). Each call to exists(), is_file(), is_dir(), and stat() results in a separate _get_metadata() call to Azure Blob Storage, even though all these properties are available from a single metadata response.
What happens during an open() call
- open() calls exists() + is_file()
- _refresh_cache() calls stat()
- download_to() calls exists() + is_file() again
On Azure, all of these end up calling the same AzureBlobClient._get_metadata(), which returns all the necessary information (existence, file/directory status, size, last modified time) in a single API call.
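The call pattern above can be reproduced with a toy model. This is only a sketch: `FakeClient`, `FakePath`, and `open_like()` are hypothetical stand-ins rather than cloudpathlib classes, and the metadata dictionary is made up for illustration. It shows how five property checks become five metadata requests when nothing is shared between them.

```python
from dataclasses import dataclass


@dataclass
class FakeClient:
    """Hypothetical stand-in for a cloud client; counts metadata requests."""

    metadata_calls: int = 0

    def _get_metadata(self, path: str) -> dict:
        # Each call models one network round trip to the storage service.
        self.metadata_calls += 1
        return {"exists": True, "is_dir": False, "size": 1024}


class FakePath:
    """Hypothetical path whose checks each fetch metadata separately."""

    def __init__(self, client: FakeClient, path: str) -> None:
        self.client = client
        self.path = path

    def exists(self) -> bool:
        return self.client._get_metadata(self.path)["exists"]

    def is_file(self) -> bool:
        meta = self.client._get_metadata(self.path)
        return meta["exists"] and not meta["is_dir"]

    def stat(self) -> dict:
        return self.client._get_metadata(self.path)

    def open_like(self) -> None:
        # Mirrors the sequence described in this issue.
        self.exists()
        self.is_file()  # open() checks
        self.stat()  # _refresh_cache() check
        self.exists()
        self.is_file()  # download_to() checks


def count_calls() -> int:
    client = FakeClient()
    FakePath(client, "container/blob.txt").open_like()
    return client.metadata_calls
```

In this model `count_calls()` reports five requests for one logical open, although a single response already contains every field that was read.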
Performance Impact
After removing the redundant calls, I was able to achieve:
- ~2× speedup for 1 MB downloads
- ~1.5× speedup for 10 MB downloads
Proposal
There are two possible solutions:
Option 1: Azure-specific optimization
Optimize this in AzureBlobClient and AzureBlobPath.
Implementation:
- Add _get_blob_properties() to AzureBlobClient that returns all the needed information in one call
- Store the result of AzureBlobClient._get_blob_properties() at the start of e.g. AzureBlobPath.open()
- Pass metadata between internal methods to avoid redundant calls
- Alternatively implement metadata caching/invalidation logic
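The caching alternative from the last bullet could look roughly like the sketch below. Everything here is an assumption for illustration: `BlobMeta`, `MetadataCache`, the `fetch` callable, and the time-to-live policy are hypothetical and not part of cloudpathlib or the Azure SDK.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class BlobMeta:
    """Hypothetical snapshot of everything one metadata response provides."""

    exists: bool
    is_directory: bool
    size: int = 0


class MetadataCache:
    """Caches metadata per key for a short time and supports invalidation."""

    def __init__(
        self,
        fetch: Callable[[str], BlobMeta],
        ttl_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch  # the function that performs the real request
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, BlobMeta]] = {}

    def get(self, key: str) -> BlobMeta:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh, no request needed
        meta = self._fetch(key)
        self._entries[key] = (now, meta)
        return meta

    def invalidate(self, key: str) -> None:
        # Call after writes, deletes, or uploads so stale data is not reused.
        self._entries.pop(key, None)
```

Passing metadata explicitly between methods avoids the staleness questions that a cache raises, which may be why the bullets list caching only as an alternative.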
Example:
def open(self, mode="r", **kwargs):
    meta = self.client._get_blob_properties(self)  # Single call
    if meta.exists and meta.is_directory:
        raise CloudPathIsADirectoryError(...)
    if mode == "x" and meta.exists:
        raise CloudPathFileExistsError(...)
    self._refresh_cache_with_meta(meta, **kwargs)  # Reuse metadata
    # ... rest of implementation

Option 2: CloudPath optimization
Change the Client API and optimize CloudPath.
- Modify the Client API to explicitly require a _get_metadata() method that will fetch all the required data
- Similar optimization to CloudPath as described in Option 1
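A rough sketch of what Option 2 might mean at the interface level follows. The names `PathMetadata`, `BaseClient`, `BasePath`, `InMemoryClient`, and `check_openable()` are hypothetical, and built-in exceptions stand in for the library's own error types. The real Client and CloudPath classes would differ in detail.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class PathMetadata:
    """Hypothetical result type shared by all backends."""

    exists: bool
    is_dir: bool
    size: Optional[int] = None


class BaseClient(ABC):
    """Every backend must supply all metadata in a single call."""

    @abstractmethod
    def _get_metadata(self, path: str) -> PathMetadata:
        ...


class InMemoryClient(BaseClient):
    """Toy backend used only to exercise the interface."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        self.files = files
        self.calls = 0

    def _get_metadata(self, path: str) -> PathMetadata:
        self.calls += 1
        if path in self.files:
            return PathMetadata(True, False, len(self.files[path]))
        # Treat any key prefix as a directory, as object stores often do.
        is_dir = any(k.startswith(path.rstrip("/") + "/") for k in self.files)
        return PathMetadata(is_dir, is_dir)


class BasePath:
    """Backend-agnostic path that validates using one metadata fetch."""

    def __init__(self, client: BaseClient, path: str) -> None:
        self.client = client
        self.path = path

    def check_openable(self, mode: str = "r") -> PathMetadata:
        meta = self.client._get_metadata(self.path)  # single request
        if meta.exists and meta.is_dir:
            raise IsADirectoryError(self.path)
        if mode == "x" and meta.exists:
            raise FileExistsError(self.path)
        return meta  # callers can reuse this instead of re-fetching
```

Making the method abstract means each backend has to state how it gathers existence, type, and size together, so the shared path logic never needs to issue separate checks.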
PR for Option 1 coming