Redundant metadata API calls in Azure Blob Storage operations #533

@pavelyanu

Problem Description

The current CloudPath implementation makes multiple redundant metadata API calls during common operations like open(), download_to(), and copy(). Each call to exists(), is_file(), is_dir(), and stat() results in a separate _get_metadata() call to Azure Blob Storage, even though all these properties are available from a single metadata response.

What happens during an open() call

  • open() calls exists() + is_file()
  • _refresh_cache() calls stat()
  • download_to() calls exists() + is_file() again

On Azure, all of these end up calling the same AzureBlobClient._get_metadata(), which returns all the necessary information (existence, file/directory status, size, last modified time) in a single API call.
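To make the call pattern above concrete, here is a minimal sketch that counts metadata requests during a single open(). FakeClient and FakePath are hypothetical stand-ins, not the real cloudpathlib classes, and the exact nesting of the calls (for example, that _refresh_cache() triggers download_to()) is an assumption made for illustration.

```python
class FakeClient:
    """Hypothetical stand-in for AzureBlobClient that counts metadata requests."""

    def __init__(self):
        self.calls = 0

    def _get_metadata(self, path):
        self.calls += 1  # in the real client, each call is one network round trip
        return {"exists": True, "is_dir": False, "size": 1024}


class FakePath:
    """Hypothetical stand-in for a cloud path; every check refetches metadata."""

    def __init__(self, client):
        self.client = client

    def exists(self):
        return self.client._get_metadata(self)["exists"]

    def is_file(self):
        return not self.client._get_metadata(self)["is_dir"]

    def stat(self):
        return self.client._get_metadata(self)["size"]

    def download_to(self):
        # Re-validates the path even though open() already did.
        return self.exists() and self.is_file()

    def _refresh_cache(self):
        self.stat()
        self.download_to()  # assumed nesting, for illustration only

    def open(self):
        if self.exists() and self.is_file():
            self._refresh_cache()


client = FakeClient()
FakePath(client).open()
print(client.calls)  # 5
```

In this sketch, one open() issues five metadata requests, although the first response already contains everything the later checks need.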

Performance Impact

After removing the redundant calls, I was able to achieve:

  • ~2× speedup for 1 MB downloads
  • ~1.5× speedup for 10 MB downloads

Proposal

There are two possible solutions:

Option 1: Azure-specific optimization

Optimize this in AzureBlobClient and AzureBlobPath.

Implementation:

  • Add _get_blob_properties() to AzureBlobClient that returns all the needed information in one call
  • Store the result of AzureBlobClient._get_blob_properties() at the start of e.g. AzureBlobPath.open()
  • Pass metadata between internal methods to avoid redundant calls
  • Alternatively implement metadata caching/invalidation logic

Example:

def open(self, mode="r", **kwargs):
    meta = self.client._get_blob_properties(self)  # Single call
    if meta.exists and meta.is_directory:
        raise CloudPathIsADirectoryError(...)
    if mode == "x" and meta.exists:
        raise CloudPathFileExistsError(...)
    self._refresh_cache_with_meta(meta, **kwargs)  # Reuse metadata
    # ... rest of implementation
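The caching/invalidation alternative listed above could look roughly like the following sketch. All names here (BlobMeta, MetadataCache, the ttl_seconds parameter) are hypothetical and not part of any existing API. The time-based expiry is just one possible invalidation policy, chosen to keep the example short.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class BlobMeta:
    """Hypothetical container for everything one metadata response provides."""
    exists: bool
    is_directory: bool = False
    size: int = 0


class MetadataCache:
    """Hypothetical short-lived cache keyed by blob path."""

    def __init__(self, fetch: Callable[[str], BlobMeta], ttl_seconds: float = 5.0):
        self._fetch = fetch  # function that performs the real metadata request
        self._ttl = ttl_seconds
        self._entries: Dict[str, Tuple[float, BlobMeta]] = {}

    def get(self, key: str) -> BlobMeta:
        now = time.monotonic()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: no request is made
        meta = self._fetch(key)
        self._entries[key] = (now, meta)
        return meta

    def invalidate(self, key: str) -> None:
        # Call after any operation that changes the blob (upload, delete, ...).
        self._entries.pop(key, None)


fetches = []


def fake_fetch(key: str) -> BlobMeta:
    fetches.append(key)  # record each simulated request
    return BlobMeta(exists=True, size=1024)


cache = MetadataCache(fake_fetch, ttl_seconds=60.0)
cache.get("container/blob.txt")
cache.get("container/blob.txt")         # served from the cache
cache.invalidate("container/blob.txt")  # e.g. after an upload
cache.get("container/blob.txt")         # fetched again
print(len(fetches))  # 2
```

Compared with passing the metadata between internal methods, a cache needs no signature changes, but it can serve stale results if the blob is modified by another process before the entry expires.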

Option 2: CloudPath optimization

Change the Client API and optimize CloudPath itself.

  • Modify the Client API to explicitly require a _get_metadata() method that fetches all the required data in one call
  • Apply the same kind of optimization to CloudPath as described in Option 1
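As a rough illustration of Option 2, the sketch below makes _get_metadata() an abstract method on a base Client, so that generic code can rely on a single metadata fetch. PathMetadata, InMemoryClient, and check_readable are hypothetical names, and the built-in exceptions stand in for the library's own error types; this is not the existing cloudpathlib interface.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class PathMetadata:
    """Hypothetical result type returned by every client."""
    exists: bool
    is_dir: bool = False
    size: Optional[int] = None


class Client(ABC):
    """Hypothetical base class: every backend must implement _get_metadata()."""

    @abstractmethod
    def _get_metadata(self, path: str) -> PathMetadata:
        ...


class InMemoryClient(Client):
    """Toy backend used only to exercise the interface."""

    def __init__(self, blobs: Dict[str, bytes]):
        self.blobs = blobs
        self.calls = 0

    def _get_metadata(self, path: str) -> PathMetadata:
        self.calls += 1
        if path in self.blobs:
            return PathMetadata(exists=True, is_dir=False, size=len(self.blobs[path]))
        # Treat a path as a directory if any blob lives under it.
        prefix = path.rstrip("/") + "/"
        is_dir = any(key.startswith(prefix) for key in self.blobs)
        return PathMetadata(exists=is_dir, is_dir=is_dir)


def check_readable(client: Client, path: str) -> PathMetadata:
    """Backend-agnostic validation that needs exactly one metadata fetch."""
    meta = client._get_metadata(path)
    if not meta.exists:
        raise FileNotFoundError(path)
    if meta.is_dir:
        raise IsADirectoryError(path)
    return meta


client = InMemoryClient({"data/a.txt": b"hello"})
meta = check_readable(client, "data/a.txt")
print(meta.size, client.calls)  # 5 1
```

The trade-off is that this changes the contract for every backend, not just Azure, so each client would need its own implementation of the method.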

A PR for Option 1 is coming.
