Merged
Changes from 2 commits
13 changes: 7 additions & 6 deletions adlfs/spec.py
@@ -69,7 +69,7 @@
"is_current_version",
]
_ROOT_PATH = "/"
_DEFAULT_BLOCK_SIZE = 4 * 1024 * 1024
_DEFAULT_BLOCK_SIZE = 50 * 2**20
Collaborator:
We should also update the changelog in this PR. I'm thinking we can reuse the three bullet points from my earlier comment (#509 (comment)) and make the wording a bit more succinct.


_SOCKET_TIMEOUT_DEFAULT = object()

@@ -177,8 +177,7 @@ class AzureBlobFileSystem(AsyncFileSystem):
        The credentials with which to authenticate. Optional if the account URL already has a SAS token.
        Can include an instance of TokenCredential class from azure.identity.aio.
    blocksize: int
-        The block size to use for download/upload operations. Defaults to hardcoded value of
-        ``BlockBlobService.MAX_BLOCK_SIZE``
+        The block size to use for download/upload operations. Defaults to 50 MiB.
    client_id: str
        Client ID to use when authenticating using an AD Service Principal client/secret.
    client_secret: str
@@ -1879,6 +1878,8 @@ def _open(
        is versioning aware and blob versioning is enabled on the relevant container.
        """
        logger.debug(f"_open: {path}")
+        if block_size is None:
Collaborator:
We should make sure to update the block_size docstring for this _open() method to:

  1. Include "uploads" in the wording. It looks like it currently only refers to downloads.
  2. Say that the parameter is an override that, when not provided, defaults to the blocksize on the file system.
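Something along these lines as a starting point (wording is only a suggestion):

```python
block_size: int, optional
    Override for the block size (in bytes) used for uploads and
    downloads of this file. When not provided (or None), defaults
    to the blocksize configured on the file system instance.
```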

+            block_size = self.blocksize
        if not self.version_aware and version_id:
            raise ValueError(
                "version_id cannot be specified if the filesystem "
@@ -1901,7 +1902,7 @@
class AzureBlobFile(AbstractBufferedFile):
    """File-like operations on Azure Blobs"""

-    DEFAULT_BLOCK_SIZE = 5 * 2**20
+    DEFAULT_BLOCK_SIZE = _DEFAULT_BLOCK_SIZE

    def __init__(
        self,
@@ -2146,7 +2147,7 @@ async def _async_initiate_upload(self, **kwargs):

    _initiate_upload = sync_wrapper(_async_initiate_upload)

-    def _get_chunks(self, data, chunk_size=1024**3):  # Keeping the chunk size as 1 GB
+    def _get_chunks(self, data, chunk_size):
Collaborator:
Looking through this more, it probably makes sense to remove the chunk_size parameter from this helper method altogether and replace the sole reference to chunk_size with the instance property self.blocksize. We are not really overriding this value anymore, and the block size is already set in the constructor, so this keeps things simpler.
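i.e., something like this (an untested sketch; the slicing in the loop body reconstructs the elided lines):

```python
    def _get_chunks(self, data):
        # Yield successive self.blocksize-sized chunks; the final chunk
        # may be shorter when the data does not end on a block boundary.
        start = 0
        length = len(data)
        while start < length:
            end = min(start + self.blocksize, length)
            yield data[start:end]
            start = end
```

The call site in _async_upload_chunk would then go back to self._get_chunks(data).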

        start = 0
        length = len(data)
        while start < length:
@@ -2173,7 +2174,7 @@ async def _async_upload_chunk(self, final: bool = False, **kwargs):
            commit_kw["headers"] = {"If-None-Match": "*"}
        if self.mode in {"wb", "xb"}:
            try:
-                for chunk in self._get_chunks(data):
+                for chunk in self._get_chunks(data, self.blocksize):
                    async with self.container_client.get_blob_client(
                        blob=self.blob
                    ) as bc:
36 changes: 36 additions & 0 deletions adlfs/tests/test_spec.py
@@ -1,7 +1,9 @@
import datetime
+import math
import os
import tempfile
from unittest import mock
+from unittest.mock import patch

import azure.storage.blob.aio
import dask.dataframe as dd
@@ -2045,3 +2047,37 @@ def test_open_file_x(storage: azure.storage.blob.BlobServiceClient, tmpdir):
with fs.open("data/afile", "xb") as f:
pass
assert fs.cat_file("data/afile") == b"data"


@pytest.mark.parametrize("blocksize", [5 * 2**20, 50 * 2**20, 100 * 2**20])
Collaborator:
I think we should be fine just making this a non-parameterized test; I'm not sure there is much value in setting different block sizes. I'd say we set the blocksize to a few MiB and write() data that requires several blocks, making sure the last block only partially fills a block (i.e., is less than blocksize) so the logic is exercised for sizes that do not fall on block boundaries.

This should simplify the scaffolding for the case and also take less time to run.
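A rough sketch of what I have in mind (untested; it reuses the storage/mocker fixtures and CONN_STR from this module, and the exact mock setup is illustrative):

```python
def test_number_of_blocks(storage, mocker):
    blocksize = 2 * 2**20  # a small block size keeps the test fast
    fs = AzureBlobFileSystem(
        account_name=storage.account_name,
        connection_string=CONN_STR,
        blocksize=blocksize,
    )
    # Three full blocks plus a short trailing block, so the data does
    # not fit evenly on block boundaries.
    content = b"1" * (3 * blocksize + 123)

    mocker.patch(
        "azure.storage.blob.aio.BlobClient.commit_block_list", autospec=True
    )
    mock_stage_block = mocker.patch(
        "azure.storage.blob.aio.BlobClient.stage_block", autospec=True
    )
    with fs.open("data/root/a/file.txt", "wb") as f:
        f.write(content)

    assert mock_stage_block.call_count == math.ceil(len(content) / blocksize)
```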

def test_number_of_blocks(storage, mocker, blocksize):

    fs = AzureBlobFileSystem(
        account_name=storage.account_name,
        connection_string=CONN_STR,
        blocksize=blocksize,
    )

    content = b"1" * (blocksize * 2 + 1)
    with fs.open("data/root/a/file.txt", "wb", blocksize=blocksize) as f:
Collaborator:
Let's remove blocksize from this call. I'm not sure it affects anything, since the argument fs.open() actually takes is block_size, and setting it on the file system should suffice.

        mocker.patch(
            "azure.storage.blob.aio.BlobClient.commit_block_list", autospec=True
        )
        with patch(
            "azure.storage.blob.aio.BlobClient.stage_block", autospec=True
        ) as mock_stage_block:
Collaborator:
For these patch statements, is there a particular reason why:

  1. mocker.patch is not being used for both?
  2. We are not following the pattern from other test cases, where we import the BlobClient first (e.g., here) and patch it directly?
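For reference, the pattern I mean would look something like this (assuming pytest-mock's mocker fixture for both patches):

```python
from azure.storage.blob.aio import BlobClient

# Patch the class methods directly rather than via dotted string paths.
mocker.patch.object(BlobClient, "commit_block_list", autospec=True)
mock_stage_block = mocker.patch.object(BlobClient, "stage_block", autospec=True)
```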

            f.write(content)
    expected_blocks = math.ceil(len(content) / blocksize)
    actual_blocks = mock_stage_block.call_count
Collaborator:
It would also probably be worth asserting the actual data sizes used in each stage_block call.
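Something like this (a sketch; whether the payload shows up positionally or as the data keyword depends on how adlfs invokes stage_block, so adjust accordingly):

```python
staged_sizes = [
    len(call.kwargs["data"] if "data" in call.kwargs else call.args[-1])
    for call in mock_stage_block.call_args_list
]
# Every staged block except possibly the last should be exactly blocksize.
assert staged_sizes[:-1] == [blocksize] * (len(staged_sizes) - 1)
assert 0 < staged_sizes[-1] <= blocksize
```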

    assert actual_blocks == expected_blocks
Collaborator:
Since we are patching the commit block list call as well, it may be worth also asserting that the number of blocks in that call matches the expected number of blocks.
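e.g., something like this, assuming the commit patch is bound to a name such as mock_commit_block_list (hypothetical here):

```python
# The block list arrives either positionally or as the block_list kwarg.
(commit_call,) = mock_commit_block_list.call_args_list
block_list = (
    commit_call.kwargs["block_list"]
    if "block_list" in commit_call.kwargs
    else commit_call.args[-1]
)
assert len(block_list) == expected_blocks
```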



def test_block_size(storage):
Collaborator:
It might be worth parameterizing this one to test the different permutations of setting/omitting block sizes. To do this, we parameterize over the input block size for both the file system and the fs.open() call, along with the expected block size for both the file system and the file object. We'd then add cases such as (feel free to adjust these or add more):

  1. Assert defaults when block size is not set for either the file system or file object
  2. Assert file system block size propagates to the file-like object
  3. Assert that we can override the block_size for the fs.open() call
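Roughly (an untested sketch; it assumes _DEFAULT_BLOCK_SIZE is importable from adlfs.spec, and the values are placeholders):

```python
@pytest.mark.parametrize(
    "fs_blocksize, open_block_size, expected",
    [
        (None, None, _DEFAULT_BLOCK_SIZE),  # 1. defaults everywhere
        (7 * 2**20, None, 7 * 2**20),       # 2. fs blocksize propagates
        (7 * 2**20, 3 * 2**20, 3 * 2**20),  # 3. open() override wins
    ],
)
def test_block_size(storage, fs_blocksize, open_block_size, expected):
    fs_kwargs = {} if fs_blocksize is None else {"blocksize": fs_blocksize}
    fs = AzureBlobFileSystem(
        account_name=storage.account_name,
        connection_string=CONN_STR,
        **fs_kwargs,
    )
    # block_size=None should fall back to the file system's blocksize.
    with fs.open("data/root/a/file.txt", "wb", block_size=open_block_size) as f:
        assert f.blocksize == expected
```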

    fs = AzureBlobFileSystem(
        account_name=storage.account_name,
        connection_string=CONN_STR,
        blocksize=5 * 2**20,
Collaborator:
When implementing cases where we override the block size, it may be worth using values that are unlikely to be defaults (e.g., 7 * 2**20) to make it clearer that this is an override value as opposed to a possible default.

    )

    with fs.open("data/root/a/file.txt", "wb") as f:
        assert f.blocksize == 5 * 2**20
Collaborator:
We should probably add an additional parameterized test where we set block_size directly via the AzureBlobFile constructor, with cases that:

  1. Assert the default block size both when block_size is omitted and when it is None.
  2. Assert the block size is not inherited from the passed-in fs. This one may make sense as a separate test method if it does not parameterize cleanly.
  3. Assert that we can override block_size.
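A possible shape for that test (an untested sketch; it assumes the AzureBlobFile constructor accepts block_size the way fsspec's AbstractBufferedFile does, and elides any setup the constructor may need for the blob itself):

```python
@pytest.mark.parametrize(
    "file_kwargs, expected",
    [
        ({}, _DEFAULT_BLOCK_SIZE),                    # 1. omitted
        ({"block_size": None}, _DEFAULT_BLOCK_SIZE),  # 1. explicit None
        ({"block_size": 7 * 2**20}, 7 * 2**20),       # 3. override
    ],
)
def test_azure_blob_file_block_size(storage, file_kwargs, expected):
    fs = AzureBlobFileSystem(
        account_name=storage.account_name,
        connection_string=CONN_STR,
        blocksize=9 * 2**20,  # 2. should not be inherited by the file object
    )
    f = AzureBlobFile(fs, "data/root/a/file.txt", mode="wb", **file_kwargs)
    assert f.blocksize == expected
```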