Faster file-like writes #508
@@ -2156,6 +2156,15 @@ def _get_chunks(self, data):
             yield data[start:end]
             start = end

+    async def _upload(self, chunk, block_id, semaphore):
+        async with semaphore:
+            async with self.container_client.get_blob_client(blob=self.blob) as bc:
+                await bc.stage_block(
+                    block_id=block_id,
+                    data=chunk,
+                    length=len(chunk),
+                )
+
     async def _async_upload_chunk(self, final: bool = False, **kwargs):
         """
         Write one part of a multi-block file upload
@@ -2175,17 +2184,22 @@ async def _async_upload_chunk(self, final: bool = False, **kwargs):
             commit_kw["headers"] = {"If-None-Match": "*"}
         if self.mode in {"wb", "xb"}:
             try:
-                for chunk in self._get_chunks(data):
-                    async with self.container_client.get_blob_client(
-                        blob=self.blob
-                    ) as bc:
-                        await bc.stage_block(
-                            block_id=block_id,
-                            data=chunk,
-                            length=len(chunk),
-                        )
-                    self._block_list.append(block_id)
-                    block_id = self._get_block_id(self._block_list)
+                max_concurrency = self.fs.max_concurrency
+                semaphore = asyncio.Semaphore(max_concurrency)
Review comment: FYI: this effectively makes a "pool", whereas fsspec upstream has a run-coroutines-in-chunks ability. The latter is less efficient but easier to reason about. It may be worthwhile upstreaming this at some point. See a similar idea in fsspec/filesystem_spec#1908.

Review comment: (the chunks version: https://github.com/fsspec/filesystem_spec/blob/master/fsspec/asyn.py#L204)

Review comment: Upstreaming makes sense to me. We originally did not look upstream in fsspec core to see if there was anything to leverage, but I think generalizing this semaphore approach upstream would definitely be useful to enable concurrency in more places and not need to duplicate the logic.
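The two styles discussed above can be contrasted with a minimal sketch. The `work` and `just` coroutines are hypothetical stand-ins for `stage_block` calls, and `chunked_style` is a simplified illustration of the batch idea, not the actual `fsspec.asyn._run_coros_in_chunks` implementation:

```python
import asyncio


async def work(i, semaphore):
    # Semaphore "pool" style: every coroutine is scheduled up front,
    # but at most `max_concurrency` of them run the critical section
    # at the same time.
    async with semaphore:
        await asyncio.sleep(0)  # stand-in for bc.stage_block(...)
        return i


async def pool_style(n, max_concurrency):
    semaphore = asyncio.Semaphore(max_concurrency)
    return await asyncio.gather(*(work(i, semaphore) for i in range(n)))


async def chunked_style(coros, batch_size):
    # Run-coroutines-in-chunks style: fully await one batch before
    # starting the next. Simpler to reason about, but a slow task
    # stalls its entire batch.
    results = []
    for start in range(0, len(coros), batch_size):
        results.extend(await asyncio.gather(*coros[start:start + batch_size]))
    return results


async def main():
    pooled = await pool_style(10, 4)

    async def just(i):
        await asyncio.sleep(0)
        return i

    batched = await chunked_style([just(i) for i in range(10)], 4)
    return pooled, batched


pooled, batched = asyncio.run(main())
print(pooled)   # results come back in submission order
print(batched)
```

Both return results positionally, which is why either approach keeps block ordering intact.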
+                tasks = []
+                block_ids = self._block_list or []
+                start_idx = len(block_ids)
+                chunks = list(self._get_chunks(data))
+
+                for _ in range(len(chunks)):
+                    block_ids.append(block_id)
+                    block_id = self._get_block_id(block_ids)
+
+                if chunks:
+                    self._block_list = block_ids
+
+                for chunk, block_id in zip(chunks, block_ids[start_idx:]):
+                    tasks.append(self._upload(chunk, block_id, semaphore))
+
+                await asyncio.gather(*tasks)

             if final:
                 block_list = [BlobBlock(_id) for _id in self._block_list]
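One property this change relies on is that `asyncio.gather` awaits and returns results positionally: blocks may finish staging in any order, but `self._block_list` preserves the order in which block IDs were generated, so the final commit sees a correctly ordered list. A minimal sketch of that guarantee (the `stage` coroutine is hypothetical, simulating uploads that complete out of order):

```python
import asyncio
import random


async def stage(block_id, semaphore):
    # Simulate out-of-order completion: each "upload" takes a random
    # amount of time under the concurrency limit.
    async with semaphore:
        await asyncio.sleep(random.uniform(0, 0.01))
        return block_id


async def main():
    block_ids = [f"block-{i:05d}" for i in range(20)]
    semaphore = asyncio.Semaphore(4)
    results = await asyncio.gather(*(stage(b, semaphore) for b in block_ids))
    # Tasks finish in arbitrary order, but gather's results are
    # positional, matching the original block_ids list exactly.
    assert results == block_ids
    return results


results = asyncio.run(main())
print(len(results))
```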
@@ -2140,3 +2140,31 @@ def test_blobfile_default_blocksize(storage):
         "data/root/a/file.txt",
     )
     assert f.blocksize == 50 * 2**20
+@pytest.mark.parametrize(
+    "max_concurrency, blob_size",
+    [
+        (1, 51 * 2**20),
+        (4, 200 * 2**20),
+        (4, 49 * 2**20),
+    ],

Review comment: These are some good initial cases. I'm not sure how long these tests currently take, but it would be interesting to add a case that sets the blocksize to 5 MiB and a total blob size of 200 MiB, and then have cases that:

I also like setting the block size lower for some of these cases to better stress the system and help catch any concurrency issues that could arise in the future.

+)
+def test_max_concurrency(storage, max_concurrency, blob_size):
+    fs = AzureBlobFileSystem(
+        account_name=storage.account_name,
+        connection_string=CONN_STR,
+        max_concurrency=max_concurrency,
+    )
+    data = os.urandom(blob_size)
+    fs.mkdir("large-file-container")
+
+    path = "large-file-container/blob.txt"
+
+    with fs.open(path, "wb") as f:
+        f.write(data)
+
+    assert fs.exists(path)
+    assert fs.size(path) == blob_size
+
+    with fs.open(path, "rb") as f:
+        assert f.read() == data
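The reviewer's suggested 5 MiB-blocksize case, and the existing parametrized cases against the default 50 MiB blocksize shown above, can be sanity-checked with a small arithmetic sketch. The `block_count` helper is illustrative only; it just counts how many `stage_block` calls a write of a given size would issue:

```python
import math

MiB = 2**20


def block_count(blob_size: int, blocksize: int) -> int:
    # Number of blocks a blob of blob_size bytes splits into; the last
    # block may be shorter when blob_size is not a multiple of blocksize.
    return math.ceil(blob_size / blocksize)


# Reviewer's suggested stress case: 5 MiB blocks over a 200 MiB blob.
print(block_count(200 * MiB, 5 * MiB))   # 40

# The existing parametrized cases against the default 50 MiB blocksize:
print(block_count(51 * MiB, 50 * MiB))   # 2 (one full block + a 1 MiB tail)
print(block_count(200 * MiB, 50 * MiB))  # 4 (divides evenly)
print(block_count(49 * MiB, 50 * MiB))   # 1 (below blocksize: single block)
```

A smaller blocksize yields many more in-flight blocks per write, which is what exercises the semaphore under contention.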
Review comment: For the changelog entry, let's avoid mentioning private methods and instead word it from the perspective of someone who is consuming adlfs. For example: