Faster file-like writes #508
Changes from all commits
ab37ff9
21fb656
df76154
208b3e7
2d277c3
d26d6c3
6686c93
5efa176
96573d5
7023bf4
7a05f94
@@ -1,6 +1,8 @@
import datetime
import math
import os
import random
import string
import tempfile
from unittest import mock

@@ -2140,3 +2142,36 @@ def test_blobfile_default_blocksize(storage):
        "data/root/a/file.txt",
    )
    assert f.blocksize == 50 * 2**20

@pytest.mark.parametrize(
    "max_concurrency, blob_size, blocksize",
    [
        (None, 200 * 2**20, 5 * 2**20),
        (1, 51 * 2**20, 50 * 2**20),
        (4, 200 * 2**20, 50 * 2**20),
        (4, 49 * 2**20, 50 * 2**20),
        (4, 200 * 2**20, 5 * 2**20),
    ],
These are some good initial cases. I'm not sure how long these tests currently take, but it would be interesting to add a case that sets the blocksize to 5 MiB and a total blob size of 200 MiB and then have cases that:

I also like setting the block size lower for some of these cases to better stress the system and help catch any concurrency issues that could arise in the future.
)
def test_write_max_concurrency(storage, max_concurrency, blob_size, blocksize):
    fs = AzureBlobFileSystem(
        account_name=storage.account_name,
        connection_string=CONN_STR,
        blocksize=blocksize,
        max_concurrency=max_concurrency,
    )
    data = os.urandom(blob_size)
    container_name = "".join(random.choices(string.ascii_lowercase, k=10))
    fs.mkdir(container_name)
    path = f"{container_name}/blob.txt"

    with fs.open(path, "wb") as f:
        f.write(data)

    assert fs.exists(path)
    assert fs.size(path) == blob_size

    with fs.open(path, "rb") as f:
        assert f.read() == data
    fs.rm(container_name, recursive=True)
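For readers skimming the diff, here is a minimal sketch of how the new max_concurrency option would be exercised from user code, mirroring the test above. The account name, connection string, and container name are placeholders, and the comment assumes (as the test parameters suggest) that max_concurrency bounds how many blocks are uploaded in parallel during a file-like write:

import os

from adlfs import AzureBlobFileSystem

# Placeholder credentials; substitute your own.
CONN_STR = "DefaultEndpointsProtocol=..."

# blocksize controls how large each staged block is; max_concurrency caps how
# many of those blocks a single write flushes in parallel.
fs = AzureBlobFileSystem(
    account_name="myaccount",
    connection_string=CONN_STR,
    blocksize=5 * 2**20,
    max_concurrency=4,
)

with fs.open("mycontainer/blob.bin", "wb") as f:
    f.write(os.urandom(20 * 2**20))  # uploaded as 5 MiB blocks, up to 4 in flight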
FYI: this effectively makes a "pool", whereas fsspec upstream has a run-coroutines-in-chunks ability. The latter is less efficient but easier to reason about. It may be worthwhile upstreaming this at some point. See a similar idea in fsspec/filesystem_spec#1908.
(the chunks version: https://github.com/fsspec/filesystem_spec/blob/master/fsspec/asyn.py#L204)
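To make the contrast concrete, a rough sketch of the chunked pattern being referenced (a hypothetical helper, not fsspec's actual signature): each fixed-size batch is awaited to completion before the next one starts, so a single slow request stalls its whole batch.

import asyncio

async def run_in_chunks(coros, batch_size):
    # Await each fixed-size batch in full before starting the next batch.
    results = []
    for i in range(0, len(coros), batch_size):
        results.extend(await asyncio.gather(*coros[i : i + batch_size]))
    return results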
Upstreaming makes sense to me. We originally did not look at fsspec core to see if there was anything to leverage, but I think generalizing this semaphore approach upstream would definitely be useful: it would enable concurrency in more places without needing to duplicate the logic.
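For comparison with the chunked sketch above, a minimal sketch of the semaphore-style pool being discussed (illustrative only, not the code in this PR):

import asyncio

async def gather_with_concurrency(coros, max_concurrency):
    # The semaphore caps how many coroutines run at once, but a new coroutine
    # can start as soon as any slot frees up, rather than waiting for a full
    # batch to finish.
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(bounded(c) for c in coros))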