Skip to content

Conversation

mukhery
Copy link

@mukhery mukhery commented May 29, 2023

fsspec's generic filesystem _cp_file functionality checks the size of the file before reading it, causing the currently s3 async implementation to fail. This PR calls _info to retrieve and cache the file size when needed. Once this bug in fsspec is fixed (fsspec/filesystem_spec#1281) it should be possible to use the generic filesystem cp functionality with s3fs.

@mukhery
Copy link
Author

mukhery commented Jun 1, 2023

@martindurant in order to get the size here, needed by _cp_file, we have to call the async _info function, which then makes size async

@martindurant
Copy link
Member

How about #745 as an alternative without the async attribute? In that case, the size is only available after the first read, but requires no extra call.

@mukhery
Copy link
Author

mukhery commented Jun 5, 2023

How about #745 as an alternative without the async attribute? In that case, the size is only available after the first read, but requires no extra call.

Wouldn't the workflow of checking size and then transferring data not work? https://github.com/fsspec/filesystem_spec/blob/386a084ffb7f8194265056e19f53ffd252a89e20/fsspec/generic.py#L283
I imagine some like this also enables rsync-like capabilities, where you check size and then transfer only if the size doesn't match.

@martindurant
Copy link
Member

You are right about that ...

However, the rsync idea you mention is already implemented in https://github.com/fsspec/filesystem_spec/blob/386a084ffb7f8194265056e19f53ffd252a89e20/fsspec/generic.py#L36 and includes getting all the info for all the files ahead of time. For s3, calling find once will be much faster that calling info on each file, even if done asynchronously.

@mukhery
Copy link
Author

mukhery commented Jun 5, 2023

You are right about that ...

However, the rsync idea you mention is already implemented in https://github.com/fsspec/filesystem_spec/blob/386a084ffb7f8194265056e19f53ffd252a89e20/fsspec/generic.py#L36 and includes getting all the info for all the files ahead of time. For s3, calling find once will be much faster that calling info on each file, even if done asynchronously.

That makes sense. Looking at the _cp_file code again, maybe just adding self.size = None in the __init__ for #745 will allow it to work as well, because it will call read once and then subsequent loop iterations will have size defined.

@martindurant
Copy link
Member

OK, I did that.

@mukhery
Copy link
Author

mukhery commented Jun 5, 2023

closing in favor of #745

@mukhery mukhery closed this Jun 5, 2023
@mukhery mukhery deleted the add_size branch June 11, 2023 21:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants