Cannot fetch data from a data source because there is a file with size 0 and no name which does not pass the sanity check but I would like to fetch the other files #4641

@jkfindeisen

I'm using Windows, Python 3.11, and quilt3 version 7.0.0 installed with pip. I want to fetch data from https://open.quiltdata.com/b/cellpainting-gallery/tree/cpg0023-mpi/mpi/images/Batch1/images/C2018-04-10.00-181207-A/2018-12-08/48519/, following the site's Get files / Code instructions for fetching files with the quilt3 Python API.

However, this code:

import quilt3 as q3

if __name__ == "__main__":
    b = q3.Bucket("s3://cellpainting-gallery")
    b.fetch("cpg0023-mpi/mpi/images/Batch1/images/C2018-04-10.00-181207-A/2018-12-08/48519/",
            "file:///M:/Temporary/48519/")

throws:

Traceback (most recent call last):
  File "test.py", line 5, in <module>    
    b.fetch("cpg0023-mpi/mpi/images/Batch1/images/C2018-04-10.00-181207-A/2018-12-08/48519/",
  File "..\Lib\site-packages\quilt3\bucket.py", line 184, in fetch
    copy_file(source, dest)
  File "..\Lib\site-packages\quilt3\data_transfer.py", line 901, in copy_file
    sanity_check(rel_path)
  File "..\Lib\site-packages\quilt3\data_transfer.py", line 891, in sanity_check
    raise ValueError("Invalid relative path: %r" % rel_path)
ValueError: Invalid relative path: ''

Stopping in the debugger at line 901 of copy_file in data_transfer.py and inspecting the result of list_url(src), I see that the files to be copied are

[('', 0), ('181207_A01_s1_w12B5D7C20-9D24-4794-A524-9E4F2B881179.tif', 2337756), ('181207_A01_s1_w2058D7D88-B06E-4FBF-BAAD-C684F63EDD3E.tif', ...

That is, the listing contains an entry with an empty name and size 0. This entry fails sanity_check, which raises the error shown above.
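A possible cause (an assumption, not verified against this bucket) is a zero-byte "folder" placeholder object whose key is exactly the prefix being fetched. If relative paths are derived by stripping the prefix from each key, such an object ends up with the relative path ''. The sketch below only illustrates that mechanism; `relative_paths`, the prefix, and the keys are hypothetical and are not quilt3 code.

```python
def relative_paths(prefix, objects):
    """Yield (rel_path, size) for every (key, size) pair whose key is under prefix."""
    for key, size in objects:
        if key.startswith(prefix):
            # Stripping the prefix from a key equal to the prefix leaves ''.
            yield key[len(prefix):], size


# Hypothetical listing: the first key is a zero-byte placeholder for the "folder".
prefix = "data/plate/"
objects = [
    ("data/plate/", 0),
    ("data/plate/a.tif", 2337756),
]

listing = list(relative_paths(prefix, objects))
print(listing)  # [('', 0), ('a.tif', 2337756)]
```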

I don't need this entry, and for my purposes sanity_check seems too strict. A better approach might be to simply skip the empty, unnamed entry.

I edited data_transfer.py from line 900 onward to read:

        for rel_path, size in list_url(src):
            # Skip zero-size entries such as the unnamed ('', 0) one
            if size > 0:
                sanity_check(rel_path)
                url_list.append((src.join(rel_path), dest.join(rel_path), size))

That does the trick for me: no error is thrown and I can download all the files that I want (all the files with size > 0).
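A variation that may be safer as a general fix: filtering on size > 0 also drops legitimately empty files that do have a name, so the check could look at the name instead. The sketch below is self-contained and does not touch quilt3; `named_entries` and `fetch_each` are hypothetical helpers, and the `fetch` argument is assumed to be a callable with the same (key, destination) arguments as the `b.fetch` call in the snippet at the top.

```python
def named_entries(listing):
    """Keep every (rel_path, size) pair with a non-empty rel_path.

    Zero-byte files with a real name are kept; only the unnamed entry is dropped.
    """
    return [(rel_path, size) for rel_path, size in listing if rel_path]


def fetch_each(listing, prefix, dest_root, fetch):
    """Fetch each named entry individually through the supplied callable.

    fetch(key, dest) is assumed to behave like quilt3's Bucket.fetch for one file.
    Returns the relative paths that were requested.
    """
    fetched = []
    for rel_path, _size in named_entries(listing):
        fetch(prefix + rel_path, dest_root + rel_path)
        fetched.append(rel_path)
    return fetched


# Demonstration with a stand-in for the real fetch function.
listing = [("", 0), ("empty_but_named.txt", 0), ("image.tif", 2337756)]
calls = []
fetched = fetch_each(
    listing,
    "some/prefix/",
    "file:///tmp/out/",
    lambda key, dest: calls.append((key, dest)),
)
print(fetched)  # ['empty_but_named.txt', 'image.tif']
```

Fetching file by file avoids the directory code path that raised the error, at the cost of one call per file. How the listing is obtained is left open here.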
