Conversation

bnavigator
Contributor

This PR addresses two limitations of the Native API uploader:

  • The Native API does not allow replacing files with the same file content:
>>> import requests
... import json
... from pprint import pprint
...
... from dvuploader.utils import retrieve_dataset_files
...
... # this is used for the API url and the api_token
... from sdc_darus import config as cfg
...
>>> # Check the current content of the dataset:
>>> dspid = 'doi:10.80827/DARUS-2450'
... dsfiles = retrieve_dataset_files(cfg.darus_url, dspid, cfg.api_token)
... pprint(dsfiles)
...
[{'dataFile': {'checksum': {'type': 'MD5',
                            'value': 'f9014100615fe884ed8ec779e60b6021'},
               'contentType': 'application/octet-stream',
               'creationDate': '2025-08-05',
               'description': 'random data initially',
               'fileAccessRequest': False,
               'filename': 'randomdata.bin',
               'filesize': 2048,
               'friendlyType': 'Unknown',
               'id': 16878,
               'md5': 'f9014100615fe884ed8ec779e60b6021',
               'persistentId': 'doi:10.80827/DARUS-2450/1',
               'pidURL': 'https://doi.org/10.80827/DARUS-2450/1',
               'rootDataFileId': -1,
               'storageIdentifier': 's3://dv-stage-1:1987c03afac-baa9f8fa84bd',
               'tabularData': False},
  'datasetVersionId': 2298,
  'description': 'random data initially',
  'label': 'randomdata.bin',
  'restricted': False,
  'version': 1}]
>>> # Replace it with some new data:
... with open('randomdata2.bin',"rb") as fh:
...     file_content = fh.read()
...
>>> filename = dsfiles[0]['dataFile']['filename']
... pid = dsfiles[0]['dataFile']['persistentId']
... contenttype = dsfiles[0]['dataFile']['contentType']
... metadata = {
...     'description': 'This has now new content',
...     'categories': ['Some', 'new', 'tags'],
... }
...
>>> s = requests.Session()
... s.headers.update({'X-Dataverse-key': cfg.api_token})
... r = s.post(
...     f"{cfg.darus_url}api/files/:persistentId/replace",
...     params={'persistentId': pid},
...     files={'jsonData': json.dumps(metadata),
...            'file': (filename, file_content, contenttype)})
...
>>> # This worked because it is new data
>>> r.status_code
200
>>> pprint(r.text)
('{"status":"OK","data":{"files":[{"description":"This has now new '
 'content","label":"randomdata.bin","restricted":false,"version":1,"datasetVersionId":2298,"categories":["new","Some","tags"],"dataFile":{"id":16879,"persistentId":"doi:10.80827/DARUS-2450/1","pidURL":"https://doi.org/10.80827/DARUS-2450/1","filename":"randomdata.bin","contentType":"application/octet-stream","friendlyType":"Unknown","filesize":2048,"description":"This '
 'has now new '
 'content","categories":["new","Some","tags"],"storageIdentifier":"s3://dv-stage-1:1987c079792-8da0ba16dcd3","rootDataFileId":-1,"md5":"013d018061b5c200f34d51bd159178d2","checksum":{"type":"MD5","value":"013d018061b5c200f34d51bd159178d2"},"tabularData":false,"creationDate":"2025-08-05","fileAccessRequest":false}}]}}')
>>> # Now try again with new metadata but same content
>>> metadata['description'] = 'We have new metadata!'
... r = s.post(
...     f"{cfg.darus_url}api/files/:persistentId/replace",
...     params={'persistentId': pid},
...     files={'jsonData': json.dumps(metadata),
...            'file': (filename, file_content, contenttype)})
... r.status_code
...
400
>>> pprint(r.text)
('{"status":"ERROR","message":"Error! You may not replace a file with a file '
 'that has duplicate content."}')
>>>

With this PR, the code checks whether the file content hashes to the same checksum and, in that case, only updates the metadata.
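For illustration, a minimal sketch of that logic, reusing s, cfg, pid, filename, contenttype, metadata, file_content, and dsfiles from the session above. This is a hedged approximation, not the PR's actual implementation; the metadata-only branch goes through the Native API's file metadata endpoint:

import hashlib

local_md5 = hashlib.md5(file_content).hexdigest()
remote_md5 = dsfiles[0]['dataFile']['checksum']['value']

if local_md5 == remote_md5:
    # Same content: /replace would answer with HTTP 400, so only
    # update the file metadata instead.
    r = s.post(
        f"{cfg.darus_url}api/files/:persistentId/metadata",
        params={'persistentId': pid},
        files={'jsonData': json.dumps(metadata)})
else:
    # New content: a regular replace works.
    r = s.post(
        f"{cfg.darus_url}api/files/:persistentId/replace",
        params={'persistentId': pid},
        files={'jsonData': json.dumps(metadata),
               'file': (filename, file_content, contenttype)})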

  • When uploading as a zipped package, files do not get replaced but are added under a -N.ext filename

With this PR, the code uploads the files that replace existing content individually instead of inside a zipped package.
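The core of that strategy can be sketched as a simple partition over the to_replace flag that is visible in the File debug output further down; the helper below is illustrative, not the PR's actual code:

def partition(files):
    # Files flagged for replacement are sent one by one via
    # api/files/{id}/replace, so Dataverse swaps the content instead
    # of adding a file-N.ext duplicate; everything else still goes
    # into the zipped package for bulk upload.
    replacements = [f for f in files if f.to_replace]
    additions = [f for f in files if not f.to_replace]
    return replacements, additions

Each replacement then gets its own POST, as in the snippet under the first bullet, while additions keep the existing packaging path.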

Example Output:

[screenshot: example uploader output]

Also, the restricted property of a file should not get lost when updating the metadata.
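For example, carrying the file's current value into the jsonData of the update should keep it intact. A hypothetical snippet reusing the session objects from the transcript above; the PR's actual handling may differ:

metadata['restrict'] = dsfiles[0]['restricted']
r = s.post(
    f"{cfg.darus_url}api/files/:persistentId/metadata",
    params={'persistentId': pid},
    files={'jsonData': json.dumps(metadata)})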

@bnavigator
Contributor Author

bnavigator commented Aug 6, 2025

Hmm, there seems to be some side effect in the filepath-to-directory_label handling.

@bnavigator
Contributor Author

Actually, no. The same test fails with no code change to main at all:
https://github.com/bnavigator/python-dvuploader/actions/runs/16771171203/job/47486673646

@bnavigator
Contributor Author

There must be a bug in a recent change to the Dataverse test container. Otherwise I cannot fathom why the zip upload now fails to create the correct directoryLabel for subdir/file.txt:

https://github.com/gdcc/python-dvuploader/compare/main...bnavigator:python-dvuploader:main-debug?expand=1
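For context, the convention at play: when a zip is uploaded through the Native API, Dataverse unpacks it and derives each file's directoryLabel from its path inside the archive. A minimal, self-contained illustration (not dvuploader's actual packaging code):

import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    # arcname 'subdir/file.txt' should yield directoryLabel 'subdir'
    zf.writestr('subdir/file.txt', b'Hello, World!')
    # a top-level arcname yields no directoryLabel
    zf.writestr('biggerfile.txt', b'Hello, World!' * 10000)

The debug patch and the resulting test failure: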

diff --git a/dvuploader/nativeupload.py b/dvuploader/nativeupload.py
index 7296c86..4754724 100644
--- a/dvuploader/nativeupload.py
+++ b/dvuploader/nativeupload.py
@@ -91,6 +91,8 @@ async def native_upload(
         "proxy": proxy,
     }
 
+    rich.print(files)
+
     async with httpx.AsyncClient(**session_params) as session:
         with tempfile.TemporaryDirectory() as tmp_dir:
             packages = distribute_files(files)
diff --git a/dvuploader/packaging.py b/dvuploader/packaging.py
index c99d4d1..18bc8f9 100644
--- a/dvuploader/packaging.py
+++ b/dvuploader/packaging.py
@@ -99,6 +99,8 @@ def zip_files(
                 zinfo_or_arcname=_create_arcname(file),
             )
 
+        zip_file.printdir()
+
     return path
=================================== FAILURES ===================================
________________ TestNativeUpload.test_native_upload_by_handler ________________
self = <tests.integration.test_native_upload.TestNativeUpload object at 0x7f22f441fc10>
credentials = ('http://localhost:8080/', 'ffdbdd3d-9886-4ca4-a394-dd515e411085')
    def test_native_upload_by_handler(
        self,
        credentials,
    ):
        BASE_URL, API_TOKEN = credentials
    
        # Arrange
        byte_string = b"Hello, World!"
        files = [
            File(
                filepath="subdir/file.txt",
                handler=BytesIO(byte_string),
                description="This is a test",
            ),
            File(
                filepath="biggerfile.txt",
                handler=BytesIO(byte_string * 10000),
                description="This is a test",
            ),
        ]
    
        # Create Dataset
        pid = create_dataset(
            parent="Root",
            server_url=BASE_URL,
            api_token=API_TOKEN,
        )
    
        # Act
        uploader = DVUploader(files=files)
        uploader.upload(
            persistent_id=pid,
            api_token=API_TOKEN,
            dataverse_url=BASE_URL,
            n_parallel_uploads=1,
        )
    
        # Assert
        expected = [
            ("", "biggerfile.txt"),
            ("subdir", "file.txt"),
        ]
    
        files = retrieve_dataset_files(
            dataverse_url=BASE_URL,
            persistent_id=pid,
            api_token=API_TOKEN,
        )
    
        assert len(files) == 2
    
        for ex_dir, ex_f in expected:
            file = next(file for file in files if file["label"] == ex_f)
    
            assert file["label"] == ex_f, (
                f"File label does not match for file {json.dumps(file)}"
            )
    
>           assert file.get("directoryLabel", "") == ex_dir, (
                f"Directory label does not match for file {json.dumps(file)}"
            )
E           AssertionError: Directory label does not match for file {"description": "", "label": "file.txt", "restricted": false, "version": 1, "datasetVersionId": 4, "categories": ["DATA"], "dataFile": {"id": 16, "persistentId": "", "filename": "file.txt", "contentType": "text/plain", "friendlyType": "Plain Text", "filesize": 13, "description": "", "categories": ["DATA"], "storageIdentifier": "local://1987e805c8a-db2a230316ec", "rootDataFileId": -1, "md5": "65a8e27d8879283831b664bd8b7f0ad4", "checksum": {"type": "MD5", "value": "65a8e27d8879283831b664bd8b7f0ad4"}, "tabularData": false, "creationDate": "2025-08-06", "lastUpdateTime": "2025-08-06T08:29:59Z", "fileAccessRequest": false}}
E           assert '' == 'subdir'
E             
E             - subdir
tests/integration/test_native_upload.py:218: AssertionError
----------------------------- Captured stdout call -----------------------------
╭────────── DVUploader ──────────╮
│ Server: http://localhost:8080/ │
│ PID: doi:10.5072/FK2/KL0M5O    │
│ Files: 2                       │
╰────────────────────────────────╯
🔎 Checking dataset files           
┏━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ File           ┃ Status ┃ Action ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ file.txt       │ New    │ Upload │
│ biggerfile.txt │ New    │ Upload │
└────────────────┴────────┴────────┘
⚠️  Direct upload not supported. Falling back to Native API.
🚀 Uploading files
[
    File(
        filepath='subdir/file.txt',
        handler=<_io.BytesIO object at 0x7f22f2dca250>,
        description='This is a test',
        directory_label='subdir',
        mimeType='application/octet-stream',
        categories=['DATA'],
        restrict=False,
        checksum_type=<ChecksumTypes.MD5: ('MD5', <built-in function 
openssl_md5>)>,
        storageIdentifier=None,
        file_name='file.txt',
        checksum=Checksum(type='MD5', value=None),
        to_replace=False,
        file_id=None,
        tab_ingest=True
    ),
    File(
        filepath='biggerfile.txt',
        handler=<_io.BytesIO object at 0x7f22f2dcb740>,
        description='This is a test',
        directory_label='',
        mimeType='application/octet-stream',
        categories=['DATA'],
        restrict=False,
        checksum_type=<ChecksumTypes.MD5: ('MD5', <built-in function 
openssl_md5>)>,
        storageIdentifier=None,
        file_name='biggerfile.txt',
        checksum=Checksum(type='MD5', value=None),
        to_replace=False,
        file_id=None,
        tab_ingest=True
    )
]
File Name                                             Modified             Size
subdir/file.txt                                2025-08-06 08:29:58           13
biggerfile.txt                                 2025-08-06 08:29:58       130000
(
    'File subdir/file.txt not found in Dataverse repository.',
    'This may be due to the file not being uploaded to the repository:'
)
package_0.zip ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
✅ Upload complete

@JR-1991
Member

JR-1991 commented Aug 6, 2025

@bnavigator, thanks for submitting the PR! Indeed, this is quite strange, and I agree with you. The only thing that comes to mind as a possible reason for its failure could be related to the new Dataverse images. I’ll test it locally and check the logs to see if that helps with debugging.

@JR-1991
Member

JR-1991 commented Aug 6, 2025

Oddly, the tests run smoothly on a local instance, suggesting that the Dataverse Action might be the issue. However, I’m not sure about the exact problem because, as far as I know, the directoryLabel is a column in an SQL table, not an actual file path. @pdurbin, do you have an idea what is wrong here?

[screenshot: tests passing on a local instance]

In this test run, I’ve added an additional test to explicitly check for directoryLabel, even though it’s implicitly covered by other tests. You never know 😄
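Something along these lines, presumably; a hedged guess at what such an explicit check could look like, reusing names from the failing test above (BASE_URL, API_TOKEN, pid). The test actually added in that run may differ:

from dvuploader.utils import retrieve_dataset_files

def assert_directory_labels(base_url, api_token, pid, expected):
    files = retrieve_dataset_files(
        dataverse_url=base_url,
        persistent_id=pid,
        api_token=api_token,
    )
    for ex_dir, ex_f in expected:
        file = next(f for f in files if f["label"] == ex_f)
        assert file.get("directoryLabel", "") == ex_dir

assert_directory_labels(BASE_URL, API_TOKEN, pid,
                        [("subdir", "file.txt"), ("", "biggerfile.txt")])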

Commit: "…directory label. a first run is assured by _validate_files being run before."
@pdurbin
Member

pdurbin commented Aug 18, 2025

> The same test fails with no code change to main at all

Yeah, I just replicated this at https://github.com/gdcc/python-dvuploader/actions/runs/17045098506/job/48318703025?pr=37 via #37

There's nothing weird in my no-op PR, is there? I tried to simply make the same change as @bnavigator, which was just print statements.

@JR-1991
Member

JR-1991 commented Aug 18, 2025

This issue seems to be exclusive to the Dataverse Action: I am using 6.7 locally and the tests run without any problems. Maybe the compose file of the action needs an update? Or is it the same as pre-6.7?

[screenshot: local test run on Dataverse 6.7]

@pdurbin
Member

pdurbin commented Aug 20, 2025

I can reproduce the failure locally, so I created a dedicated issue for it:
