Skip to content

added logic to create file references and added logging#4

Merged
sherwoodf merged 2 commits intomainfrom
file_ref_maker
Apr 17, 2025
Merged

added logic to create file references and added logging#4
sherwoodf merged 2 commits intomainfrom
file_ref_maker

Conversation

@sherwoodf
Copy link
Contributor

@sherwoodf sherwoodf requested a review from kbab April 15, 2025 10:48
file_path: str, study_uuid: str, dataset_uuid: str, crate_path: pathlib.Path
) -> list[APIModels.FileReference]:

relative_path = pathlib.Path(file_path).relative_to(crate_path).as_posix()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using as_posix because i'm assuming our system will be posix & we don't want this (and therefore the UUID) to vary depending on whether a windows system was used to ingest the objects.

Copy link

@kbab kbab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments - also not sure if you are currently writing to an API. I think the API versions start from 0 not 1. If this is the case the version in the functions converting ROCrate models to API ones needs to change.

"file_path": str(relative_path),
"version": 1,
"size_in_bytes": pathlib.Path(file_path).stat().st_size,
"format": pathlib.Path(file_path).suffix,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting. I think in the original bia_shared_models this property was from biostudies - which distinguished between file / directory (and I think we also used this in the past to flag files in zip archives).

During biostudies ingest we take this from the BioStudiesAPIFile object and its value is usually 'file'. However, EMPIAR ingest populates this with the suffix of the file path (compare value of format from example from biostudies ingest with EMPIAR ingest).

If we decide to go this way it would be useful to know whether to identify special suffixes e.g. (ome.zarr, nii.gz, ome.zarr.zip, zarr.zip, ome.tiff) and whether to standardise suffixes (e.g. TIF vs tif vs tiff vs TIFF) - we have a function for this when creating images/image_representations

assert cli_out == expected_out
# Account for different ordering of JSON objects due to file reference order being somewhat arbitrary.
assert len(cli_out) == len(expected_out)
for json_obj in cli_out:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could there be a scenario where two identical objects are created instead of two distinct ones? e.g. If we have FileReference1 twice instead of FileReference1 and FileReference2. In such a scenario the above will pass the test. However, if the for loop and following assertion explicitly go through all expected objects, the test will fail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh interesting edge case - will update as you suggest

@sherwoodf sherwoodf requested a review from kbab April 17, 2025 13:57
Copy link

@kbab kbab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sherwoodf sherwoodf merged commit 9f18380 into main Apr 17, 2025
4 of 6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants