Add more wrappers for ISCC-SUM #162
base: master
…t from file or text
…on in similarity check; add new test data file
@maartenpaul Do you need any help with this?
@kostrykin Thanks, I'm discussing with @etzm and the others from ISCC which tools to implement for ISCC in Galaxy.
Clarify the importance of dataset verification and remove redundant section on combined mode use case.
help txt
@kostrykin I have made some changes to the tools so it is possible to also calculate an ISCC for an entire file collection. Could you please review?
kostrykin left a comment
Much thanks!
Some comments inside.
<param name="file1" type="data" format="data" label="First file"/>
<param name="file2" type="data" format="data" label="Second file"/>
The tests only cover cases for working with images, right? As of today, is it supposed to work with other data too? If so, it'd make sense to add some more tests with other data. Otherwise, I'd suggest restricting the input formats to images for now, to avoid negative user experiences:
- <param name="file1" type="data" format="data" label="First file"/>
- <param name="file2" type="data" format="data" label="Second file"/>
+ <param name="file1" type="data" format="tiff,png" label="First file"/>
+ <param name="file2" type="data" format="tiff,png" label="Second file"/>
Same also in lines 63, 66, 69, in tools/iscc-sum/iscc_verify.xml, and tools/iscc-sum/iscc_sum.xml.
Yes, it should work for any data. I have a FASTA file in the test data and can make sure it is used in more of the tests.
<param name="file1" value="test1.png"/>
<param name="file2" value="test1.png"/>
</conditional>
<param name="threshold" value="12"/>
For tighter testing, wouldn't it make sense to set this threshold to 0, since the two inputs are identical?
In the ISCC-SUM CLI tool, the default threshold for similarity is set to 12. It only returns the similarity of files when they are similar. As we discussed below, I will revise the output. I think it should always output the similarity score for all files.
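To illustrate the threshold behaviour discussed here, below is a minimal Python sketch. It assumes that the similarity score is a Hamming distance between equal-length byte strings and that a pair is reported when the distance is at most the threshold; neither detail is stated in this thread. `hamming_distance` and `is_reported` are hypothetical helpers, not part of ISCC-SUM, and the digests are made up.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the differing bits between two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def is_reported(a: bytes, b: bytes, threshold: int = 12) -> bool:
    """Assumed rule: a pair is reported only if its distance is <= threshold."""
    return hamming_distance(a, b) <= threshold


# Made-up 8-byte digests for illustration only (not real ISCC codes).
digest = bytes.fromhex("0123456789abcdef")
flipped = bytes.fromhex("0123456789abcdee")  # differs from `digest` in one bit
```

Under these assumptions, identical inputs have distance 0 and would still be reported with a threshold of 0, whereas a single flipped bit would not. That is the sense in which a threshold of 0 would make the identical-file test stricter.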
<param name="threshold" value="12"/>
<output name="output_file">
    <assert_contents>
        <has_text text="~00"/>
The output of the tool for this test case is:
ISCC:K4AOMGOGQJA4Y46PAC4YPPA63GKD5RVFPR7FU3I4OOEW44TYXNYOTMY *input_files/test1
ISCC:K4AGSPOSB5SS2X427WZ27QASTSBVTS55DXLMFDF7WOJKEOSTDEI3OXQ *input_files/test2
~00 ISCC:K4AGSPOSB5SS2X427WZ27QASTSBVTS55DXLMFDF7WOJKEOSTDEI3OXQ *input_files/test2 (1)
We can make the tests a bit stricter:
- <has_text text="~00"/>
+ <has_line_matching expression="^ISCC:[A-Z0-9]+ \*input_files/file1$"/>
+ <has_line_matching expression="^ISCC:[A-Z0-9]+ \*input_files/file2$"/>
+ <has_line_matching expression="^ +\~00 ISCC:[A-Z0-9]+ \*input_files/file3$"/>
I guess something similar can be done for lines 94 and 137.
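As a rough way to try out line-based patterns like these outside of Galaxy, here is a small Python sketch. `has_line_matching` below is a hypothetical stand-in that only approximates a "some whole line matches" check; it is not Galaxy's implementation. The patterns are adapted (an assumption) to the three lines exactly as quoted above, including the file names and the trailing `(1)`, so they are not a drop-in replacement for the suggested expressions; the real tool output may differ, for example in leading whitespace.

```python
import re

# The three output lines quoted in the comment above, copied verbatim.
OUTPUT = (
    "ISCC:K4AOMGOGQJA4Y46PAC4YPPA63GKD5RVFPR7FU3I4OOEW44TYXNYOTMY *input_files/test1\n"
    "ISCC:K4AGSPOSB5SS2X427WZ27QASTSBVTS55DXLMFDF7WOJKEOSTDEI3OXQ *input_files/test2\n"
    "~00 ISCC:K4AGSPOSB5SS2X427WZ27QASTSBVTS55DXLMFDF7WOJKEOSTDEI3OXQ *input_files/test2 (1)\n"
)


def has_line_matching(text: str, expression: str) -> bool:
    """Hypothetical stand-in: True if at least one whole line matches `expression`."""
    pattern = re.compile(expression)
    return any(pattern.fullmatch(line) for line in text.splitlines())


# Patterns adapted (assumption) to the lines exactly as quoted above.
PATTERNS = [
    r"^ISCC:[A-Z0-9]+ \*input_files/test1$",
    r"^ISCC:[A-Z0-9]+ \*input_files/test2$",
    r"^ *~00 ISCC:[A-Z0-9]+ \*input_files/test2 \(1\)$",
]

# One boolean per pattern: does any line of OUTPUT match it?
results = [has_line_matching(OUTPUT, p) for p in PATTERNS]
```

Compared with a plain substring check for `~00`, anchored per-line patterns also pin down which dataset each line refers to, so a test would fail if a line were missing or attributed to the wrong input.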
<conditional name="input_type">
    <param name="input_selector" type="select" label="Input type">
        <option value="single">Single file</option>
        <option value="collection">Collection of files</option>
    </param>
    <when value="single">
        <param name="input_file" type="data" format="data" label="File to verify"/>
    </when>
    <when value="collection">
        <param name="input_collection" type="data_collection" collection_type="list" format="data"
               label="Collection to verify"
               help="Collection of files - will generate combined ISCC hash"/>
    </when>
</conditional>
A little follow-up on #162 (comment).
If you're open to further simplification, you can actually use a single input, like
- <conditional name="input_type">
-     <param name="input_selector" type="select" label="Input type">
-         <option value="single">Single file</option>
-         <option value="collection">Collection of files</option>
-     </param>
-     <when value="single">
-         <param name="input_file" type="data" format="data" label="File to verify"/>
-     </when>
-     <when value="collection">
-         <param name="input_collection" type="data_collection" collection_type="list" format="data"
-                label="Collection to verify"
-                help="Collection of files - will generate combined ISCC hash"/>
-     </when>
- </conditional>
+ <param name="input" type="data" format="data" multiple="true" label="Dataset(s) to verify"/>
and completely avoid the conditional (docs).
This allows using a single dataset, multiple datasets, or a collection.
<conditional name="input_type">
    <param name="input_selector" type="select" label="Input type">
        <option value="single">Single file</option>
        <option value="collection">Collection of files</option>
    </param>
    <when value="single">
        <param name="input_file" type="data" format="data" label="Input File"
               help="Any file type - ISCC-SUM will generate a checksum and similarity hash"/>
    </when>
    <when value="collection">
        <param name="input_collection" type="data_collection" collection_type="list" format="data"
               label="Input File Collection"
               help="A collection of files - ISCC-SUM will generate checksum and similarity hash"/>
        <param name="calculate_individual" type="boolean" truevalue="true" falsevalue="false" checked="true"
               label="Calculate ISCC hash for each file individually"
               help="If selected, generates one ISCC hash per file. If not selected, calculates a single ISCC hash for all files combined."/>
    </when>
</conditional>
See my comment https://github.com/BMCV/galaxy-image-analysis/pull/162/files#r2580605041 on tools/iscc-sum/iscc_verify.xml lines 59–72
**What it does**
- Generates an International Standard Content Code (ISCC) based checksum and similarity hash from any input file.
+ Generates an International Standard Content Code (ISCC) based checksum and similarity hash from a single file or collection of files.
I'd change the terminology from "files" to "datasets", which is more common in Galaxy.
<help><![CDATA[
**What it does**
Verifies that a file or collection matches an expected ISCC hash for exact checksum verification.
- Verifies that a file or collection matches an expected ISCC hash for exact checksum verification.
+ Verifies that a file or collection matches an expected International Standard Content Code (ISCC) hash for exact checksum verification.
Also, for the |
Co-authored-by: Leonid Kostrykin <[email protected]>
@kostrykin Thanks for your comments. I will try to work on those. I agree that the output of the ISCC similarity tool is not very convenient. I guess we should limit the input to one-to-one and one-to-many comparisons to avoid complex output. Then we can put it in a table.
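The table idea could look roughly like the following Python sketch, which turns the grouped output quoted earlier in this thread into rows of reference, similar dataset, and distance. The layout rule (a line starting with `ISCC:` names a reference, and each following `~NN` line is a dataset similar to it) is an assumption based only on those three quoted lines, and `to_table` is a hypothetical helper, not part of the tools in this PR.

```python
from typing import List, Tuple

# Output lines quoted earlier in this thread, copied verbatim.
OUTPUT = (
    "ISCC:K4AOMGOGQJA4Y46PAC4YPPA63GKD5RVFPR7FU3I4OOEW44TYXNYOTMY *input_files/test1\n"
    "ISCC:K4AGSPOSB5SS2X427WZ27QASTSBVTS55DXLMFDF7WOJKEOSTDEI3OXQ *input_files/test2\n"
    "~00 ISCC:K4AGSPOSB5SS2X427WZ27QASTSBVTS55DXLMFDF7WOJKEOSTDEI3OXQ *input_files/test2 (1)\n"
)


def to_table(text: str) -> List[Tuple[str, str, int]]:
    """Turn grouped output into (reference, similar_dataset, distance) rows.

    Assumed layout: a line starting with "ISCC:" names a reference, and each
    following line starting with "~NN" is a dataset similar to that reference.
    """
    rows: List[Tuple[str, str, int]] = []
    reference = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Everything after the first " *" is treated as the dataset name.
        left, name = line.split(" *", 1)
        if left.startswith("~"):
            if reference is None:
                raise ValueError("similarity line without a preceding reference")
            distance = int(left.split()[0].lstrip("~"))
            rows.append((reference, name, distance))
        else:
            reference = name
    return rows


table = to_table(OUTPUT)
# Tab-separated rendering, one row per (reference, similar dataset) pair.
tsv = "\n".join("\t".join(map(str, row)) for row in table)
```

With a one-to-many comparison, each row has the same reference in the first column, which keeps the table flat; references without any similar dataset simply produce no rows in this sketch.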
Good question. If the input is a list of datasets, then you might identify the datasets/files by their position in the collection. For collections in general, there also is the Maybe @bgruening can add something on this?
This PR adds two new tools to the ISCC Galaxy suite: a tool to verify whether a file matches an ISCC hash, and a tool to compare two files or a set of files with each other.
Check-list for the contributor
Please make sure you have read the CONTRIBUTING.md document (last updated: 2024/04/23).
Please fill out if applicable:
If this PR adds or updates a tool or tool collection:
Guidelines for the contributor
This section is cited from the Naming and Annotation Conventions for Tools in the Image Community in Galaxy.
Naming
Generally, the name of Galaxy tools in our community should be expressive and concise, while stating the purpose of the tool as precisely as possible. Consistency of the namings of Galaxy tools is important to ensure they can be found easily. To maintain consistency, we consider phrasing names as imperatives a good practice, such as "Analyze particles" or "Perform segmentation using watershed transformation". An acknowledged exception from this rule is the names of tool wrappers of major tool suites, where the name of a tool wrapper should be chosen identically to the module or function of the tool which is wrapped (e.g., "MaskImage" in CellProfiler).
Tool description
If a Galaxy tool is a thin tool wrapper (e.g., part of a major tool suite), then the name of the wrapped tool (and only the name of the wrapped tool, subsequent to the term "with" as in "with Bioformats") should be used as the description of the tool (further examples include "with CellProfiler", "with ImageJ2", "with ImageMagick", "with SpyBOAT", "with SuperDSM"). This ensures that the tool is found by typing the name of the wrapped tool into the "Search" field on the Galaxy interface. The tool description should be empty if a tool is either not part of a major tool suite, or the main functionality of the tool is implemented in the wrapper.
Annotations
We point out that there is a global list of precedential annotations with Bio.tools identifiers (Ison et al., 2019) in Galaxy (see mappings), which may outweigh the annotations made in the XML specification of a Galaxy tool (and thus the annotations of a tool reported within the web interface of Galaxy might be divergent). However, since the precedential annotations are subject to possible changes and to avoid redundant work, we do not aim to reflect those in our specifications (those which we make in the XML specifications of Galaxy tools).