@maartenpaul
Contributor

@maartenpaul maartenpaul commented Nov 6, 2025

This PR adds two new tools to the ISCC Galaxy suite: one that verifies whether a file matches an ISCC hash, and one that compares two files, or a set of files, with each other.


Check-list for the contributor

Please make sure you have read the CONTRIBUTING.md document (last updated: 2024/04/23).

Please fill out if applicable:

  • License permits unrestricted use (educational + commercial).

If this PR adds or updates a tool or tool collection:

  • This PR adds a new tool or tool collection.
  • This PR updates an existing tool or tool collection.
  • Tools added/updated by this PR comply with the Guidelines below (or explain why they do not).

Guidelines for the contributor

This section is cited from the Naming and Annotation Conventions for Tools in the Image Community in Galaxy.

Naming

Generally, the name of Galaxy tools in our community should be expressive and concise, while stating the purpose of the tool as precisely as possible. Consistency of the namings of Galaxy tools is important to ensure they can be found easily. To maintain consistency, we consider phrasing names as imperatives a good practice, such as "Analyze particles" or "Perform segmentation using watershed transformation". An acknowledged exception from this rule is the names of tool wrappers of major tool suites, where the name of a tool wrapper should be chosen identically to the module or function of the tool which is wrapped (e.g., "MaskImage" in CellProfiler).

Tool description

If a Galaxy tool is a thin tool wrapper (e.g., part of a major tool suite), then the name of the wrapped tool (and only the name of the wrapped tool, subsequent to the term "with" as in "with Bioformats") should be used as the description of the tool (further examples include "with CellProfiler", "with ImageJ2", "with ImageMagick", "with SpyBOAT", "with SuperDSM"). This ensures that the tool is found by typing the name of the wrapped tool into the "Search" field on the Galaxy interface. The tool description should be empty if a tool is either not part of a major tool suite, or the main functionality of the tool is implemented in the wrapper.

Annotations

We point out that there is a global list of precedential annotations with Bio.tools identifiers (Ison et al., 2019) in Galaxy (see mappings), which may outweigh the annotations made in the XML specification of a Galaxy tool (and thus the annotations of a tool reported within the web interface of Galaxy might be divergent). However, since the precedential annotations are subject to possible changes and to avoid redundant work, we do not aim to reflect those in our specifications (those which we make in the XML specifications of Galaxy tools).

@kostrykin
Member

@maartenpaul Do you need any help with this?

@maartenpaul
Contributor Author

@kostrykin Thanks, I'm discussing with @etzm and the others from ISCC which tools to implement for ISCC in Galaxy.
I would therefore like to release this extended ISCC suite together with the tutorial (galaxyproject/training-material#6460), so that the use cases of the different tools are clear.

@maartenpaul maartenpaul marked this pull request as ready for review November 28, 2025 14:34
@maartenpaul
Contributor Author

@kostrykin I have made some changes to the tools so it is possible to also calculate an ISCC for an entire file collection. Could you please review?

@kostrykin kostrykin self-assigned this Dec 2, 2025
@kostrykin kostrykin changed the title from "Iscc tools" to "Add more wrappers for ISCC-SUM" Dec 2, 2025
Member

@kostrykin kostrykin left a comment

Many thanks!

Some comments inside.

Comment on lines +58 to +59
<param name="file1" type="data" format="data" label="First file"/>
<param name="file2" type="data" format="data" label="Second file"/>
Member

The tests only cover cases with images, right? As of today, is the tool supposed to work with other data too? If so, it would make sense to add some more tests with other data. Otherwise, I'd suggest restricting the input formats to images for now, to avoid negative user experiences:

Suggested change
- <param name="file1" type="data" format="data" label="First file"/>
- <param name="file2" type="data" format="data" label="Second file"/>
+ <param name="file1" type="data" format="tiff,png" label="First file"/>
+ <param name="file2" type="data" format="tiff,png" label="Second file"/>

The same applies to lines 63, 66, and 69, as well as to tools/iscc-sum/iscc_verify.xml and tools/iscc-sum/iscc_sum.xml.

Contributor Author

Yes, it should work for any data. I have a FASTA file in the test data and can make sure it is used in more of the tests.

<param name="file1" value="test1.png"/>
<param name="file2" value="test1.png"/>
</conditional>
<param name="threshold" value="12"/>
Member

For tighter testing, wouldn't it make sense to set this threshold to 0, since the two inputs are identical?

Contributor Author

In the ISCC-SUM CLI tool, the default similarity threshold is 12, and the similarity is only reported for files that are similar. As we discussed below, I will revise the output. I think it should always report the similarity score for all files.
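
To make the threshold discussion concrete, here is a minimal Python sketch. It assumes that the similarity score is a Hamming distance in bits and that a pair is reported when its distance is at most the threshold; neither is confirmed in this thread. The function names and the hash values are hypothetical and are not part of ISCC-SUM.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the differing bits between two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def is_reported(a: bytes, b: bytes, threshold: int = 12) -> bool:
    """Assumption: a pair is reported if its distance is <= threshold."""
    return hamming_distance(a, b) <= threshold


# Hypothetical 8-byte similarity hashes, not real ISCC values.
h1 = bytes.fromhex("a1b2c3d4e5f60718")
h2 = bytes.fromhex("a1b2c3d4e5f60719")  # differs from h1 in exactly one bit
h3 = bytes(x ^ 0xFF for x in h1)        # every bit flipped

print(is_reported(h1, h1, threshold=0))  # identical inputs pass even at 0
print(is_reported(h1, h2, threshold=0))  # a one-bit difference does not
print(is_reported(h1, h3))               # far beyond the default of 12
```

Under these assumptions, a test with two identical inputs still passes at a threshold of 0, which is why the lower value makes the test tighter.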

<param name="threshold" value="12"/>
<output name="output_file">
<assert_contents>
<has_text text="~00"/>
Member

The output of the tool for this test case is:

ISCC:K4AOMGOGQJA4Y46PAC4YPPA63GKD5RVFPR7FU3I4OOEW44TYXNYOTMY *input_files/test1
ISCC:K4AGSPOSB5SS2X427WZ27QASTSBVTS55DXLMFDF7WOJKEOSTDEI3OXQ *input_files/test2
  ~00 ISCC:K4AGSPOSB5SS2X427WZ27QASTSBVTS55DXLMFDF7WOJKEOSTDEI3OXQ *input_files/test2 (1)

We can make the tests a bit stricter:

Suggested change
- <has_text text="~00"/>
+ <has_line_matching expression="^ISCC:[A-Z0-9]+ \*input_files/file1$"/>
+ <has_line_matching expression="^ISCC:[A-Z0-9]+ \*input_files/file2$"/>
+ <has_line_matching expression="^ +\~00 ISCC:[A-Z0-9]+ \*input_files/file3$"/>

I guess something similar can be done for lines 94 and 137.
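
The difference between the substring check and the per-line patterns can be tried outside Galaxy with a small Python sketch. It assumes that a line-matching assertion amounts to applying the regular expression to each output line separately; the helper `has_line_matching` below is a stand-in written for this illustration, not Galaxy's implementation. The patterns are adapted to the file names in the quoted output (`test1`, `test2`) and to its trailing `(1)`, which is also an assumption.

```python
import re

# Tool output quoted in the review comment above.
OUTPUT = """\
ISCC:K4AOMGOGQJA4Y46PAC4YPPA63GKD5RVFPR7FU3I4OOEW44TYXNYOTMY *input_files/test1
ISCC:K4AGSPOSB5SS2X427WZ27QASTSBVTS55DXLMFDF7WOJKEOSTDEI3OXQ *input_files/test2
  ~00 ISCC:K4AGSPOSB5SS2X427WZ27QASTSBVTS55DXLMFDF7WOJKEOSTDEI3OXQ *input_files/test2 (1)
"""

# Hypothetical adaptations of the suggested expressions.
STRICT_PATTERNS = [
    r"^ISCC:[A-Z0-9]+ \*input_files/test1$",
    r"^ISCC:[A-Z0-9]+ \*input_files/test2$",
    r"^ +~00 ISCC:[A-Z0-9]+ \*input_files/test2 \(1\)$",
]


def has_line_matching(output: str, expression: str) -> bool:
    """Return True if at least one line of the output matches the expression."""
    return any(re.match(expression, line) for line in output.splitlines())


# An unrelated message that merely happens to contain the substring "~00".
BOGUS = "error: unexpected token ~00\n"

print("~00" in BOGUS)  # the loose substring check accepts this output
print(any(has_line_matching(BOGUS, p) for p in STRICT_PATTERNS))  # strict ones do not
print(all(has_line_matching(OUTPUT, p) for p in STRICT_PATTERNS))
```

The per-line patterns pin down the structure of each output line, so output that merely contains `~00` somewhere no longer satisfies the test.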

Comment on lines +59 to +72
<conditional name="input_type">
<param name="input_selector" type="select" label="Input type">
<option value="single">Single file</option>
<option value="collection">Collection of files</option>
</param>
<when value="single">
<param name="input_file" type="data" format="data" label="File to verify"/>
</when>
<when value="collection">
<param name="input_collection" type="data_collection" collection_type="list" format="data"
label="Collection to verify"
help="Collection of files - will generate combined ISCC hash"/>
</when>
</conditional>
Member

A little follow-up on #162 (comment).

If you're open to further simplification, you can actually use a single input, like

Suggested change
- <conditional name="input_type">
- <param name="input_selector" type="select" label="Input type">
- <option value="single">Single file</option>
- <option value="collection">Collection of files</option>
- </param>
- <when value="single">
- <param name="input_file" type="data" format="data" label="File to verify"/>
- </when>
- <when value="collection">
- <param name="input_collection" type="data_collection" collection_type="list" format="data"
- label="Collection to verify"
- help="Collection of files - will generate combined ISCC hash"/>
- </when>
- </conditional>
+ <param name="input" type="data" format="data" multiple="true" label="Dataset(s) to verify"/>

and completely avoid the conditional (docs).

This allows using a single dataset, multiple datasets, or a collection:

Image

Comment on lines +43 to +60
<conditional name="input_type">
<param name="input_selector" type="select" label="Input type">
<option value="single">Single file</option>
<option value="collection">Collection of files</option>
</param>
<when value="single">
<param name="input_file" type="data" format="data" label="Input File"
help="Any file type - ISCC-SUM will generate a checksum and similarity hash"/>
</when>
<when value="collection">
<param name="input_collection" type="data_collection" collection_type="list" format="data"
label="Input File Collection"
help="A collection of files - ISCC-SUM will generate checksum and similarity hash"/>
<param name="calculate_individual" type="boolean" truevalue="true" falsevalue="false" checked="true"
label="Calculate ISCC hash for each file individually"
help="If selected, generates one ISCC hash per file. If not selected, calculates a single ISCC hash for all files combined."/>
</when>
</conditional>
Member

See my comment https://github.com/BMCV/galaxy-image-analysis/pull/162/files#r2580605041 on tools/iscc-sum/iscc_verify.xml lines 59–72

**What it does**
- Generates an International Standard Content Code (ISCC) based checksum and similarity hash from any input file.
+ Generates an International Standard Content Code (ISCC) based checksum and similarity hash from a single file or collection of files.
Member

I'd change the terminology from "files" to "datasets", which is more common in Galaxy.

<help><![CDATA[
**What it does**
Verifies that a file or collection matches an expected ISCC hash for exact checksum verification.
Member

Suggested change
- Verifies that a file or collection matches an expected ISCC hash for exact checksum verification.
+ Verifies that a file or collection matches an expected International Standard Content Code (ISCC) hash for exact checksum verification.

@kostrykin
Member

Also, for the iscc_similarity tool, the output doesn't look very easy to work with in subsequent processing, does it? In Galaxy, many tools can cope with tabular files (aka TSV: like CSV, but with tabs as delimiters). Would it be difficult to change the output format to something tabular?

@maartenpaul
Contributor Author

@kostrykin Thanks for your comments. I will try to work on those. For the output of the ISCC similarity tool, I agree that it is not very convenient. I guess we should limit the input to one-to-one and one-to-many comparisons to avoid complex output. Then we can put it in a table.
One of the issues with the output is that Galaxy does not internally use the original filename, but a random (?) name. What would be the best way to identify the files by their name in the output, so that the output can easily be used in a workflow?
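
As a rough illustration of the table idea, the Python sketch below converts output in the format quoted earlier in this review into tab-separated rows. It assumes that each indented `~NN` line is a match for the closest preceding unindented line and that `NN` is the similarity score; both are assumptions about the CLI output. The function name, column names, and sample data (with shortened, made-up codes) are hypothetical.

```python
import csv
import io
import re

# Unindented line: a reference file. Indented "~NN" line: a match for it.
REFERENCE = re.compile(r"^ISCC:(?P<code>[A-Z0-9]+) \*(?P<name>.+)$")
MATCH = re.compile(r"^ +~(?P<distance>\d+) ISCC:(?P<code>[A-Z0-9]+) \*(?P<name>.+)$")


def to_tabular(text: str) -> str:
    """Convert similarity output into TSV with one row per reported match."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["reference", "match", "distance"])
    reference = None
    for line in text.splitlines():
        if m := REFERENCE.match(line):
            reference = m["name"]
        elif (m := MATCH.match(line)) and reference is not None:
            writer.writerow([reference, m["name"], int(m["distance"])])
    return out.getvalue()


# Made-up sample with shortened codes, not real ISCC values.
SAMPLE = "ISCC:AAAA *a.png\nISCC:BBBB *b.png\n  ~00 ISCC:BBBB *c.png\n"
print(to_tabular(SAMPLE), end="")
```

In this sketch, a reference without any reported match (`a.png` in the sample) produces no row, so always reporting a score for every pair would still need support from the tool itself.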

@kostrykin
Member

> One of the issues with the output is that Galaxy does internally not use the original filename, but a random (?) name. What would be the best way to identify the files by their name in the output? So the output could be used in a workflow easily.

Good question. If the input is a list of datasets, then you might identify the datasets/files by their position in the collection.

For collections in general, there is also .element_identifier, which can be used in the Cheetah code of your tool. I think there is also .id, which you can use for input datasets in the Cheetah code, but I can't find any docs on that right now.

Maybe @bgruening can add something on this?

@bgruening
Collaborator

I think element_identifier is the correct thing to use here. In the non-collection case, element_identifier will fall back to the dataset name.
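
One way to picture this is to link each input into the working directory under a name derived from its identifier, so that the names in the tool output become meaningful. In a Galaxy wrapper this would normally be done in the Cheetah command section; the Python sketch below only illustrates the idea. The function `stage_inputs`, the sanitization rule, and the file names are hypothetical.

```python
import re
import tempfile
from pathlib import Path


def stage_inputs(datasets: dict[str, Path], workdir: Path) -> dict[str, Path]:
    """Symlink each dataset into workdir under a name derived from its identifier."""
    workdir.mkdir(parents=True, exist_ok=True)
    staged = {}
    for identifier, source in datasets.items():
        # Replace characters that are unsafe in file names (assumed rule).
        safe = re.sub(r"[^A-Za-z0-9._-]", "_", identifier)
        link = workdir / safe
        # Raises FileExistsError if two identifiers collide after sanitizing.
        link.symlink_to(source.resolve())
        staged[identifier] = link
    return staged


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for an internal dataset path with an uninformative name.
    source = Path(tmp) / "dataset_42.dat"
    source.write_text("example")
    staged = stage_inputs({"sample A.png": source}, Path(tmp) / "input_files")
    print(staged["sample A.png"].name)
```

Identifiers can contain characters that are not safe in file names, which is why the sketch sanitizes them before creating the links.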
