Open
Changes from 9 commits
24 commits
609362e
Add ISCC-SUM tools for similarity comparison and checksum verification
maartenpaul Nov 6, 2025
554e346
Enhance ISCC checksum verification tool to support expected code inpu…
maartenpaul Nov 6, 2025
3dbf198
Rename tool to "Verify ISCC hash" and update test input parameter for…
maartenpaul Nov 6, 2025
757be8a
Rename tool to "Compare ISCC hash similarity" and update test asserti…
maartenpaul Nov 6, 2025
402ef65
Add ISCC hash to test data file for checksum verification
maartenpaul Nov 6, 2025
3eccef8
Update tool version to 0.1.1 in macros.xml
maartenpaul Nov 6, 2025
5c00880
Update tool version to 0.1.1 in macros.xml
maartenpaul Nov 6, 2025
e2e23e6
Merge branch 'iscc-tools' of https://github.com/maartenpaul/galaxy-im…
maartenpaul Nov 6, 2025
66e3119
add extra test files
maartenpaul Nov 7, 2025
9f14bed
More consistent naming of ISCC hash
maartenpaul Nov 12, 2025
0b99e9d
Add support for single file and collection input modes in ISCC-SUM tool
maartenpaul Nov 28, 2025
0a2a791
Restructure the ISCC tools to serve different purposes
maartenpaul Nov 28, 2025
57e50ac
Update README for dataset similarity and combined mode
maartenpaul Nov 28, 2025
11487b0
resolve issues with markdown in
maartenpaul Nov 28, 2025
06908a0
Merge branch 'iscc-tools' of https://github.com/maartenpaul/galaxy-im…
maartenpaul Nov 28, 2025
0d90677
update readme
maartenpaul Nov 28, 2025
29619f8
update README
maartenpaul Nov 28, 2025
1863f57
Improve error handling in ISCC hash verification
maartenpaul Nov 28, 2025
916fc64
Fix formatting in ISCC similarity tool documentation
maartenpaul Nov 28, 2025
aea916f
Fix formatting in ISCC-SUM documentation for clarity
maartenpaul Nov 28, 2025
5ed6929
Refactor expected ISCC hash input handling and improve documentation …
maartenpaul Nov 28, 2025
136ac4f
Update tools/iscc-sum/iscc_similarity.xml
maartenpaul Dec 2, 2025
c6fdf72
Update tools/iscc-sum/iscc_similarity.xml
maartenpaul Dec 3, 2025
1a7cfce
Update tools/iscc-sum/iscc_similarity.xml
maartenpaul Dec 3, 2025
131 changes: 131 additions & 0 deletions tools/iscc-sum/iscc_similarity.xml
@@ -0,0 +1,131 @@
<tool id="iscc_sum_compare" name="Compare ISCC hash similarity" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="24.1">
<description>with ISCC-SUM</description>
<macros>
<import>macros.xml</import>
<import>creators.xml</import>
</macros>
<expand macro="requirements" />
<expand macro="version_command" />
<creator>
<expand macro="creators/iscc" />
<expand macro="creators/lco" />
<expand macro="creators/maartenpaul" />
<expand macro="creators/etzm" />
</creator>

<command detect_errors="exit_code"><![CDATA[
#if $input_type.input_selector == "two_files":
## Pairwise comparison
iscc-sum --similar --threshold $threshold '$input_type.file1' '$input_type.file2' > '${output_file}'
#else:
## Collection comparison
mkdir -p input_files &&
#for $file in $input_type.file_collection:
ln -s '$file' 'input_files/$file.element_identifier' &&
#end for
iscc-sum --similar --threshold $threshold input_files/* > '${output_file}'
#end if
]]></command>

<inputs>
<conditional name="input_type">
<param name="input_selector" type="select" label="Input type">
<option value="two_files">Compare two files</option>
<option value="collection">Find similar files in collection</option>
</param>
<when value="two_files">
<param name="file1" type="data" format="data" label="First file"/>
<param name="file2" type="data" format="data" label="Second file"/>
Comment on lines +58 to +59
Member

The tests only cover images, right? As of today, is the tool supposed to work with other data too? If so, it would make sense to add more tests with other data. Otherwise, I'd suggest restricting the input formats to images for now, to avoid negative user experiences:

Suggested change
- <param name="file1" type="data" format="data" label="First file"/>
- <param name="file2" type="data" format="data" label="Second file"/>
+ <param name="file1" type="data" format="tiff,png" label="First file"/>
+ <param name="file2" type="data" format="tiff,png" label="Second file"/>

The same applies to lines 63, 66, and 69, and to tools/iscc-sum/iscc_verify.xml and tools/iscc-sum/iscc_sum.xml.

Contributor Author

Yes, it should work for any data. I have a FASTA file in the test data and can make sure it is used in more of the tests.

</when>
<when value="collection">
<param name="file_collection" type="data_collection" collection_type="list"
format="data" label="File collection to compare"/>
</when>
</conditional>
<param name="threshold" type="integer" value="12" min="0" max="256"
label="Similarity threshold (Hamming distance)"
help="Maximum Hamming distance for similarity matching. 0-5: Nearly identical, 6-12: Likely similar (default), 13-20: Probably somewhat similar"/>
</inputs>

<outputs>
<data name="output_file" format="txt" label="${tool.name} on ${on_string}"/>
</outputs>

<tests>
<!-- Test pairwise comparison -->
<test expect_num_outputs="1">
<conditional name="input_type">
<param name="input_selector" value="two_files"/>
<param name="file1" value="test1.png"/>
<param name="file2" value="test1.png"/>
</conditional>
<param name="threshold" value="12"/>
Member

For tighter testing, wouldn't it make sense to set this threshold to 0, since the two inputs are identical?

Contributor Author

In the ISCC-SUM CLI tool, the default similarity threshold is 12. The tool only reports a similarity score when the files fall within that threshold. As we discussed below, I will revise the output; I think it should always report the similarity score for all files.
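
For reference, a minimal sketch of the thresholding discussed in this thread, assuming the threshold is an inclusive upper bound on the Hamming distance (as the parameter help text, "Maximum Hamming distance", suggests). The helper names and the byte strings are hypothetical stand-ins; how `iscc-sum` actually extracts and compares Data-Code bits is not shown here.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length byte strings differ."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    # XOR leaves a 1 bit wherever the inputs differ; count those bits.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def is_similar(a: bytes, b: bytes, threshold: int = 12) -> bool:
    """Assumption: 'similar' means distance <= threshold (inclusive)."""
    return hamming_distance(a, b) <= threshold


# Identical inputs have distance 0, so they pass even with threshold=0.
same = b"\xaa\xbb"
print(hamming_distance(same, same), is_similar(same, same, threshold=0))
```

Under this reading, a test with two identical inputs would still pass with the threshold set to 0, which is the tighter check suggested above.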

<output name="output_file">
<assert_contents>
<has_text text="~00"/>
</assert_contents>
</output>
</test>
<!-- Test collection comparison -->
<test expect_num_outputs="1">
<conditional name="input_type">
<param name="input_selector" value="collection"/>
<param name="file_collection">
<collection type="list">
<element name="file1" value="test1.png"/>
<element name="file2" value="test2.tiff"/>
<element name="file3" value="test2.tiff"/>
</collection>
</param>
</conditional>
<param name="threshold" value="12"/>
<output name="output_file">
<assert_contents>
<has_text text="~00"/>
Member

The output of the tool for this test case is:

ISCC:K4AOMGOGQJA4Y46PAC4YPPA63GKD5RVFPR7FU3I4OOEW44TYXNYOTMY *input_files/test1
ISCC:K4AGSPOSB5SS2X427WZ27QASTSBVTS55DXLMFDF7WOJKEOSTDEI3OXQ *input_files/test2
  ~00 ISCC:K4AGSPOSB5SS2X427WZ27QASTSBVTS55DXLMFDF7WOJKEOSTDEI3OXQ *input_files/test2 (1)

We can make the tests a bit stricter:

Suggested change
- <has_text text="~00"/>
+ <has_line_matching expression="^ISCC:[A-Z0-9]+ \*input_files/file1$"/>
+ <has_line_matching expression="^ISCC:[A-Z0-9]+ \*input_files/file2$"/>
+ <has_line_matching expression="^ +\~00 ISCC:[A-Z0-9]+ \*input_files/file3$"/>

I guess something similar can be done for lines 94 and 137.
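
The suggested expressions can be tried out locally before committing them. The sketch below is only an approximation: it assumes `has_line_matching` behaves like a multi-line regular-expression search in Python, and it assumes the symlinks end up named after the collection element identifiers (`file1`, `file2`, `file3`). The sample output reuses the ISCC codes quoted in the comment; it is not captured from a real run.

```python
import re

# Assumed tool output for the collection test (file names are an assumption).
SAMPLE_OUTPUT = "\n".join([
    "ISCC:K4AOMGOGQJA4Y46PAC4YPPA63GKD5RVFPR7FU3I4OOEW44TYXNYOTMY *input_files/file1",
    "ISCC:K4AGSPOSB5SS2X427WZ27QASTSBVTS55DXLMFDF7WOJKEOSTDEI3OXQ *input_files/file2",
    "  ~00 ISCC:K4AGSPOSB5SS2X427WZ27QASTSBVTS55DXLMFDF7WOJKEOSTDEI3OXQ *input_files/file3",
])

# The three expressions from the suggested change.
PATTERNS = [
    r"^ISCC:[A-Z0-9]+ \*input_files/file1$",
    r"^ISCC:[A-Z0-9]+ \*input_files/file2$",
    r"^ +\~00 ISCC:[A-Z0-9]+ \*input_files/file3$",
]


def has_line_matching(text: str, expression: str) -> bool:
    """Hypothetical stand-in for the Galaxy assertion: match any single line."""
    return re.search(expression, text, flags=re.MULTILINE) is not None


results = [has_line_matching(SAMPLE_OUTPUT, p) for p in PATTERNS]
print(results)
```

If the real output uses different file names than assumed here, the expressions need to be adjusted accordingly.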

</assert_contents>
</output>
</test>
</tests>

<help><![CDATA[
**What it does**

Compares files by their ISCC codes to detect content similarity.

This tool can operate in two modes:

1. **Pairwise comparison**: Compare two specific files to determine their similarity
2. **Collection comparison**: Find all similar files within a collection

**Similarity Measurement**

Similarity is measured using Hamming distance on the Data-Code component of ISCC codes.
Lower numbers indicate higher similarity:

- **0-5**: Nearly identical files
- **6-12**: Likely similar content (default threshold)
- **13-20**: Probably somewhat similar

**Input**

Either:
- Two individual files for pairwise comparison
- A dataset collection containing multiple files

**Output**

A text report showing similarity relationships. Files are grouped by similarity, with
reference files listed first and similar files indented below with their Hamming distance (~N).

Example output::

document_v1.txt
~08 document_v2.txt
~12 document_draft.txt
Comment on lines +177 to +179
Member

Judging by what I observed with planemo serve, I think the output rather looks like this, doesn't it?

Suggested change
- document_v1.txt
- ~08 document_v2.txt
- ~12 document_draft.txt
+ ISCC:K4A... *input_files/document_v1
+ ~08 ISCC:K4A... *input_files/document_v2
+ ~12 ISCC:K4A... *input_files/document_draft
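
To make the report layout concrete, here is a rough sketch of how a downstream step might parse it. It assumes the format observed in the earlier comment: reference lines start with `ISCC:`, and similar files are listed on indented lines starting with `~NN`. The function name and the returned structure are hypothetical.

```python
import re

REFERENCE = re.compile(r"^ISCC:(?P<code>[A-Z0-9]+) \*(?P<path>.+)$")
SIMILAR = re.compile(r"^\s+~(?P<distance>\d+) ISCC:(?P<code>[A-Z0-9]+) \*(?P<path>.+)$")


def parse_report(text: str) -> list:
    """Group indented '~NN' lines under the preceding reference line."""
    groups = []
    for line in text.splitlines():
        similar = SIMILAR.match(line)
        if similar:
            if not groups:
                raise ValueError("similar line without a preceding reference line")
            groups[-1]["similar"].append((int(similar["distance"]), similar["path"]))
            continue
        reference = REFERENCE.match(line)
        if reference:
            groups.append({"reference": reference["path"], "similar": []})
    return groups


# Output quoted in the review of the collection test.
report = "\n".join([
    "ISCC:K4AOMGOGQJA4Y46PAC4YPPA63GKD5RVFPR7FU3I4OOEW44TYXNYOTMY *input_files/test1",
    "ISCC:K4AGSPOSB5SS2X427WZ27QASTSBVTS55DXLMFDF7WOJKEOSTDEI3OXQ *input_files/test2",
    "  ~00 ISCC:K4AGSPOSB5SS2X427WZ27QASTSBVTS55DXLMFDF7WOJKEOSTDEI3OXQ *input_files/test2 (1)",
])
groups = parse_report(report)
print(groups)
```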


**More Information**

For more details about ISCC, visit: https://iscc.codes/
]]></help>
<expand macro="citations" />
</tool>
147 changes: 147 additions & 0 deletions tools/iscc-sum/iscc_verify.xml
@@ -0,0 +1,147 @@
<tool id="iscc_sum_verify" name="Verify ISCC hash" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="24.1">
<description>with ISCC-SUM</description>
<macros>
<import>macros.xml</import>
<import>creators.xml</import>
</macros>
<expand macro="requirements" />
<expand macro="version_command" />
<creator>
<expand macro="creators/iscc" />
<expand macro="creators/lco" />
<expand macro="creators/maartenpaul" />
<expand macro="creators/etzm" />
</creator>

<command detect_errors="exit_code"><![CDATA[
GENERATED=\$(iscc-sum '$input_file' | cut -d':' -f2 | cut -d' ' -f1) &&
#if $expected_code_source.source_type == "text":
EXPECTED='$expected_code_source.expected_code_text'
#else:
## Read file and remove all whitespace including newlines
EXPECTED=\$(cat '$expected_code_source.expected_code_file' | head -n 1 | tr -d '[:space:]')
#end if
&&
echo "File: $input_file.element_identifier" > '${output_file}' &&
echo "Expected: \$EXPECTED" >> '${output_file}' &&
echo "Generated: \$GENERATED" >> '${output_file}' &&
if [ "\$GENERATED" = "\$EXPECTED" ]; then
echo "Status: OK" >> '${output_file}';
else
echo "Status: FAILED" >> '${output_file}';
fi
]]></command>

<inputs>
<param name="input_file" type="data" format="data" label="File to verify"/>
<conditional name="expected_code_source">
<param name="source_type" type="select" label="Source of expected ISCC code">
<option value="text">Enter manually</option>
<option value="file">From file (workflow input)</option>
</param>
<when value="text">
<param name="expected_code_text" type="text" label="Expected ISCC code"
help="The 55-character ISCC-SUM code to verify against">
<validator type="length" min="55" max="55" message="ISCC code must be exactly 55 characters"/>
</param>
</when>
<when value="file">
<param name="expected_code_file" type="data" format="txt" label="File containing expected ISCC code"
help="Text file containing the ISCC code from a previous step"/>
</when>
</conditional>
</inputs>

<outputs>
<data name="output_file" format="txt" label="${tool.name} on ${on_string}"/>
</outputs>

<tests>
<!-- Test verification with text input - match -->
<test expect_num_outputs="1">
<param name="input_file" value="test1.png"/>
<conditional name="expected_code_source">
<param name="source_type" value="text"/>
<param name="expected_code_text" value="K4AOMGOGQJA4Y46PAC4YPPA63GKD5RVFPR7FU3I4OOEW44TYXNYOTMY"/>
</conditional>
<output name="output_file">
<assert_contents>
<has_text text="Status: OK"/>
</assert_contents>
</output>
</test>
<!-- Test verification with text input - mismatch -->
<test expect_num_outputs="1">
<param name="input_file" value="test1.png"/>
<conditional name="expected_code_source">
<param name="source_type" value="text"/>
<param name="expected_code_text" value="K4AGSPOSB5SS2X427WZ27QASTSBVTS55DXLMFDF7WOJKEOSTDEI3OXQ"/>
</conditional>
<output name="output_file">
<assert_contents>
<has_text text="Status: FAILED"/>
</assert_contents>
</output>
</test>
<!-- Test verification with file input -->
<test expect_num_outputs="1">
<param name="input_file" value="test1.png"/>
<conditional name="expected_code_source">
<param name="source_type" value="file"/>
<param name="expected_code_file" value="test1_iscc.txt"/>
</conditional>
<output name="output_file">
<assert_contents>
<has_text text="Status: OK"/>
</assert_contents>
</output>
</test>
</tests>

<help><![CDATA[
**What it does**

Verifies that a file matches an expected ISCC code for checksum verification.

This tool generates an ISCC code for the input file and compares it against
the provided expected code. It reports whether the codes match (OK) or don't match (FAILED).

**Use Cases**

- Verify file integrity after transfer or storage
- Confirm you have the correct version of a file
- Validate that a file hasn't been modified
- Workflow verification: Generate ISCC code in one step, verify in another

**Input**

- A file to verify
- The expected ISCC-SUM code, either:
- Entered manually as text (55-character string)
- From a file output from a previous workflow step

**Output**

A verification report showing:

- The filename
- Expected ISCC code
- Generated ISCC code from the file
- Status: OK (codes match) or FAILED (codes don't match)

**Workflow Usage**

In a workflow, you can:

1. Use "Generate ISCC hash" tool to create an ISCC code for a reference file
2. Connect that output to this tool's "expected code file" input
3. Provide a test file to verify

This allows automated verification within Galaxy workflows.

**More Information**

For more details about ISCC, visit: https://iscc.codes/
]]></help>
<expand macro="citations" />
</tool>
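
The comparison logic in the `<command>` block of iscc_verify.xml can be mirrored in Python to check the parsing steps in isolation. This is a sketch rather than a replacement for the shell pipeline: the function names are hypothetical, and the input line follows the `ISCC:<code> *<path>` format quoted in the review of iscc_similarity.xml.

```python
def extract_code(iscc_sum_line: str) -> str:
    """Mirror `cut -d':' -f2 | cut -d' ' -f1` on one line of iscc-sum output."""
    line = iscc_sum_line.rstrip("\n")
    # cut prints the whole line when the delimiter is absent.
    colon_fields = line.split(":")
    field = colon_fields[1] if len(colon_fields) > 1 else line
    return field.split(" ")[0]


def read_expected(file_text: str) -> str:
    """Mirror `head -n 1 | tr -d '[:space:]'`: first line, whitespace removed."""
    first_line = file_text.split("\n", 1)[0]
    return "".join(first_line.split())


def verify(iscc_sum_line: str, expected: str) -> str:
    """Return the status string written to the report."""
    return "OK" if extract_code(iscc_sum_line) == expected else "FAILED"


line = "ISCC:K4AOMGOGQJA4Y46PAC4YPPA63GKD5RVFPR7FU3I4OOEW44TYXNYOTMY *input_files/test1"
expected = read_expected("K4AOMGOGQJA4Y46PAC4YPPA63GKD5RVFPR7FU3I4OOEW44TYXNYOTMY\n")
print(verify(line, expected))
```

Note that the comparison is on the code without the `ISCC:` prefix, which matches the 55-character values used in the tests and in test1_iscc.txt.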
2 changes: 1 addition & 1 deletion tools/iscc-sum/macros.xml
@@ -1,6 +1,6 @@
<macros>
<token name="@TOOL_VERSION@">0.1.0</token>
- <token name="@VERSION_SUFFIX@">0</token>
+ <token name="@VERSION_SUFFIX@">1</token>
<xml name="citations">
<citations>
<citation type="bibtex">
1 change: 1 addition & 0 deletions tools/iscc-sum/test-data/test1_iscc.txt
@@ -0,0 +1 @@
K4AOMGOGQJA4Y46PAC4YPPA63GKD5RVFPR7FU3I4OOEW44TYXNYOTMY
Binary file added tools/iscc-sum/test-data/test2_8bit.tif
Binary file not shown.
Binary file added tools/iscc-sum/test-data/test2_black.tif
Binary file not shown.
Binary file added tools/iscc-sum/test-data/test2_contrast.tif
Binary file not shown.
58 changes: 58 additions & 0 deletions tools/iscc-sum/test-data/test3_mutated.fasta
@@ -0,0 +1,58 @@
>sp|P51587|BRCA2_HUMAN Breast cancer type 2 susceptibility protein OS=Homo sapiens OX=9606 GN=BRCA2 PE=1 SV=4
MPIGSKERPTFFEIFKTRCNKADLGPISLNWFEELSSEAPPYNSEPAEESEHKNNNYEPN
LFKTPQRKPSYNQLASTPIIFKEQGLTLPLYQSPVKELDKFKLDLGRNVPNSRHKSLRTV
KTKMDQADDVSCPLLNSCLSESPVVLQCTHVTPQRDKSVVCGSLFHTPKFVKGRQTPKHI
SESLGAEVDPDMSWSSSLATPPTLSSTVLIVRNEEASETVFPHDTTANVKSYFSNHDESL
KKNDRFIASVTDSENTNQREAASHGFGKTSGNSFKVNSCKDHIGKSMPNVLEDEVYETVV
DTSEEDSFSLCFSKCRTKNLQKVRTSKTRKKIFHEANADECEKSKNQVKEKYSFVSEVEP
NDTDPLDSNVANQKPFESGSDKISKEVVPSLACEWSQLTLSGLNGAQMEKIPLLHISSCD
QNISEKDLLDTENKRKKDFLTSENSLPRISSLPKSEKPLNEETVVNKRDEEQHLESHTDC
ILAVKQAISGTSPVASSFQGIKKSIFRIRESPKETFNASFSGHMTDPNFKKETEASESGL
EIHTVCSQKEDSLCPNLIDNGSWPATTTQNSVALKNAGLISTLKKKTNKFIYAIHDETSY
KGKKIPKDQKSELINCSAQFEANAFEAPLTFANADSGLLHSSVKRSCSQNDSEEPTLSLT
SSFGTILRKCSRNETCSNNTVISQDLDYKEAKCNKEKLQLFITPEADSLSCLQEGQCEND
PKSKKVSDIKEEVLAAACHPVQHSKVEYSDTDFQSQKSLLYDHENASTLILTPTSKDVLS
NLVMISRGKESYKMSDKLKGNNYESDVELTKNIPMEKNQDVCALNENYKNVELLPPEKYM
RVASPSRKVQFNQNTNLRVIQKNQEETTSISKITVNPDSEELFSDNENNFVFQVANERNN
LALGNTKELHETDLTCVNEPIFKNSTMVLYGDTGDKQDTQVSIKKDLVYVLAEENKNSVK
QHIKMTLGQDLKSDISLNIDKIPEKNNDYMNKWAGLLGPISNHSFGGSFRTASNKEIKLS
EHNIKKSKMFFKDIEEQYPTSLACVEIVNTLALDNQKKLSKPQSINTVSAHLQSSVVVSD
CKNSHITPQMLFSKQDFNSNHNLTPSQKAEITELSTILEESGSQFEFTQFRKPSYILQKS
TFEVPENQMTILKTTSEECRDADLHVIMNAPSIGQVDSSKQFEGTVEIKRKFAGLLKNDC
NKSASGYLTDENEVGFRGFYSAHGTKLNVSTEALQKAVKLFSDIENISEETSAEVHPISL
SSSKCHDSVVSMFKIENHNDKTVSEKNNKCQLILQNNIEMTTGTFVEEITENYKRNTENE
DNKYTAASRNSHNLEFDGSDSSKNDTVCIHKDETDLLFTDQHNICLKLSGQFMKEGNTQI
KEDLSDLTFLEVAKAQEACHGNTSNKEQLTATKTEQNIKDFETSDTFFQTASGKNISVAK
ESFNKIVNFFDQKPEELHNFSLNSELHSDIRKNKMDILSYEETDIVKHKILKESVPVGTG
NQLVTFQGQPERDEKIKEPTLLGFHTASGKKVKIAKESLDKVKNLFDEKEQGTSEITSFS
HQWAKTLKYREACKDLELACETIEITAAPKCKEMQNSLNNDKNLVSIETVVPPKLLSDNL
CRQTENLKTSKSIFLKVKVHENVEKETAKSPATCYTNQSPYSVIENSALAFYTSCSRKTS
VSQTSLLEAKKWLREGIFDGQPERINTADYVGNYLYENNSNSTIAENDKNHLSEKQDTYL
SNSSMSNSYSYHSDEVYNDSGYLSKNKLDSGIEPVLKNVEDQKNTSFSKVISNVKDANAY
PQTVNEDICVEELVTSSSPCKNKNAAIKLSISNSNNFEVGPPAFRIASGKIVCVSHETIK
KVKDIFTDSFSKVIKENNENKSKICQTKIMAGCYEALDDSEDILHNSLDNDECSTHSHKV
FADIQSEEILQHNQNMSGLEKVSKISPCDVSLETSDICKCSIGKLHKSVSSANTCGIFST
ASGKSVQVSDASLQNARQVFSEIEDSTKQVFSKVLFKSNEHSDQLTREENTAIRTPEHLI
SQKGFSYNVVNSSAFSGFSTASGKQVSILESSLHKVKGVLEEFDLIRTEHSLHYSPTSRQ
NVSKILPRVDKRNPEHCVNSEMEKTCSKEFKLSNNLNVEGGSSENNHSIKVSPYLSQFQQ
DKQQLVLGTKVSLVENIHVLGKEQASPKNVKMEIGKTETFSDVPVKTNIEVCSTYSKDSE
NYFETEAVEIAKAFMEDDELTDSKLPSHATHSLFTCPENEEMVLSNSRIGKRRGEPLILV
GEPSIKRNLLNEFDRIIENQEKSLKASKSTPDGTIKDRRLFMHHVSLEPITCVPFRTTKE
RQEIQNPNFTAPGQEFLSKSHLYEHLTLEKSSSNLAVSGHPFYQVSATRNEKMRHLITTG
RPTKVFVPPFKTKSHFHRVEQCVRNINLEENRQKQNIDGHGSDDSKNKINDNEIHQFNKN
NSNQAVAVTFTKCEEEPLDLITSLQNARDIQDMRIKKKQRQRVFPQPGSLYLAKTSTLPR
ISLKAAVGGQVPSACSHKQLYTYGVSKHCIKINSKNAESFQFHTEDYFGKESLWTGKGIQ
LADGGWLIPSNDGKAGKEEFYRALCDTPGVDPKLISRIWVYNHYRWIIWKLAAMECAFPK
EFANRCLSPERVLLQLKYRYDTEIDRSRRSAIKKIMERDDTAAKTLVLCVSDIISLSANI
SETSSNKTSSADTQKVAIIELTDGWYAVKAQLDPPLLAVLKNGRLTVGQKIILHGAELVG
SPDACTPLEAPESLMLKISANSTRPARWYTKLGFFPDPRPFPLPLSSLFSDGGNVGCVDV
IIQRAYPIQWMEKTSSGLYIFRNEREEFKEAAKYVEAQQKRLEALFTKIQEEFEEHEENT
TKPYLPSRALTRQQVRALQDGAELYEAVKNAADPAYLEGYFSEEQLRALNNHRQMLNDKK
QAQIQLEIRKAMESAEQKEQGLSRDVTTVWKLRIVSYSKKEKDSVILSIWRPSSDLYSLL
TEGKRYRIYHLATSKSKSKSERANIQLAATKKTQYQQLPVSDEILFQIYQPREPLHFSKF
LDPDFQPSCSEVDLIGFVVSVVKKTGLAPFVYLSDECYNLLAIKFWIDLNEDIIKPHMLI
AASNLQWRPESKSGLLTLFAGDFSVFSASPKEGHFQETFNKMKNTVENIDILCNEAENKL
MHILHANDPKWSTPTKDCTSGPYTAQIIPGTGNKLLMSSPNCEIYYQSPLSLCMAKRKSV
STPVSAQMTSKSCKGEKEIDDQKNCKKRRASDFLSRLPLPPPVSPICTFVSPAAQKAFQP
PRSCGTKYETPIKKKELNSPQMTPFKKFNEISLLESNSIADEELALINTQALLSGSTGEK
QFISVSESTRTAPTSSEDYLRLKRRCTTSLIKEQESSQASTEECEKNKQDTITTKKYI