BMCV · maartenpaul · Nov 6, 2025
diff --git a/tools/iscc-sum/README.md b/tools/iscc-sum/README.md
@@ -0,0 +1,156 @@
+# ISCC-SUM Tools for Galaxy
+
+A suite of Galaxy tools for generating, verifying, and comparing ISCC (International Standard Content Code) hashes. ISCC-SUM provides content-derived identifiers for file integrity verification and similarity detection.
+
+## Tools Overview
+
+### 1. Generate ISCC hash (`iscc_sum.xml`)
+**Purpose**: Create reference ISCC hashes for files
+
+**Modes**:
+- **Single file**: Generate one ISCC hash
+- **Collection (individual)**: Generate one hash per file, labeled with its element identifier
+- **Collection (combined)**: Generate a single hash for the entire collection
+
+**Use when**: You need to create reference hashes for later verification or comparison
+
+---
+
+### 2. Verify ISCC hash (`iscc_verify.xml`)
+**Purpose**: Exact match verification - check if files are identical
+
+**Modes**:
+- **Single file**: Verify one file against expected hash
+- **Collection**: Verify entire collection as one unit (generates combined ISCC)
+
+**Use when**: You need to confirm files are EXACTLY the same (bit-for-bit)
+
+**Note**: Even minor modifications will cause verification to FAIL
+
+---
+
+### 3. Compare ISCC similarity (`iscc_similarity.xml`)
+**Purpose**: Content similarity detection - find related/modified files
+
+**Modes**:
+- **Two files**: Compare two specific files
+- **Collection**: Find all similar files within a collection
+- **Two collections**: Compare two entire collections as units (generates combined ISCC for each)
+
+**Use when**: You want to detect:
+- Near-duplicates (minor edits, format changes)
+- Different versions of same content
+- Related files (cropped images, edited documents)
+- Whether two entire datasets are similar (even if not exactly identical)
+
+**Key feature**: Works especially well on **large datasets** - ISCC-SUM is optimized for detecting similarity across many files efficiently
+
+---
+
+## Usage Scenarios
+
+### Scenario 1: Quick Collection Integrity Check
+**Goal**: Quickly verify if an entire collection has changed
+
+```
+Workflow:
+1. Generate ISCC (combined mode) → Input: Reference collection (100 files)
+                                   Output: Single ISCC hash
+
+2. [Transfer/storage/time passes]
+
+3. Verify ISCC → Input: New collection (100 files) + reference hash
+                 Output: "Status: OK" or "Status: FAILED"
+```
+
+**Result**: You know instantly if the collection as a whole has changed (but not which specific files)
+
+---
+
+### Scenario 2: Dataset Similarity When Verification Fails
+**Goal**: Check whether a large dataset matches the reference exactly and, if not, how similar it is
+
+**Why this is important**: A dataset you receive may have been modified in transit or during processing. You then want to know whether it is still the data you expect, and how far it differs from the reference.
+
+```
+Workflow:
+1. Generate ISCC (combined mode) → Input: Reference collection (100 files)
+                                   Output: Single ISCC hash
+
+2. [Transfer/storage/time passes - dataset may have changed]
+
+3. Verify ISCC → Input: New collection (100 files) + reference hash
+                 Output: "Status: FAILED - Hashes do not match"
+
+4. Compare similarity (two collections) → Input: Reference collection + New collection
+                                          Threshold: 12
+                                          Output: "Similarity: ~08 (Very similar, minor changes)"
+```
+
+**Result**:
+- You know the dataset is NOT identical (verification failed)
+- You know it's still very similar (Hamming distance = 8)
+- You can make an informed decision: Is it similar enough to use? Should you investigate the differences?
+
+**Use cases**:
+- Receiving scientific datasets from collaborators
+- Verifying backups that may have compression/format changes
+- Checking if processed data still matches original closely enough
+- Quality control where "close enough" is acceptable
+
+---
+
+### Scenario 3: Duplicate Detection in Large Dataset
+**Goal**: Find duplicate and near-duplicate files in a large collection
+
+**Why ISCC-SUM excels here**: Traditional hash functions (MD5, SHA) only detect exact duplicates. ISCC-SUM detects **content similarity**, making it ideal for:
+- Image collections (same image with edits)
+- Document repositories (different versions, formats)
+- Genomic data (similar sequences with variations)
+
+```
+Workflow:
+1. Compare similarity → Input: Large collection (1000+ files)
+                        Threshold: 12 (configurable)
+                        Output: Groups of similar files
+
+Example output:
+  reference_image_001.png
+    ~00 copy_of_001.png (byte-identical copy under a different name)
+    ~08 edited_001.png (minor edits)
+
+  document_v1.txt
+    ~05 document_v2.txt (very similar)
+    ~12 document_draft.txt (moderate differences)
+```
+
+**Result**: Identify redundant files, track versions, find related content
+
+---
+
+### Scenario 4: Quality Control Pipeline
+**Goal**: Automated verification in a data processing pipeline
+
+```
+Galaxy Workflow:
+1. [Data arrives] → Collection of files
+
+2. Generate ISCC (combined mode) → Create hash for collection
+
+3. Verify ISCC → Compare against expected reference
+                 ↓
+    PASS: Continue workflow
+    FAIL: → 4. Compare similarity (two collections)
+              ↓
+              Report similarity score (~08 = very similar, ~48 = very different)
+              Decide: Accept if similar enough, or reject and investigate
+```
+
+**Result**: Automated QC catches data integrity issues and helps determine if differences are acceptable
+
+---
+
+## More Information
+
+- ISCC specification: https://iscc.codes/
+- ISCC-SUM GitHub: https://github.com/iscc/iscc-sum
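
Scenarios 2 and 4 in the README above depend on reading the `~NN` value as a Hamming distance and comparing it to a threshold. A minimal Python sketch of that reasoning follows. It assumes the two codes are already available as equal-length byte strings; how iscc-sum decodes the Data-Code internally is not shown here. The function names (`hamming_distance`, `describe`, `qc_decision`) and the returned labels are hypothetical and not part of iscc-sum. The bands are the ones quoted in the `threshold` help text of `iscc_similarity.xml`.

```python
from typing import Optional


def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the differing bits between two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def describe(distance: int) -> str:
    """Map a distance to the bands listed in the tool help (illustrative labels)."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    if distance <= 5:
        return "nearly identical"
    if distance <= 12:
        return "likely similar"
    if distance <= 20:
        return "probably somewhat similar"
    return "outside the documented bands"


def qc_decision(verified: bool, distance: Optional[int], threshold: int = 12) -> str:
    """Scenario 4 as code: pass on exact match, else fall back to similarity."""
    if verified:
        return "continue"
    if distance is not None and distance <= threshold:
        return "accept"
    return "reject and investigate"


if __name__ == "__main__":
    # Placeholder byte strings, not real ISCC codes.
    d = hamming_distance(bytes.fromhex("ff00"), bytes.fromhex("0f00"))
    print(d, describe(d), qc_decision(False, d))
```

With the default threshold of 12, `qc_decision(False, 8)` returns `"accept"`, which corresponds to the `~08` outcome described in Scenario 2.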
diff --git a/tools/iscc-sum/iscc_similarity.xml b/tools/iscc-sum/iscc_similarity.xml
@@ -0,0 +1,213 @@
+<tool id="iscc_sum_compare" name="Compare ISCC hash similarity" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="24.1">
+    <description>with ISCC-SUM</description>
+    <macros>
+        <import>macros.xml</import>
+        <import>creators.xml</import>
+    </macros>
+    <expand macro="requirements" />
+    <expand macro="version_command" />
+    <creator>
+        <expand macro="creators/iscc" />
+        <expand macro="creators/lco" />
+        <expand macro="creators/maartenpaul" />
+        <expand macro="creators/etzm" />
+    </creator>
+
+    <command detect_errors="exit_code"><![CDATA[
+        #if $input_type.input_selector == "two_files":
+            ## Pairwise file comparison
+            iscc-sum --similar --threshold $threshold '$input_type.file1' '$input_type.file2' > '${output_file}'
+        #elif $input_type.input_selector == "two_collections":
+            ## Two collections comparison - generate combined ISCC for each
+            ISCC1=\$(cat
+            #for $file in $input_type.collection1:
+                '$file'
+            #end for
+            | iscc-sum - | cut -d':' -f2 | cut -d' ' -f1) &&
+
+            ISCC2=\$(cat
+            #for $file in $input_type.collection2:
+                '$file'
+            #end for
+            | iscc-sum - | cut -d':' -f2 | cut -d' ' -f1) &&
+
+            ## Create temp files for comparison
+            echo \$ISCC1 > temp1.txt &&
+            echo \$ISCC2 > temp2.txt &&
+
+            ## Use iscc-sum to compare similarity and output directly
+            iscc-sum --similar --threshold $threshold temp1.txt temp2.txt > '${output_file}' || true
+        #else:
+            ## Single collection - find similar files within
+            mkdir -p input_files &&
+            #for $file in $input_type.file_collection:
+                ln -s '$file' 'input_files/$file.element_identifier' &&
+            #end for
+            iscc-sum --similar --threshold $threshold input_files/* > '${output_file}'
+        #end if
+    ]]></command>
+
+    <inputs>
+        <conditional name="input_type">
+            <param name="input_selector" type="select" label="Input type">
+                <option value="two_files">Compare two datasets</option>
+                <option value="collection">Find similar datasets in collection</option>
+                <option value="two_collections">Compare two collections</option>
+            </param>
+            <when value="two_files">
+                <param name="file1" type="data" format="data" label="First file"/>
+                <param name="file2" type="data" format="data" label="Second file"/>
-                <param name="file1" type="data" format="data" label="First file"/>
-                <param name="file2" type="data" format="data" label="Second file"/>
+                <param name="file1" type="data" format="tiff,png" label="First file"/>
+                <param name="file2" type="data" format="tiff,png" label="Second file"/>
+            </when>
+            <when value="collection">
+                <param name="file_collection" type="data_collection" collection_type="list"
+                       format="data" label="File collection to compare"/>
+            </when>
+            <when value="two_collections">
+                <param name="collection1" type="data_collection" collection_type="list" format="data"
+                       label="First collection (reference)"
+                       help="Reference collection - will generate combined ISCC"/>
+                <param name="collection2" type="data_collection" collection_type="list" format="data"
+                       label="Second collection (to compare)"
+                       help="Collection to compare against reference - will generate combined ISCC"/>
+            </when>
+        </conditional>
+        <param name="threshold" type="integer" value="12" min="0" max="256"
+               label="Similarity threshold (Hamming distance)"
+               help="Maximum Hamming distance for similarity matching. 0-5: Nearly identical, 6-12: Likely similar (default), 13-20: Probably somewhat similar"/>
+    </inputs>
+
+    <outputs>
+        <data name="output_file" format="txt" label="${tool.name} on ${on_string}"/>
+    </outputs>
+
+    <tests>
+        <!-- Test 1: Pairwise file comparison -->
+        <test expect_num_outputs="1">
+            <conditional name="input_type">
+                <param name="input_selector" value="two_files"/>
+                <param name="file1" value="test1.png"/>
+                <param name="file2" value="test1.png"/>
+            </conditional>
+            <param name="threshold" value="12"/>
+            <output name="output_file">
+                <assert_contents>
+                    <has_text text="~00"/>
+                </assert_contents>
+            </output>
+        </test>
+        <!-- Test 2: Single collection comparison -->
+        <test expect_num_outputs="1">
+            <conditional name="input_type">
+                <param name="input_selector" value="collection"/>
+                <param name="file_collection">
+                    <collection type="list">
+                        <element name="file1" value="test1.png"/>
+                        <element name="file2" value="test2.tiff"/>
+                        <element name="file3" value="test2.tiff"/>
+                    </collection>
+                </param>
+            </conditional>
+            <param name="threshold" value="12"/>
+            <output name="output_file">
+                <assert_contents>
+                    <has_text text="~00"/>
-                    <has_text text="~00"/>
+                    <has_line_matching expression="^ISCC:[A-Z0-9]+ \*input_files/file1$"/>
+                    <has_line_matching expression="^ISCC:[A-Z0-9]+ \*input_files/file2$"/>
+                    <has_line_matching expression="^ +\~00 ISCC:[A-Z0-9]+ \*input_files/file3$"/>
+                </assert_contents>
+            </output>
+        </test>
+        <!-- Test 3: Two collections comparison -->
+        <test expect_num_outputs="1">
+            <conditional name="input_type">
+                <param name="input_selector" value="two_collections"/>
+                <param name="collection1">
+                    <collection type="list">
+                        <element name="sample1" value="test1.png"/>
+                        <element name="sample2" value="test2.tiff"/>
+                    </collection>
+                </param>
+                <param name="collection2">
+                    <collection type="list">
+                        <element name="sample1" value="test1.png"/>
+                        <element name="sample2" value="test2.tiff"/>
+                    </collection>
+                </param>
+            </conditional>
+            <param name="threshold" value="12"/>
+            <output name="output_file">
+                <assert_contents>
+                    <has_text text="~00"/>
+                </assert_contents>
+            </output>
+        </test>
+    </tests>
+
+    <help><![CDATA[
+**What it does**
+
+Compares files or collections by their International Standard Content Code (ISCC) hashes to detect content similarity.
+
+This tool can operate in three modes:
+
+1. **Pairwise file comparison**: Compare two specific files
+2. **Collection analysis**: Find all similar files within a collection
+3. **Two collections comparison**: Compare two complete collections as units
+
+**Similarity Measurement**
+
+Similarity is measured using Hamming distance on the Data-Code component of ISCC codes.
+Lower numbers indicate higher similarity:
+
+- **0-5**: Nearly identical files
+- **6-12**: Likely similar content (default threshold)
+- **13-20**: Probably somewhat similar
+
+**Input**
+
+Choose one of:
+
+- **Two files**: Individual files for pairwise comparison
+- **One collection**: Find similar files within the collection
+- **Two collections**: Compare collections as complete units
+
+**Output**
+
+**Mode 1 & 2** (Files/Collection): Similarity relationships showing reference files and similar matches with Hamming distance (~N)
+
+Example::
+
+    document_v1.txt
+      ~08 document_v2.txt
+      ~12 document_draft.txt
-    document_v1.txt
-      ~08 document_v2.txt
-      ~12 document_draft.txt
+    ISCC:K4A... *input_files/document_v1
+      ~08 ISCC:K4A... *input_files/document_v2
+      ~12 ISCC:K4A... *input_files/document_draft
+
+**Mode 3** (Two collections): Similarity comparison of combined ISCCs
+
+Example::
+
+    ISCC:K4A... *temp1.txt
+      ~08 ISCC:K4A... *temp2.txt
+
+**Use Cases**
+
+- **Mode 1**: Compare two specific files for similarity
+- **Mode 2**: Find duplicates/near-duplicates in a large dataset (ISCC-SUM excels here!)
+- **Mode 3**: Check if two collections are similar (but not exactly the same)
+
+**Workflow: Verify then Compare**
+
+Step 1: `Verify ISCC hash`_
+  Input: New collection + reference hash
+  Output: FAILED
+
+.. _Verify ISCC hash: ?tool_id=toolshed.g2.bx.psu.edu/repos/imgteam/iscc_sum/iscc_sum_verify
+
+Step 2: Compare Similarity (two collections mode)
+  Input: Reference collection + new collection
+  Output: ~08 (very similar, minor differences)
+
+This tells you: "Not exact match, but content is very similar"
+
+**More Information**
+
+For more details about ISCC, visit: https://sum.iscc.codes/
+    ]]></help>
+    <expand macro="citations" />
+</tool>
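
The test assertions in `iscc_similarity.xml` imply a line format for `--similar` output: a reference line of the form `ISCC:<code> *<path>`, followed by indented `~NN ISCC:<code> *<path>` lines for each match. A small Python sketch that groups such output could look like the following. It is an assumption based only on those assertions, not on iscc-sum's documentation; `parse_similar_output` is a hypothetical helper, and the codes in the sample are placeholders.

```python
import re
from typing import Dict, List, Optional, Tuple

# Patterns modelled on the has_line_matching expressions in the tool tests.
REFERENCE = re.compile(r"^ISCC:[A-Z0-9]+ \*(?P<path>.+)$")
MATCH = re.compile(r"^ +~(?P<dist>\d+) ISCC:[A-Z0-9]+ \*(?P<path>.+)$")


def parse_similar_output(text: str) -> Dict[str, List[Tuple[int, str]]]:
    """Group matches under their reference file as (distance, path) pairs."""
    groups: Dict[str, List[Tuple[int, str]]] = {}
    current: Optional[str] = None
    for line in text.splitlines():
        ref = REFERENCE.match(line)
        if ref:
            current = ref.group("path")
            groups[current] = []
            continue
        hit = MATCH.match(line)
        if hit and current is not None:
            groups[current].append((int(hit.group("dist")), hit.group("path")))
    return groups


if __name__ == "__main__":
    # Placeholder codes; the layout mirrors what Test 2 asserts.
    sample = (
        "ISCC:AAAA *input_files/file1\n"
        "ISCC:BBBB *input_files/file2\n"
        "  ~00 ISCC:BBBB *input_files/file3\n"
    )
    for reference, matches in parse_similar_output(sample).items():
        print(reference, matches)
```

Lines that match neither pattern are skipped, so blank separators between groups do not affect the result.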