24 commits
609362e
Add ISCC-SUM tools for similarity comparison and checksum verification
maartenpaul Nov 6, 2025
554e346
Enhance ISCC checksum verification tool to support expected code inpu…
maartenpaul Nov 6, 2025
3dbf198
Rename tool to "Verify ISCC hash" and update test input parameter for…
maartenpaul Nov 6, 2025
757be8a
Rename tool to "Compare ISCC hash similarity" and update test asserti…
maartenpaul Nov 6, 2025
402ef65
Add ISCC hash to test data file for checksum verification
maartenpaul Nov 6, 2025
3eccef8
Update tool version to 0.1.1 in macros.xml
maartenpaul Nov 6, 2025
5c00880
Update tool version to 0.1.1 in macros.xml
maartenpaul Nov 6, 2025
e2e23e6
Merge branch 'iscc-tools' of https://github.com/maartenpaul/galaxy-im…
maartenpaul Nov 6, 2025
66e3119
add extra test files
maartenpaul Nov 7, 2025
9f14bed
More consistent naming of ISCC hash
maartenpaul Nov 12, 2025
0b99e9d
Add support for single file and collection input modes in ISCC-SUM tool
maartenpaul Nov 28, 2025
0a2a791
Restructure the ISCC tools to serve different purposes
maartenpaul Nov 28, 2025
57e50ac
Update README for dataset similarity and combined mode
maartenpaul Nov 28, 2025
11487b0
resolve issues with markdown in
maartenpaul Nov 28, 2025
06908a0
Merge branch 'iscc-tools' of https://github.com/maartenpaul/galaxy-im…
maartenpaul Nov 28, 2025
0d90677
update readme
maartenpaul Nov 28, 2025
29619f8
update README
maartenpaul Nov 28, 2025
1863f57
Improve error handling in ISCC hash verification
maartenpaul Nov 28, 2025
916fc64
Fix formatting in ISCC similarity tool documentation
maartenpaul Nov 28, 2025
aea916f
Fix formatting in ISCC-SUM documentation for clarity
maartenpaul Nov 28, 2025
5ed6929
Refactor expected ISCC hash input handling and improve documentation …
maartenpaul Nov 28, 2025
136ac4f
Update tools/iscc-sum/iscc_similarity.xml
maartenpaul Dec 2, 2025
c6fdf72
Update tools/iscc-sum/iscc_similarity.xml
maartenpaul Dec 3, 2025
1a7cfce
Update tools/iscc-sum/iscc_similarity.xml
maartenpaul Dec 3, 2025
156 changes: 156 additions & 0 deletions tools/iscc-sum/README.md
@@ -0,0 +1,156 @@
# ISCC-SUM Tools for Galaxy

A suite of Galaxy tools for generating, verifying, and comparing ISCC (International Standard Content Code) hashes. ISCC-SUM provides content-derived identifiers for file integrity verification and similarity detection.

## Tools Overview

### 1. Generate ISCC hash (`iscc_sum.xml`)
**Purpose**: Create reference ISCC hashes for files

**Modes**:
- **Single file**: Generate one ISCC hash
- **Collection (individual)**: Generate hash per file with element identifiers
- **Collection (combined)**: Generate single hash for entire collection

**Use when**: You need to create reference hashes for later verification or comparison

---

### 2. Verify ISCC hash (`iscc_verify.xml`)
**Purpose**: Exact match verification - check if files are identical

**Modes**:
- **Single file**: Verify one file against expected hash
- **Collection**: Verify entire collection as one unit (generates combined ISCC)

**Use when**: You need to confirm files are EXACTLY the same (bit-for-bit)

**Note**: Even minor modifications will cause verification to FAIL

---

### 3. Compare ISCC similarity (`iscc_similarity.xml`)
**Purpose**: Content similarity detection - find related/modified files

**Modes**:
- **Two files**: Compare two specific files
- **Collection**: Find all similar files within a collection
- **Two collections**: Compare two entire collections as units (generates combined ISCC for each)

**Use when**: You want to detect:
- Near-duplicates (minor edits, format changes)
- Different versions of same content
- Related files (cropped images, edited documents)
- Whether two entire datasets are similar (even if not exactly identical)

**Key feature**: Works especially well on **large datasets** - ISCC-SUM is optimized for detecting similarity across many files efficiently

---

## Usage Scenarios

### Scenario 1: Quick Collection Integrity Check
**Goal**: Quickly verify if an entire collection has changed

```
Workflow:
1. Generate ISCC (combined mode) → Input: Reference collection (100 files)
Output: Single ISCC hash

2. [Transfer/storage/time passes]

3. Verify ISCC → Input: New collection (100 files) + reference hash
Output: "Status: OK" or "Status: FAILED"
```

**Result**: You know instantly if the collection as a whole has changed (but not which specific files)
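
The combined mode can be pictured as hashing one concatenated byte stream. The sketch below is only an illustration under stated assumptions: it uses SHA-256 from Python's standard library as a stand-in for ISCC-SUM (it is not the ISCC algorithm), and it assumes the combined code is computed over the files concatenated in collection order, which is how `iscc_similarity.xml` in this PR builds it (`cat ... | iscc-sum -`). The helper name `combined_digest` is hypothetical.

```python
import hashlib


def combined_digest(files):
    """Hash the concatenation of all file contents, in the given order.

    SHA-256 is a stand-in here, NOT the ISCC-SUM algorithm.
    """
    h = hashlib.sha256()
    for content in files:
        h.update(content)
    return h.hexdigest()


reference = [b"alpha", b"beta", b"gamma"]
unchanged = [b"alpha", b"beta", b"gamma"]
modified = [b"alpha", b"BETA", b"gamma"]      # one file changed
reordered = [b"beta", b"alpha", b"gamma"]     # same files, different order
resplit = [b"al", b"phabeta", b"gamma"]       # same bytes, different file boundaries

print(combined_digest(reference) == combined_digest(unchanged))  # True
print(combined_digest(reference) == combined_digest(modified))   # False
print(combined_digest(reference) == combined_digest(reordered))  # False
print(combined_digest(reference) == combined_digest(resplit))    # True
```

If the combined mode does work this way, the single hash reveals that something changed but not which file, it depends on element order, and it does not see file boundaries.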

---

### Scenario 2: Dataset Similarity When Verification Fails
**Goal**: Check whether a large dataset matches the reference exactly and, if not, how similar it is

**Why this is important**: A dataset you receive may have been modified. In that case you want to know whether it is still the data you expect, and how much it differs from the reference.

```
Workflow:
1. Generate ISCC (combined mode) → Input: Reference collection (100 files)
Output: Single ISCC hash

2. [Transfer/storage/time passes - dataset may have changed]

3. Verify ISCC → Input: New collection (100 files) + reference hash
Output: "Status: FAILED - Hashes do not match"

4. Compare similarity (two collections) → Input: Reference collection + New collection
Threshold: 12
Output: "Similarity: ~08 (Very similar, minor changes)"
```

**Result**:
- You know the dataset is NOT identical (verification failed)
- You know it's still very similar (Hamming distance = 8)
- You can make an informed decision: Is it similar enough to use? Should you investigate the differences?

**Use cases**:
- Receiving scientific datasets from collaborators
- Verifying backups that may have compression/format changes
- Checking if processed data still matches original closely enough
- Quality control where "close enough" is acceptable
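
To make the "Hamming distance = 8" reading more concrete, here is a minimal sketch of how a Hamming distance between two equal-length bit strings can be computed and mapped to the bands named in the help text of `iscc_similarity.xml` in this PR (0-5, 6-12, 13-20). It does not decode real ISCC codes; the byte values and the function names `hamming_distance` and `describe` are made up for illustration.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length byte strings differ."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def describe(distance: int) -> str:
    """Map a distance to the bands from the tool's help text."""
    if distance <= 5:
        return "nearly identical"
    if distance <= 12:
        return "likely similar"
    if distance <= 20:
        return "probably somewhat similar"
    return "above the documented bands"


# Two hypothetical 16-bit fingerprints that differ in three bit positions.
ref = bytes([0b11110000, 0b10101010])
new = bytes([0b11110011, 0b10101000])

d = hamming_distance(ref, new)
print(d, describe(d))  # 3 nearly identical
```

A distance of 8, as in the scenario above, would fall into the "likely similar" band.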

---

### Scenario 3: Duplicate Detection in Large Dataset
**Goal**: Find duplicate and near-duplicate files in a large collection

**Why ISCC-SUM excels here**: Traditional hash functions (MD5, SHA) only detect exact duplicates. ISCC-SUM detects **content similarity**, making it ideal for:
- Image collections (same image with edits)
- Document repositories (different versions, formats)
- Genomic data (similar sequences with variations)

```
Workflow:
1. Compare similarity → Input: Large collection (1000+ files)
Threshold: 12 (configurable)
Output: Groups of similar files

Example output:
reference_image_001.png
~00 duplicate_001.jpg (exact duplicate, different format)
~08 edited_001.png (minor edits)

document_v1.txt
~05 document_v2.txt (very similar)
~12 document_draft.txt (moderate differences)
```

**Result**: Identify redundant files, track versions, find related content
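
For downstream processing, the grouped output can be parsed into a mapping from each reference file to its matches. The sketch below assumes the line format reported in the review thread of this PR (`ISCC:<code> *<path>` for a reference, and an indented `~NN ISCC:<code> *<path>` for each match), which differs from the simplified example above. The function name `parse_similarity_output` is hypothetical, and the real CLI output may change.

```python
import re

# Reference line: "ISCC:<code> *<path>"
REFERENCE = re.compile(r"^ISCC:[A-Z0-9]+ \*(?P<path>.+)$")
# Match line: leading whitespace, "~NN", then the same layout as a reference line
MATCH = re.compile(r"^\s+~(?P<dist>\d+) ISCC:[A-Z0-9]+ \*(?P<path>.+)$")


def parse_similarity_output(text):
    """Return {reference_path: [(distance, similar_path), ...]}."""
    groups = {}
    current = None
    for line in text.splitlines():
        match = MATCH.match(line)
        if match and current is not None:
            groups[current].append((int(match.group("dist")), match.group("path")))
            continue
        reference = REFERENCE.match(line)
        if reference:
            current = reference.group("path")
            groups[current] = []
    return groups


sample = """\
ISCC:K4AOMGOGQJA4Y46PAC4YPPA63GKD5RVFPR7FU3I4OOEW44TYXNYOTMY *input_files/test1
ISCC:K4AGSPOSB5SS2X427WZ27QASTSBVTS55DXLMFDF7WOJKEOSTDEI3OXQ *input_files/test2
  ~00 ISCC:K4AGSPOSB5SS2X427WZ27QASTSBVTS55DXLMFDF7WOJKEOSTDEI3OXQ *input_files/test2 (1)
"""

for ref_path, matches in parse_similarity_output(sample).items():
    print(ref_path, matches)
```

Paths are taken as everything after the `*`, so names containing spaces (such as `test2 (1)`) are kept intact.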

---

### Scenario 4: Quality Control Pipeline
**Goal**: Automated verification in a data processing pipeline

```
Galaxy Workflow:
1. [Data arrives] → Collection of files

2. Generate ISCC (combined mode) → Create hash for collection

3. Verify ISCC → Compare against expected reference
PASS: Continue workflow
FAIL: → 4. Compare similarity (two collections)
Report similarity score (~08 = very similar, ~48 = very different)
Decide: Accept if similar enough, or reject and investigate
```

**Result**: Automated QC catches data integrity issues and helps determine if differences are acceptable
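
The branching in this workflow can be written down as a small decision function. This is a sketch only: the function name `qc_decision` and the three outcome labels are hypothetical, and the default threshold of 12 mirrors the tool default. A distance of `None` stands for the case where the similarity tool reports no match within the threshold.

```python
from typing import Optional


def qc_decision(verified: bool, distance: Optional[int] = None, threshold: int = 12) -> str:
    """Decide how to proceed after the verify step and the optional similarity step."""
    if verified:
        return "continue"   # exact match, nothing else to check
    if distance is not None and distance <= threshold:
        return "review"     # not identical, but similar enough to inspect
    return "reject"         # too different, or no match reported


print(qc_decision(True))        # continue
print(qc_decision(False, 8))    # review
print(qc_decision(False, 48))   # reject
print(qc_decision(False))       # reject
```

Whether "review" should mean automatic acceptance or a manual check is a policy choice for the pipeline owner.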

---

## More Information

- ISCC specification: https://iscc.codes/
- ISCC-SUM GitHub: https://github.com/iscc/iscc-sum
213 changes: 213 additions & 0 deletions tools/iscc-sum/iscc_similarity.xml
@@ -0,0 +1,213 @@
<tool id="iscc_sum_compare" name="Compare ISCC hash similarity" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="24.1">
<description>with ISCC-SUM</description>
<macros>
<import>macros.xml</import>
<import>creators.xml</import>
</macros>
<expand macro="requirements" />
<expand macro="version_command" />
<creator>
<expand macro="creators/iscc" />
<expand macro="creators/lco" />
<expand macro="creators/maartenpaul" />
<expand macro="creators/etzm" />
</creator>

<command detect_errors="exit_code"><![CDATA[
#if $input_type.input_selector == "two_files":
## Pairwise file comparison
iscc-sum --similar --threshold $threshold '$input_type.file1' '$input_type.file2' > '${output_file}'
#elif $input_type.input_selector == "two_collections":
## Two collections comparison - generate combined ISCC for each
ISCC1=\$(cat
#for $file in $input_type.collection1:
'$file'
#end for
| iscc-sum - | cut -d':' -f2 | cut -d' ' -f1) &&

ISCC2=\$(cat
#for $file in $input_type.collection2:
'$file'
#end for
| iscc-sum - | cut -d':' -f2 | cut -d' ' -f1) &&

## Create temp files for comparison
echo \$ISCC1 > temp1.txt &&
echo \$ISCC2 > temp2.txt &&

## Use iscc-sum to compare similarity and output directly
iscc-sum --similar --threshold $threshold temp1.txt temp2.txt > '${output_file}' || true
#else:
## Single collection - find similar files within
mkdir -p input_files &&
#for $file in $input_type.file_collection:
ln -s '$file' 'input_files/$file.element_identifier' &&
#end for
iscc-sum --similar --threshold $threshold input_files/* > '${output_file}'
#end if
]]></command>

<inputs>
<conditional name="input_type">
<param name="input_selector" type="select" label="Input type">
<option value="two_files">Compare two datasets</option>
<option value="collection">Find similar datasets in collection</option>
<option value="two_collections">Compare two collections</option>
</param>
<when value="two_files">
<param name="file1" type="data" format="data" label="First file"/>
<param name="file2" type="data" format="data" label="Second file"/>
Comment on lines +58 to +59
Member

The tests only cover images, right? As of today, is the tool supposed to work with other data too? If so, it would make sense to add more tests with other data. Otherwise, I'd suggest restricting the input formats to images for now, to avoid negative user experiences:

Suggested change
- <param name="file1" type="data" format="data" label="First file"/>
- <param name="file2" type="data" format="data" label="Second file"/>
+ <param name="file1" type="data" format="tiff,png" label="First file"/>
+ <param name="file2" type="data" format="tiff,png" label="Second file"/>

The same applies to lines 63, 66, and 69, as well as to tools/iscc-sum/iscc_verify.xml and tools/iscc-sum/iscc_sum.xml.

Contributor Author

Yes, it should work for any data. I have a FASTA file in the test data and can make sure it is used in more of the tests.

</when>
<when value="collection">
<param name="file_collection" type="data_collection" collection_type="list"
format="data" label="File collection to compare"/>
</when>
<when value="two_collections">
<param name="collection1" type="data_collection" collection_type="list" format="data"
label="First collection (reference)"
help="Reference collection - will generate combined ISCC"/>
<param name="collection2" type="data_collection" collection_type="list" format="data"
label="Second collection (to compare)"
help="Collection to compare against reference - will generate combined ISCC"/>
</when>
</conditional>
<param name="threshold" type="integer" value="12" min="0" max="256"
label="Similarity threshold (Hamming distance)"
help="Maximum Hamming distance for similarity matching. 0-5: Nearly identical, 6-12: Likely similar (default), 13-20: Probably somewhat similar"/>
</inputs>

<outputs>
<data name="output_file" format="txt" label="${tool.name} on ${on_string}"/>
</outputs>

<tests>
<!-- Test 1: Pairwise file comparison -->
<test expect_num_outputs="1">
<conditional name="input_type">
<param name="input_selector" value="two_files"/>
<param name="file1" value="test1.png"/>
<param name="file2" value="test1.png"/>
</conditional>
<param name="threshold" value="12"/>
Member

For tighter testing, wouldn't it make sense to set this threshold to 0, since the two inputs are identical?

Contributor Author

In the ISCC-SUM CLI tool, the default similarity threshold is 12. The tool only reports a similarity score when the files fall within that threshold. As we discussed below, I will revise the output; I think it should always report the similarity score for all files.

<output name="output_file">
<assert_contents>
<has_text text="~00"/>
</assert_contents>
</output>
</test>
<!-- Test 2: Single collection comparison -->
<test expect_num_outputs="1">
<conditional name="input_type">
<param name="input_selector" value="collection"/>
<param name="file_collection">
<collection type="list">
<element name="file1" value="test1.png"/>
<element name="file2" value="test2.tiff"/>
<element name="file3" value="test2.tiff"/>
</collection>
</param>
</conditional>
<param name="threshold" value="12"/>
<output name="output_file">
<assert_contents>
<has_text text="~00"/>
Member

The output of the tool for this test case is:

ISCC:K4AOMGOGQJA4Y46PAC4YPPA63GKD5RVFPR7FU3I4OOEW44TYXNYOTMY *input_files/test1
ISCC:K4AGSPOSB5SS2X427WZ27QASTSBVTS55DXLMFDF7WOJKEOSTDEI3OXQ *input_files/test2
  ~00 ISCC:K4AGSPOSB5SS2X427WZ27QASTSBVTS55DXLMFDF7WOJKEOSTDEI3OXQ *input_files/test2 (1)

We can make the tests a bit stricter:

Suggested change
- <has_text text="~00"/>
+ <has_line_matching expression="^ISCC:[A-Z0-9]+ \*input_files/file1$"/>
+ <has_line_matching expression="^ISCC:[A-Z0-9]+ \*input_files/file2$"/>
+ <has_line_matching expression="^ +\~00 ISCC:[A-Z0-9]+ \*input_files/file3$"/>

I guess something similar can be done for lines 94 and 137.

</assert_contents>
</output>
</test>
<!-- Test 3: Two collections comparison -->
<test expect_num_outputs="1">
<conditional name="input_type">
<param name="input_selector" value="two_collections"/>
<param name="collection1">
<collection type="list">
<element name="sample1" value="test1.png"/>
<element name="sample2" value="test2.tiff"/>
</collection>
</param>
<param name="collection2">
<collection type="list">
<element name="sample1" value="test1.png"/>
<element name="sample2" value="test2.tiff"/>
</collection>
</param>
</conditional>
<param name="threshold" value="12"/>
<output name="output_file">
<assert_contents>
<has_text text="~00"/>
</assert_contents>
</output>
</test>
</tests>

<help><![CDATA[
**What it does**

Compares files or collections by their International Standard Content Code (ISCC) hashes to detect content similarity.

This tool can operate in three modes:

1. **Pairwise file comparison**: Compare two specific files
2. **Collection analysis**: Find all similar files within a collection
3. **Two collections comparison**: Compare two complete collections as units

**Similarity Measurement**

Similarity is measured using Hamming distance on the Data-Code component of ISCC codes.
Lower numbers indicate higher similarity:

- **0-5**: Nearly identical files
- **6-12**: Likely similar content (default threshold)
- **13-20**: Probably somewhat similar

**Input**

Choose one of:

- **Two files**: Individual files for pairwise comparison
- **One collection**: Find similar files within the collection
- **Two collections**: Compare collections as complete units

**Output**

**Mode 1 & 2** (Files/Collection): Similarity relationships showing reference files and similar matches with Hamming distance (~N)

Example::

document_v1.txt
~08 document_v2.txt
~12 document_draft.txt
Comment on lines +177 to +179
Member

Judging by what I observed with `planemo serve`, I think the output actually looks more like this, doesn't it?

Suggested change
- document_v1.txt
- ~08 document_v2.txt
- ~12 document_draft.txt
+ ISCC:K4A... *input_files/document_v1
+ ~08 ISCC:K4A... *input_files/document_v2
+ ~12 ISCC:K4A... *input_files/document_draft


**Mode 3** (Two collections): Similarity comparison of combined ISCCs

Example::

ISCC:K4A... *temp1.txt
~08 ISCC:K4A... *temp2.txt
Comment on lines +185 to +186
Member

What do the asterisks actually mean? They also occur in Mode 2.

I think this should be commented on in the help section.


**Use Cases**

- **Mode 1**: Compare two specific files for similarity
- **Mode 2**: Find duplicates/near-duplicates in large dataset (ISCC-SUM excels here!)
- **Mode 3**: Check if two collections are similar (but not exactly the same)

**Workflow: Verify then Compare**

Step 1: `Verify ISCC hash`_

.. _Verify ISCC hash: ?tool_id=toolshed.g2.bx.psu.edu/repos/imgteam/iscc_sum/iscc_sum_verify
Input: New collection + reference hash
Output: FAILED

Step 2: Compare Similarity (two collections mode)
Input: Reference collection + new collection
Output: ~08 (very similar, minor differences)

This tells you: "Not exact match, but content is very similar"

**More Information**

For more details about ISCC, visit: https://sum.iscc.codes/
]]></help>
<expand macro="citations" />
</tool>