Add more wrappers for ISCC-SUM #162
base: master
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,156 @@ | ||
| # ISCC-SUM Tools for Galaxy | ||
|
|
||
| A suite of Galaxy tools for generating, verifying, and comparing ISCC (International Standard Content Code) hashes. ISCC-SUM provides content-derived identifiers for file integrity verification and similarity detection. | ||
|
|
||
| ## Tools Overview | ||
|
|
||
| ### 1. Generate ISCC hash (`iscc_sum.xml`) | ||
| **Purpose**: Create reference ISCC hashes for files | ||
|
|
||
| **Modes**: | ||
| - **Single file**: Generate one ISCC hash | ||
| - **Collection (individual)**: Generate hash per file with element identifiers | ||
| - **Collection (combined)**: Generate single hash for entire collection | ||
|
|
||
| **Use when**: You need to create reference hashes for later verification or comparison | ||
|
|
||
| --- | ||
|
|
||
| ### 2. Verify ISCC hash (`iscc_verify.xml`) | ||
| **Purpose**: Exact match verification - check if files are identical | ||
|
|
||
| **Modes**: | ||
| - **Single file**: Verify one file against expected hash | ||
| - **Collection**: Verify entire collection as one unit (generates combined ISCC) | ||
|
|
||
| **Use when**: You need to confirm files are EXACTLY the same (bit-for-bit) | ||
|
|
||
| **Note**: Even minor modifications will cause verification to FAIL | ||
|
|
||
| --- | ||
|
|
||
| ### 3. Compare ISCC similarity (`iscc_similarity.xml`) | ||
| **Purpose**: Content similarity detection - find related/modified files | ||
|
|
||
| **Modes**: | ||
| - **Two files**: Compare two specific files | ||
| - **Collection**: Find all similar files within a collection | ||
| - **Two collections**: Compare two entire collections as units (generates combined ISCC for each) | ||
|
|
||
| **Use when**: You want to detect: | ||
| - Near-duplicates (minor edits, format changes) | ||
| - Different versions of the same content | ||
| - Related files (cropped images, edited documents) | ||
| - Whether two entire datasets are similar (even if not exactly identical) | ||
|
|
||
| **Key feature**: Works especially well on **large datasets** - ISCC-SUM is optimized for detecting similarity across many files efficiently | ||
|
|
||
| --- | ||
|
|
||
| ## Usage Scenarios | ||
|
|
||
| ### Scenario 1: Quick Collection Integrity Check | ||
| **Goal**: Quickly verify if an entire collection has changed | ||
|
|
||
| ``` | ||
| Workflow: | ||
| 1. Generate ISCC (combined mode) → Input: Reference collection (100 files) | ||
| Output: Single ISCC hash | ||
|
|
||
| 2. [Transfer/storage/time passes] | ||
|
|
||
| 3. Verify ISCC → Input: New collection (100 files) + reference hash | ||
| Output: "Status: OK" or "Status: FAILED" | ||
| ``` | ||
|
|
||
| **Result**: You know instantly if the collection as a whole has changed (but not which specific files) | ||
|
|
||
| --- | ||
|
|
||
| ### Scenario 2: Dataset Similarity When Verification Fails | ||
| **Goal**: Check if a large dataset matches the reference exactly, and if not, how similar it is | ||
|
|
||
| **Why this is important**: Sometimes you receive a dataset that has been modified and want to know whether it is still what you expect, or how much it differs from the reference. | ||
|
|
||
| ``` | ||
| Workflow: | ||
| 1. Generate ISCC (combined mode) → Input: Reference collection (100 files) | ||
| Output: Single ISCC hash | ||
|
|
||
| 2. [Transfer/storage/time passes - dataset may have changed] | ||
|
|
||
| 3. Verify ISCC → Input: New collection (100 files) + reference hash | ||
| Output: "Status: FAILED - Hashes do not match" | ||
|
|
||
| 4. Compare similarity (two collections) → Input: Reference collection + New collection | ||
| Threshold: 12 | ||
| Output: "Similarity: ~08 (Very similar, minor changes)" | ||
| ``` | ||
|
|
||
| **Result**: | ||
| - You know the dataset is NOT identical (verification failed) | ||
| - You know it's still very similar (Hamming distance = 8) | ||
| - You can make an informed decision: Is it similar enough to use? Should you investigate the differences? | ||
|
|
||
| **Use cases**: | ||
| - Receiving scientific datasets from collaborators | ||
| - Verifying backups that may have compression/format changes | ||
| - Checking if processed data still matches original closely enough | ||
| - Quality control where "close enough" is acceptable | ||
|
|
||
| --- | ||
|
|
||
| ### Scenario 3: Duplicate Detection in Large Dataset | ||
| **Goal**: Find duplicate and near-duplicate files in a large collection | ||
|
|
||
| **Why ISCC-SUM excels here**: Traditional hash functions (MD5, SHA) only detect exact duplicates. ISCC-SUM detects **content similarity**, making it ideal for: | ||
| - Image collections (same image with edits) | ||
| - Document repositories (different versions, formats) | ||
| - Genomic data (similar sequences with variations) | ||
|
|
||
| ``` | ||
| Workflow: | ||
| 1. Compare similarity → Input: Large collection (1000+ files) | ||
| Threshold: 12 (configurable) | ||
| Output: Groups of similar files | ||
|
|
||
| Example output: | ||
| reference_image_001.png | ||
| ~00 duplicate_001.jpg (exact duplicate, different format) | ||
| ~08 edited_001.png (minor edits) | ||
|
|
||
| document_v1.txt | ||
| ~05 document_v2.txt (very similar) | ||
| ~12 document_draft.txt (moderate differences) | ||
| ``` | ||
|
|
||
| **Result**: Identify redundant files, track versions, find related content | ||
|
|
||
| --- | ||
|
|
||
| ### Scenario 4: Quality Control Pipeline | ||
| **Goal**: Automated verification in a data processing pipeline | ||
|
|
||
| ``` | ||
| Galaxy Workflow: | ||
| 1. [Data arrives] → Collection of files | ||
|
|
||
| 2. Generate ISCC (combined mode) → Create hash for collection | ||
|
|
||
| 3. Verify ISCC → Compare against expected reference | ||
| ↓ | ||
| PASS: Continue workflow | ||
| FAIL: → 4. Compare similarity (two collections) | ||
| ↓ | ||
| Report similarity score (~08 = very similar, ~48 = very different) | ||
| Decide: Accept if similar enough, or reject and investigate | ||
| ``` | ||
|
|
||
| **Result**: Automated QC catches data integrity issues and helps determine if differences are acceptable | ||
|
|
||
| --- | ||
|
|
||
| ## More Information | ||
|
|
||
| - ISCC specification: https://iscc.codes/ | ||
| - ISCC-SUM GitHub: https://github.com/iscc/iscc-sum |
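
The grouped listing in Scenario 3 (a reference file followed by `~NN` matches) lends itself to post-processing. The following is a minimal sketch only: it assumes the output looks like the README example, with each match line starting with `~NN` followed by a file name. The function name `parse_similarity_groups` is hypothetical, and the real CLI output may differ (the review discussion on this PR notes that it does in places).

```python
import re
from collections import defaultdict

# Matches a similarity line such as "~08 edited_001.png". The "~NN name"
# shape is copied from the README example and is an assumption about the
# real CLI output.
MATCH_LINE = re.compile(r"^~(\d+)\s+(\S+)")


def parse_similarity_groups(text):
    """Group "~NN name" lines under the reference line that precedes them."""
    groups = defaultdict(list)
    reference = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = MATCH_LINE.match(line)
        if match is None:
            # Any non-match line is treated as a new reference file.
            reference = line
            groups[reference]  # register the reference even without matches
        elif reference is not None:
            groups[reference].append((int(match.group(1)), match.group(2)))
    return dict(groups)


example = """
reference_image_001.png
  ~00 duplicate_001.jpg
  ~08 edited_001.png

document_v1.txt
  ~05 document_v2.txt
  ~12 document_draft.txt
"""

for ref, matches in parse_similarity_groups(example).items():
    print(ref, matches)
```

Keying on the `~` prefix rather than on indentation keeps the sketch usable even if leading whitespace is lost when the output is copied around.
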
| Original file line number | Diff line number | Diff line change | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,213 @@ | ||||||||||||||
| <tool id="iscc_sum_compare" name="Compare ISCC hash similarity" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="24.1"> | ||||||||||||||
| <description>with ISCC-SUM</description> | ||||||||||||||
| <macros> | ||||||||||||||
| <import>macros.xml</import> | ||||||||||||||
| <import>creators.xml</import> | ||||||||||||||
| </macros> | ||||||||||||||
| <expand macro="requirements" /> | ||||||||||||||
| <expand macro="version_command" /> | ||||||||||||||
| <creator> | ||||||||||||||
| <expand macro="creators/iscc" /> | ||||||||||||||
| <expand macro="creators/lco" /> | ||||||||||||||
| <expand macro="creators/maartenpaul" /> | ||||||||||||||
| <expand macro="creators/etzm" /> | ||||||||||||||
| </creator> | ||||||||||||||
|
|
||||||||||||||
| <command detect_errors="exit_code"><![CDATA[ | ||||||||||||||
| #if $input_type.input_selector == "two_files": | ||||||||||||||
| ## Pairwise file comparison | ||||||||||||||
| iscc-sum --similar --threshold $threshold '$input_type.file1' '$input_type.file2' > '${output_file}' | ||||||||||||||
| #elif $input_type.input_selector == "two_collections": | ||||||||||||||
| ## Two collections comparison - generate combined ISCC for each | ||||||||||||||
| ISCC1=\$(cat | ||||||||||||||
| #for $file in $input_type.collection1: | ||||||||||||||
| '$file' | ||||||||||||||
| #end for | ||||||||||||||
| | iscc-sum - | cut -d':' -f2 | cut -d' ' -f1) && | ||||||||||||||
|
|
||||||||||||||
| ISCC2=\$(cat | ||||||||||||||
| #for $file in $input_type.collection2: | ||||||||||||||
| '$file' | ||||||||||||||
| #end for | ||||||||||||||
| | iscc-sum - | cut -d':' -f2 | cut -d' ' -f1) && | ||||||||||||||
|
|
||||||||||||||
| ## Create temp files for comparison | ||||||||||||||
| echo \$ISCC1 > temp1.txt && | ||||||||||||||
| echo \$ISCC2 > temp2.txt && | ||||||||||||||
|
|
||||||||||||||
| ## Use iscc-sum to compare similarity and output directly | ||||||||||||||
| iscc-sum --similar --threshold $threshold temp1.txt temp2.txt > '${output_file}' || true | ||||||||||||||
| #else: | ||||||||||||||
| ## Single collection - find similar files within | ||||||||||||||
| mkdir -p input_files && | ||||||||||||||
| #for $file in $input_type.file_collection: | ||||||||||||||
| ln -s '$file' 'input_files/$file.element_identifier' && | ||||||||||||||
| #end for | ||||||||||||||
| iscc-sum --similar --threshold $threshold input_files/* > '${output_file}' | ||||||||||||||
| #end if | ||||||||||||||
| ]]></command> | ||||||||||||||
|
|
||||||||||||||
| <inputs> | ||||||||||||||
| <conditional name="input_type"> | ||||||||||||||
| <param name="input_selector" type="select" label="Input type"> | ||||||||||||||
| <option value="two_files">Compare two datasets</option> | ||||||||||||||
| <option value="collection">Find similar datasets in collection</option> | ||||||||||||||
| <option value="two_collections">Compare two collections</option> | ||||||||||||||
| </param> | ||||||||||||||
| <when value="two_files"> | ||||||||||||||
| <param name="file1" type="data" format="data" label="First file"/> | ||||||||||||||
| <param name="file2" type="data" format="data" label="Second file"/> | ||||||||||||||
| </when> | ||||||||||||||
| <when value="collection"> | ||||||||||||||
| <param name="file_collection" type="data_collection" collection_type="list" | ||||||||||||||
| format="data" label="File collection to compare"/> | ||||||||||||||
| </when> | ||||||||||||||
| <when value="two_collections"> | ||||||||||||||
| <param name="collection1" type="data_collection" collection_type="list" format="data" | ||||||||||||||
| label="First collection (reference)" | ||||||||||||||
| help="Reference collection - will generate combined ISCC"/> | ||||||||||||||
| <param name="collection2" type="data_collection" collection_type="list" format="data" | ||||||||||||||
| label="Second collection (to compare)" | ||||||||||||||
| help="Collection to compare against reference - will generate combined ISCC"/> | ||||||||||||||
| </when> | ||||||||||||||
| </conditional> | ||||||||||||||
| <param name="threshold" type="integer" value="12" min="0" max="256" | ||||||||||||||
| label="Similarity threshold (Hamming distance)" | ||||||||||||||
| help="Maximum Hamming distance for similarity matching. 0-5: Nearly identical, 6-12: Likely similar (default), 13-20: Probably somewhat similar"/> | ||||||||||||||
| </inputs> | ||||||||||||||
|
|
||||||||||||||
| <outputs> | ||||||||||||||
| <data name="output_file" format="txt" label="${tool.name} on ${on_string}"/> | ||||||||||||||
| </outputs> | ||||||||||||||
|
|
||||||||||||||
| <tests> | ||||||||||||||
| <!-- Test 1: Pairwise file comparison --> | ||||||||||||||
| <test expect_num_outputs="1"> | ||||||||||||||
| <conditional name="input_type"> | ||||||||||||||
| <param name="input_selector" value="two_files"/> | ||||||||||||||
| <param name="file1" value="test1.png"/> | ||||||||||||||
| <param name="file2" value="test1.png"/> | ||||||||||||||
| </conditional> | ||||||||||||||
| <param name="threshold" value="12"/> | ||||||||||||||
|
Member
For tighter testing, wouldn't it make sense to set this threshold to 0, since the two inputs are identical?
Contributor (Author)
In the ISCC-SUM CLI tool, the default similarity threshold is 12. It returns a similarity score only when the files are similar. As we discussed below, I will revise the output. I think it should always output the similarity score for all files.
|
||||||||||||||
| <output name="output_file"> | ||||||||||||||
| <assert_contents> | ||||||||||||||
| <has_text text="~00"/> | ||||||||||||||
| </assert_contents> | ||||||||||||||
| </output> | ||||||||||||||
| </test> | ||||||||||||||
| <!-- Test 2: Single collection comparison --> | ||||||||||||||
| <test expect_num_outputs="1"> | ||||||||||||||
| <conditional name="input_type"> | ||||||||||||||
| <param name="input_selector" value="collection"/> | ||||||||||||||
| <param name="file_collection"> | ||||||||||||||
| <collection type="list"> | ||||||||||||||
| <element name="file1" value="test1.png"/> | ||||||||||||||
| <element name="file2" value="test2.tiff"/> | ||||||||||||||
| <element name="file3" value="test2.tiff"/> | ||||||||||||||
| </collection> | ||||||||||||||
| </param> | ||||||||||||||
| </conditional> | ||||||||||||||
| <param name="threshold" value="12"/> | ||||||||||||||
| <output name="output_file"> | ||||||||||||||
| <assert_contents> | ||||||||||||||
| <has_text text="~00"/> | ||||||||||||||
|
Member
The output of the tool for this test case is: We can make the tests a bit stricter:
Suggested change
I guess something similar can be done for lines 94 and 137.
|
||||||||||||||
| </assert_contents> | ||||||||||||||
| </output> | ||||||||||||||
| </test> | ||||||||||||||
| <!-- Test 3: Two collections comparison --> | ||||||||||||||
| <test expect_num_outputs="1"> | ||||||||||||||
| <conditional name="input_type"> | ||||||||||||||
| <param name="input_selector" value="two_collections"/> | ||||||||||||||
| <param name="collection1"> | ||||||||||||||
| <collection type="list"> | ||||||||||||||
| <element name="sample1" value="test1.png"/> | ||||||||||||||
| <element name="sample2" value="test2.tiff"/> | ||||||||||||||
| </collection> | ||||||||||||||
| </param> | ||||||||||||||
| <param name="collection2"> | ||||||||||||||
| <collection type="list"> | ||||||||||||||
| <element name="sample1" value="test1.png"/> | ||||||||||||||
| <element name="sample2" value="test2.tiff"/> | ||||||||||||||
| </collection> | ||||||||||||||
| </param> | ||||||||||||||
| </conditional> | ||||||||||||||
| <param name="threshold" value="12"/> | ||||||||||||||
| <output name="output_file"> | ||||||||||||||
| <assert_contents> | ||||||||||||||
| <has_text text="~00"/> | ||||||||||||||
| </assert_contents> | ||||||||||||||
| </output> | ||||||||||||||
| </test> | ||||||||||||||
| </tests> | ||||||||||||||
|
|
||||||||||||||
| <help><![CDATA[ | ||||||||||||||
| **What it does** | ||||||||||||||
maartenpaul marked this conversation as resolved.
|
||||||||||||||
|
|
||||||||||||||
| Compares files or collections by their International Standard Content Code (ISCC) hashes to detect content similarity. | ||||||||||||||
|
|
||||||||||||||
| This tool can operate in three modes: | ||||||||||||||
|
|
||||||||||||||
| 1. **Pairwise file comparison**: Compare two specific files | ||||||||||||||
| 2. **Collection analysis**: Find all similar files within a collection | ||||||||||||||
| 3. **Two collections comparison**: Compare two complete collections as units | ||||||||||||||
|
|
||||||||||||||
| **Similarity Measurement** | ||||||||||||||
|
|
||||||||||||||
| Similarity is measured using Hamming distance on the Data-Code component of ISCC codes. | ||||||||||||||
| Lower numbers indicate higher similarity: | ||||||||||||||
|
|
||||||||||||||
| - **0-5**: Nearly identical files | ||||||||||||||
| - **6-12**: Likely similar content (default threshold) | ||||||||||||||
| - **13-20**: Probably somewhat similar | ||||||||||||||
|
|
||||||||||||||
| **Input** | ||||||||||||||
|
|
||||||||||||||
| Choose one of: | ||||||||||||||
|
|
||||||||||||||
| - **Two files**: Individual files for pairwise comparison | ||||||||||||||
| - **One collection**: Find similar files within the collection | ||||||||||||||
| - **Two collections**: Compare collections as complete units | ||||||||||||||
|
|
||||||||||||||
| **Output** | ||||||||||||||
|
|
||||||||||||||
| **Mode 1 & 2** (Files/Collection): Similarity relationships showing reference files and similar matches with Hamming distance (~N) | ||||||||||||||
|
|
||||||||||||||
| Example:: | ||||||||||||||
|
|
||||||||||||||
| document_v1.txt | ||||||||||||||
| ~08 document_v2.txt | ||||||||||||||
| ~12 document_draft.txt | ||||||||||||||
|
Comment on lines +177 to +179
Member
Judging by what I observed with
Suggested change
|
||||||||||||||
|
|
||||||||||||||
| **Mode 3** (Two collections): Similarity comparison of combined ISCCs | ||||||||||||||
|
|
||||||||||||||
| Example:: | ||||||||||||||
|
|
||||||||||||||
| ISCC:K4A... *temp1.txt | ||||||||||||||
| ~08 ISCC:K4A... *temp2.txt | ||||||||||||||
|
Comment on lines +185 to +186
Member
What do the asterisks actually mean? They also occur in Mode 2. I think this should be commented on in the help section.
|
||||||||||||||
|
|
||||||||||||||
| **Use Cases** | ||||||||||||||
|
|
||||||||||||||
| - **Mode 1**: Compare two specific files for similarity | ||||||||||||||
| - **Mode 2**: Find duplicates/near-duplicates in a large dataset (ISCC-SUM excels here!) | ||||||||||||||
| - **Mode 3**: Check if two collections are similar (but not exactly the same) | ||||||||||||||
|
|
||||||||||||||
| **Workflow: Verify then Compare** | ||||||||||||||
|
|
||||||||||||||
| Step 1: `Verify ISCC hash`_ | ||||||||||||||
|
|
||||||||||||||
| .. _Verify ISCC hash: ?tool_id=toolshed.g2.bx.psu.edu/repos/imgteam/iscc_sum/iscc_sum_verify | ||||||||||||||
| Input: New collection + reference hash | ||||||||||||||
| Output: FAILED | ||||||||||||||
|
|
||||||||||||||
| Step 2: Compare Similarity (two collections mode) | ||||||||||||||
| Input: Reference collection + new collection | ||||||||||||||
| Output: ~08 (very similar, minor differences) | ||||||||||||||
|
|
||||||||||||||
| This tells you: "Not exact match, but content is very similar" | ||||||||||||||
|
|
||||||||||||||
| **More Information** | ||||||||||||||
|
|
||||||||||||||
| For more details about ISCC, visit: https://sum.iscc.codes/ | ||||||||||||||
| ]]></help> | ||||||||||||||
| <expand macro="citations" /> | ||||||||||||||
| </tool> | ||||||||||||||
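
The help text above describes similarity as a Hamming distance and gives three interpretation bands (0-5, 6-12, 13-20). As a rough illustration of what those numbers mean, here is a minimal sketch that counts differing bits between two equal-length byte strings and maps the count onto those bands. The byte values are toy inputs, not real ISCC Data-Codes, and the helper names `hamming_distance` and `describe` are hypothetical; how `iscc-sum` extracts and compares the Data-Code internally is not shown here.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length byte strings differ."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    # XOR leaves a 1 bit wherever the inputs differ; count those bits.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def describe(distance: int) -> str:
    """Map a distance onto the bands listed in the tool's help text."""
    if distance <= 5:
        return "nearly identical"
    if distance <= 12:
        return "likely similar"
    if distance <= 20:
        return "probably somewhat similar"
    return "outside the documented bands"


# Toy inputs (assumption: not real ISCC Data-Code bits).
a = bytes.fromhex("ff00ff00")
b = bytes.fromhex("ff00ff0f")
d = hamming_distance(a, b)
print(d, describe(d))
```

The two toy inputs differ only in the low four bits of the last byte, so the distance is 4, which falls in the "nearly identical" band.
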
The tests only cover cases for working with images, right? As of today, is it supposed to work with other data too? If so, it would make sense to add some more tests with other data. Otherwise, I'd suggest restricting the input formats to images for now, to avoid negative user experiences:
Same also in lines 63, 66, 69, in tools/iscc-sum/iscc_verify.xml, and tools/iscc-sum/iscc_sum.xml.
Yes, it should work for any data. I have a FASTA file in the test data and can make sure it is used in more of the tests.