diff --git a/tools/iscc-sum/README.md b/tools/iscc-sum/README.md
new file mode 100644
index 00000000..4f7fa8a1
--- /dev/null
+++ b/tools/iscc-sum/README.md
@@ -0,0 +1,156 @@
+# ISCC-SUM Tools for Galaxy
+
+A suite of Galaxy tools for generating, verifying, and comparing ISCC (International Standard Content Code) hashes. ISCC-SUM provides content-derived identifiers for file integrity verification and similarity detection.
+
+## Tools Overview
+
+### 1. Generate ISCC hash (`iscc_sum.xml`)
+**Purpose**: Create reference ISCC hashes for files
+
+**Modes**:
+- **Single file**: Generate one ISCC hash
+- **Collection (individual)**: Generate hash per file with element identifiers
+- **Collection (combined)**: Generate single hash for entire collection
+
+**Use when**: You need to create reference hashes for later verification or comparison
+
+---
+
+### 2. Verify ISCC hash (`iscc_verify.xml`)
+**Purpose**: Exact match verification - check if files are identical
+
+**Modes**:
+- **Single file**: Verify one file against expected hash
+- **Collection**: Verify entire collection as one unit (generates combined ISCC)
+
+**Use when**: You need to confirm files are EXACTLY the same (bit-for-bit)
+
+**Note**: Even minor modifications will cause verification to FAIL
+
+---
+
+### 3. Compare ISCC similarity (`iscc_similarity.xml`)
+**Purpose**: Content similarity detection - find related/modified files
+
+**Modes**:
+- **Two files**: Compare two specific files
+- **Collection**: Find all similar files within a collection
+- **Two collections**: Compare two entire collections as units (generates combined ISCC for each)
+
+**Use when**: You want to detect:
+- Near-duplicates (minor edits, format changes)
+- Different versions of same content
+- Related files (cropped images, edited documents)
+- Whether two entire datasets are similar (even if not exactly identical)
+
+**Key feature**: Works especially well on **large datasets** - ISCC-SUM is optimized for detecting similarity across many files efficiently
+
+---
+
+## Usage Scenarios
+
+### Scenario 1: Quick Collection Integrity Check
+**Goal**: Quickly verify if an entire collection has changed
+
+```
+Workflow:
+1. Generate ISCC (combined mode) → Input: Reference collection (100 files)
+   Output: Single ISCC hash
+
+2. [Transfer/storage/time passes]
+
+3. Verify ISCC → Input: New collection (100 files) + reference hash
+   Output: "Status: OK" or "Status: FAILED"
+```
+
+**Result**: You know instantly if the collection as a whole has changed (but not which specific files)
+
+---
+
+### Scenario 2: Dataset Similarity When Verification Fails
+**Goal**: Check if a large dataset matches reference exactly, and if not, how similar it is
+
+**Why this is important**: Sometimes you receive a dataset that's been modified but want to know if it's still what you expect it to be or how much it differs from the reference.
+
+```
+Workflow:
+1. Generate ISCC (combined mode) → Input: Reference collection (100 files)
+   Output: Single ISCC hash
+
+2. [Transfer/storage/time passes - dataset may have changed]
+
+3. Verify ISCC → Input: New collection (100 files) + reference hash
+   Output: "Status: FAILED - Hashes do not match"
+
+4. Compare similarity (two collections) → Input: Reference collection + New collection
+   Threshold: 12
+   Output: "Similarity: ~08 (Very similar, minor changes)"
+```
+
+**Result**:
+- You know the dataset is NOT identical (verification failed)
+- You know it's still very similar (Hamming distance = 8)
+- You can make an informed decision: Is it similar enough to use? Should you investigate the differences?
+
+**Use cases**:
+- Receiving scientific datasets from collaborators
+- Verifying backups that may have compression/format changes
+- Checking if processed data still matches original closely enough
+- Quality control where "close enough" is acceptable
+
+---
+
+### Scenario 3: Duplicate Detection in Large Dataset
+**Goal**: Find duplicate and near-duplicate files in large collection
+
+**Why ISCC-SUM excels here**: Traditional hash functions (MD5, SHA) only detect exact duplicates. ISCC-SUM detects **content similarity**, making it ideal for:
+- Image collections (same image with edits)
+- Document repositories (different versions, formats)
+- Genomic data (similar sequences with variations)
+
+```
+Workflow:
+1. Compare similarity → Input: Large collection (1000+ files)
+   Threshold: 12 (configurable)
+   Output: Groups of similar files
+
+Example output:
+  reference_image_001.png
+    ~00 duplicate_001.jpg (exact duplicate, different format)
+    ~08 edited_001.png (minor edits)
+
+  document_v1.txt
+    ~05 document_v2.txt (very similar)
+    ~12 document_draft.txt (moderate differences)
+```
+
+**Result**: Identify redundant files, track versions, find related content
+
+---
+
+### Scenario 4: Quality Control Pipeline
+**Goal**: Automated verification in data processing pipeline
+
+```
+Galaxy Workflow:
+1. [Data arrives] → Collection of files
+
+2. Generate ISCC (combined mode) → Create hash for collection
+
+3. Verify ISCC → Compare against expected reference
+   ↓
+   PASS: Continue workflow
+   FAIL: → 4.
Compare similarity (two collections)
+          ↓
+          Report similarity score (~08 = very similar, ~48 = very different)
+          Decide: Accept if similar enough, or reject and investigate
+```
+
+**Result**: Automated QC catches data integrity issues and helps determine if differences are acceptable
+
+---
+
+## More Information
+
+- ISCC specification: https://iscc.codes/
+- ISCC-SUM GitHub: https://github.com/iscc/iscc-sum
diff --git a/tools/iscc-sum/iscc_similarity.xml b/tools/iscc-sum/iscc_similarity.xml
new file mode 100644
index 00000000..9e9c82c2
--- /dev/null
+++ b/tools/iscc-sum/iscc_similarity.xml
@@ -0,0 +1,213 @@
+
+ with ISCC-SUM
+
+ macros.xml
+ creators.xml
+ + + + + + + + + +
+ '${output_file}'
+        #elif $input_type.input_selector == "two_collections":
+            ## Two collections comparison - generate combined ISCC for each
+            ISCC1=\$(cat
+            #for $file in $input_type.collection1:
+                '$file'
+            #end for
+            | iscc-sum - | cut -d':' -f2 | cut -d' ' -f1) &&
+
+            ISCC2=\$(cat
+            #for $file in $input_type.collection2:
+                '$file'
+            #end for
+            | iscc-sum - | cut -d':' -f2 | cut -d' ' -f1) &&
+
+            ## Create temp files for comparison
+            echo \$ISCC1 > temp1.txt &&
+            echo \$ISCC2 > temp2.txt &&
+
+            ## Use iscc-sum to compare similarity and output directly
+            iscc-sum --similar --threshold $threshold temp1.txt temp2.txt > '${output_file}' || true
+        #else:
+            ## Single collection - find similar files within
+            mkdir -p input_files &&
+            #for $file in $input_type.file_collection:
+                ln -s '$file' 'input_files/$file.element_identifier' &&
+            #end for
+            iscc-sum --similar --threshold $threshold input_files/* > '${output_file}'
+        #end if
+    ]]>
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
\ No newline at end of file
diff --git a/tools/iscc-sum/iscc_sum.xml b/tools/iscc-sum/iscc_sum.xml
index 3fc8eee3..c00e3338 100644
--- a/tools/iscc-sum/iscc_sum.xml
+++ b/tools/iscc-sum/iscc_sum.xml
@@ -15,12 +15,49 @@
 '${output_file}'
+        #if $input_type.input_selector == "single":
+            ## Single file mode
+            iscc-sum '${input_type.input_file}' | cut -d':' -f2 | cut -d' ' -f1 > '${output_file}'
+        #else:
+            ## Collection mode
+            #if $input_type.calculate_individual:
+                ## Calculate ISCC hash for each file individually
+                #for $input in $input_type.input_collection:
+                    echo '${input.element_identifier}' >> '${output_file}' &&
+                    iscc-sum '$input' | cut -d':' -f2 | cut -d' ' -f1 >> '${output_file}' &&
+                    echo '' >> '${output_file}' &&
+                #end for
+                true
+            #else:
+                ## Calculate single ISCC hash for all files together
+                cat
+                #for $input in $input_type.input_collection:
+                    '$input'
+                #end for
+                | iscc-sum - | cut -d':' -f2 | cut -d' ' -f1 > '${output_file}'
+            #end if
+        #end if
     ]]>
- + - + + + + + + + + + + + + +
@@ -28,53 +65,135 @@
+ - + + + + + + - + + + + - + + + - + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
diff --git a/tools/iscc-sum/iscc_verify.xml b/tools/iscc-sum/iscc_verify.xml
new file mode 100644
index 00000000..1bd5bd5f
--- /dev/null
+++ b/tools/iscc-sum/iscc_verify.xml
@@ -0,0 +1,194 @@
+
+ with ISCC-SUM
+
+ macros.xml
+ creators.xml
+ + + + + + + + + +
+ &2;
+            echo "Found: \${#EXPECTED} characters" >&2;
+            echo "Content: \$EXPECTED" >&2;
+            exit 1;
+        fi &&
+
+        ## Output verification report
+        #if $input_type.input_selector == "single":
+            echo "File: ${input_type.input_file.element_identifier}" > '${output_file}' &&
+        #else:
+            echo "Collection: ${input_type.input_collection.name}" > '${output_file}' &&
+        #end if
+        echo "Expected: \$EXPECTED" >> '${output_file}' &&
+        echo "Generated: \$GENERATED" >> '${output_file}' &&
+        echo "" >> '${output_file}' &&
+        if [ "\$GENERATED" = "\$EXPECTED" ]; then
+            echo "Status: OK - Hashes match" >> '${output_file}';
+        else
+            echo "Status: FAILED - Hashes do not match" >> '${output_file}';
+        fi
+    ]]>
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
\ No newline at end of file
diff --git a/tools/iscc-sum/macros.xml b/tools/iscc-sum/macros.xml
index 9b2a5c80..f46688d5 100644
--- a/tools/iscc-sum/macros.xml
+++ b/tools/iscc-sum/macros.xml
@@ -1,6 +1,6 @@
 0.1.0
- 0
+ 1
diff --git a/tools/iscc-sum/test-data/test1_iscc.txt b/tools/iscc-sum/test-data/test1_iscc.txt
new file mode 100644
index 00000000..42bb7c1c
--- /dev/null
+++ b/tools/iscc-sum/test-data/test1_iscc.txt
@@ -0,0 +1 @@
+K4AOMGOGQJA4Y46PAC4YPPA63GKD5RVFPR7FU3I4OOEW44TYXNYOTMY
\ No newline at end of file