BlackSnufkin edited this page Sep 6, 2025 · 1 revision

FuzzyHash Scanner

The FuzzyHash Scanner detects code reuse and similarity between files by splitting each file into blocks and comparing the blocks' ssdeep fuzzy hashes.

What It Does

The scanner builds a database of known files (tools, malware families, legitimate software) and compares new samples against this database to identify:

  • Code Reuse - Detect when payloads contain code from known tools
  • Tool Variants - Find modified versions of existing tools
  • Similar Techniques - Identify files using similar implementation patterns
  • Source Attribution - Link samples to known malware families or tool repositories

How It Works

Database Creation

create_db_from_folder() builds the signature database:

  1. File Discovery - Recursively scans the folder for files whose extensions match the configured list (exe, dll, bin)
  2. Source Detection - Uses find_git_root() to identify Git repositories and extract their remote URLs
  3. Block Processing - Calls _create_blocks() to split each file into 4 KB blocks and compute an ssdeep hash for every block
  4. Storage - Compresses and stores all file metadata, block hashes, and raw block data
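
The block-processing step above can be sketched as follows. This is a minimal illustration, not the scanner's actual _create_blocks(): the function name create_blocks, the dictionary keys, and the pluggable hash_fn parameter are all assumptions made for the example. To stay runnable with the standard library only, it uses SHA-256 as a stand-in; a real fuzzy hash function (for instance the hash function from the Python ssdeep bindings) would be passed in its place.

```python
import hashlib
from typing import Callable, Dict, List

BLOCK_SIZE = 4096  # 4 KB, the block size described above


def placeholder_hash(chunk: bytes) -> str:
    # Stand-in only: SHA-256 is NOT a fuzzy hash and cannot score similarity.
    # It is used here so the sketch runs without third-party packages.
    return hashlib.sha256(chunk).hexdigest()


def create_blocks(data: bytes,
                  hash_fn: Callable[[bytes], str] = placeholder_hash,
                  block_size: int = BLOCK_SIZE) -> List[Dict]:
    """Split data into fixed-size blocks and hash each one (hypothetical helper)."""
    blocks = []
    for index, start in enumerate(range(0, len(data), block_size)):
        chunk = data[start:start + block_size]
        blocks.append({
            "index": index,             # block position in the file
            "start": start,             # inclusive start offset
            "end": start + len(chunk),  # exclusive end offset
            "hash": hash_fn(chunk),
        })
    return blocks
```

The final block may be shorter than 4 KB when the file size is not a multiple of the block size, which is why the end offset is derived from the chunk length rather than from the block size.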

Similarity Analysis

analyze_files() compares target files against the database:

  1. Target Processing - Splits each target file into the same 4 KB block structure
  2. Block Matching - Uses _compare_blocks() to score target blocks against database blocks with ssdeep comparison
  3. Region Detection - Groups consecutive matching blocks into regions
  4. Results Ranking - Returns top 3 matches per file with similarity percentages
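
The region-detection and ranking steps above might look roughly like this. It is a sketch under stated assumptions: matches arrive as (source_block_index, target_block_index, score) tuples with scores on ssdeep's 0 to 100 scale, each target block has at most one match, and "consecutive" means both indices advance by one. The names group_regions, top_matches, and the result keys are hypothetical, not taken from the scanner.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 4096  # 4 KB blocks, as described above

Match = Tuple[int, int, float]  # (source_block_index, target_block_index, score)


def group_regions(matches: List[Match], block_size: int = BLOCK_SIZE) -> List[Dict]:
    """Merge runs of consecutive matching blocks into regions."""
    regions: List[Dict] = []
    run: List[Match] = []
    for match in sorted(matches, key=lambda m: (m[1], m[0])):
        contiguous = (bool(run)
                      and match[0] == run[-1][0] + 1
                      and match[1] == run[-1][1] + 1)
        if run and not contiguous:
            regions.append(_summarize(run, block_size))
            run = []
        run.append(match)
    if run:
        regions.append(_summarize(run, block_size))
    return regions


def _summarize(run: List[Match], block_size: int) -> Dict:
    scores = [score for _, _, score in run]
    return {
        "source_start": run[0][0] * block_size,
        "target_start": run[0][1] * block_size,
        # Assumes full-size blocks; a short final block would make this smaller.
        "length": len(run) * block_size,
        "similarity": sum(scores) / len(scores),  # average over the run
        "block_count": len(run),
    }


def top_matches(score_by_source: Dict[str, float], limit: int = 3):
    """Return the highest-scoring database entries, best first."""
    ranked = sorted(score_by_source.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:limit]
```

Requiring both indices to advance together keeps a region aligned in the source and the target; a looser rule (consecutive in the target only) would also be defensible and would report larger, less precise regions.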

Technical Implementation

Block-Level Analysis

Unlike traditional file hashing, which produces a single digest for the whole file, this scanner hashes and compares each 4 KB block separately:

  • Granular Detection - Finds partial matches even in heavily modified files
  • Region Identification - Shows exactly which file segments are similar
  • Content Preservation - Stores actual block data for hex/ASCII comparison
  • Flexibility - Detects similarities between files of very different sizes, because blocks are compared rather than whole files

Git Repository Integration

find_git_root() organizes files by source:

  • Parses .git/config to extract remote repository URLs
  • Converts SSH URLs to HTTPS format for consistency
  • Groups files by their origin (GitHub repos, private collections)
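
A rough sketch of the Git source detection described above, using only the standard library. It is not the scanner's find_git_root(): the helper names locate_git_root, ssh_to_https, and read_remote_url are hypothetical, as are the choices to read the "origin" remote and to drop a trailing ".git" from the converted URL.

```python
import configparser
import re
from pathlib import Path
from typing import Optional


def locate_git_root(path: Path) -> Optional[Path]:
    """Walk upward from path until a directory containing .git is found."""
    current = path if path.is_dir() else path.parent
    for candidate in [current, *current.parents]:
        if (candidate / ".git").is_dir():
            return candidate
    return None


def ssh_to_https(url: str) -> str:
    """Rewrite git@host:owner/repo.git as https://host/owner/repo."""
    match = re.match(r"^git@([^:]+):(.+?)(?:\.git)?$", url)
    if match:
        return f"https://{match.group(1)}/{match.group(2)}"
    return url  # already HTTPS, or a form this sketch does not handle


def read_remote_url(git_root: Path, remote: str = "origin") -> Optional[str]:
    """Read a remote's URL from .git/config and normalize it to HTTPS."""
    parser = configparser.ConfigParser(
        strict=False,          # tolerate repeated keys such as multiple fetch lines
        interpolation=None,    # '%' in values must not be interpreted
        allow_no_value=True,   # tolerate keys written without a value
        delimiters=("=",),     # ':' appears inside URLs and refspecs
    )
    parser.read(git_root / ".git" / "config")
    section = f'remote "{remote}"'
    if parser.has_option(section, "url"):
        return ssh_to_https(parser.get(section, "url"))
    return None
```

Normalizing to one URL form means files cloned over SSH and over HTTPS from the same repository end up grouped under the same source.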

Database Compression

The database is stored compactly:

  • Uses zlib compression on the entire database
  • Shortens JSON keys to minimize storage size
  • Only decompresses block data when needed for display
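
The storage approach above can be illustrated with a small round trip. This is a sketch only: the key map, the record layout, and the pack/unpack names are assumptions for the example, and the real database format may differ.

```python
import json
import zlib
from typing import Dict, List

# Hypothetical long-to-short key mapping; the scanner's actual keys are not shown here.
KEY_MAP = {"path": "p", "md5": "m", "size": "s", "blocks": "b"}
REVERSE_MAP = {short: long for long, short in KEY_MAP.items()}


def pack(records: List[Dict]) -> bytes:
    """Shorten top-level keys, serialize to compact JSON, then zlib-compress."""
    short = [{KEY_MAP.get(k, k): v for k, v in record.items()} for record in records]
    text = json.dumps(short, separators=(",", ":"))  # no padding whitespace
    return zlib.compress(text.encode("utf-8"))


def unpack(blob: bytes) -> List[Dict]:
    """Reverse pack(): decompress, parse, and restore the long key names."""
    short = json.loads(zlib.decompress(blob).decode("utf-8"))
    return [{REVERSE_MAP.get(k, k): v for k, v in record.items()} for record in short]
```

Short keys matter mostly before compression; zlib already removes much of the repetition, so the two techniques overlap rather than multiply.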

Data Structures

FileMetadata

  • Path - Original file location
  • MD5 - File hash for deduplication
  • Size - File size in bytes
  • Blocks - Array of BlockMetadata objects
  • Date Added - Timestamp of database entry

BlockMetadata

  • Index - Block position in file
  • Hash - ssdeep fuzzy hash
  • Offsets - Start and end positions
  • Data - Compressed raw block content

MatchingRegion

  • Source/Target Offsets - Region boundaries in both files
  • Length - Total region size
  • Similarity - Average similarity score of the blocks in the region
  • Block Count - Number of blocks in region
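
For orientation, the three structures above could be modeled as Python dataclasses along these lines. The field names and types are inferred from the descriptions and are assumptions; the scanner's own definitions may name or type them differently.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class BlockMetadata:
    index: int   # block position in the file
    hash: str    # ssdeep fuzzy hash of the block
    start: int   # start offset (assumed inclusive)
    end: int     # end offset (assumed exclusive)
    data: bytes  # compressed raw block content


@dataclass
class FileMetadata:
    path: str    # original file location
    md5: str     # file hash used for deduplication
    size: int    # file size in bytes
    blocks: List[BlockMetadata] = field(default_factory=list)
    # Timestamp of the database entry; ISO 8601 in UTC is an assumed format.
    date_added: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


@dataclass
class MatchingRegion:
    source_start: int   # region boundaries in the database file
    source_end: int
    target_start: int   # region boundaries in the analyzed file
    target_end: int
    length: int         # total region size in bytes
    similarity: float   # average similarity score
    block_count: int    # number of blocks in the region
```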

Use Cases

  • Payload Analysis - Determine if a sample contains code from known tools
  • Attribution - Link malware samples to specific tool families or authors
  • Variant Detection - Find modified versions of existing tools in your collection
  • Code Reuse Research - Study how code gets reused across different tools and families

File Storage

Database Location: Utils\DoppelgangerDB\FuzzyHash\FuzzyHash.db
