FuzzyHash Scanner

The FuzzyHash Scanner detects code reuse and similarities between files by analyzing their content at the block level using ssdeep fuzzy hashing.

What It Does

The scanner builds a database of known files (tools, malware families, legitimate software) and compares new samples against this database to identify:

Code Reuse - Detect when payloads contain code from known tools
Tool Variants - Find modified versions of existing tools
Similar Techniques - Identify files using similar implementation patterns
Source Attribution - Link samples to known malware families or tool repositories

How It Works

Database Creation

create_db_from_folder() builds the signature database:

File Discovery - Recursively scans the folder for files matching the configured extensions (exe, dll, bin)
Source Detection - Uses find_git_root() to identify Git repositories and extract remote URLs
Block Processing - Calls _create_blocks() to split files into 4KB blocks with ssdeep hashes
Storage - Compresses and stores all file metadata, block hashes, and raw block data
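
The block-processing step can be sketched as follows. This is a minimal illustration, not the scanner's actual _create_blocks() code. The function name create_blocks, the dictionary keys, and the placeholder_hash stand-in are all assumptions made for the example. The stand-in uses SHA-256 only so the sketch runs without extra dependencies. It is not a fuzzy hash.

```python
import hashlib
from typing import Callable, Dict, List

BLOCK_SIZE = 4096  # 4KB blocks, as described above


def placeholder_hash(data: bytes) -> str:
    """Hypothetical stand-in for an ssdeep hash. NOT a fuzzy hash."""
    return hashlib.sha256(data).hexdigest()


def create_blocks(data: bytes,
                  hash_fn: Callable[[bytes], str] = placeholder_hash,
                  block_size: int = BLOCK_SIZE) -> List[Dict]:
    """Split data into fixed-size blocks and hash each block separately."""
    blocks = []
    for index, start in enumerate(range(0, len(data), block_size)):
        chunk = data[start:start + block_size]
        blocks.append({
            "index": index,              # block position in the file
            "start": start,              # inclusive start offset
            "end": start + len(chunk),   # exclusive end offset
            "hash": hash_fn(chunk),
        })
    return blocks
```

If the ssdeep Python bindings are installed, passing hash_fn=ssdeep.hash should produce fuzzy hashes instead of the placeholder digests. The final block is shorter than 4KB whenever the file size is not a multiple of the block size.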

Similarity Analysis

analyze_files() compares target files against the database:

Target Processing - Creates the same 4KB block structure for target files
Block Matching - Uses _compare_blocks() to find similarities with ssdeep comparison
Region Detection - Groups consecutive matching blocks into regions
Results Ranking - Returns the top 3 matches per file, each with a similarity percentage
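
The region-detection and ranking steps might look like the sketch below. It assumes block matches arrive as (source_index, target_index, score) tuples and treats two matches as consecutive when both indices advance by one. The scanner's real grouping rule is not documented here, so the names group_regions and rank_matches and that rule are assumptions.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 4096  # 4KB blocks, as described above

Match = Tuple[int, int, int]  # (source block index, target block index, score 0-100)


def _to_region(run: List[Match]) -> Dict:
    """Summarize a run of consecutive block matches as one region."""
    return {
        "source_start": run[0][0] * BLOCK_SIZE,
        "target_start": run[0][1] * BLOCK_SIZE,
        "length": len(run) * BLOCK_SIZE,
        "similarity": sum(score for _, _, score in run) / len(run),
        "block_count": len(run),
    }


def group_regions(matches: List[Match]) -> List[Dict]:
    """Group matches whose source and target indices both advance by one."""
    regions: List[Dict] = []
    run: List[Match] = []
    for match in sorted(matches):
        if run and match[0] == run[-1][0] + 1 and match[1] == run[-1][1] + 1:
            run.append(match)
        else:
            if run:
                regions.append(_to_region(run))
            run = [match]
    if run:
        regions.append(_to_region(run))
    return regions


def rank_matches(scores: Dict[str, float], limit: int = 3) -> List[Tuple[str, float]]:
    """Return the best-scoring database files, highest similarity first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:limit]
```

This simplified grouping only looks at the previous match in sorted order, so a source block that matches several target blocks will break a run. A fuller implementation would need to track candidate runs separately.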

Technical Implementation

Block-Level Analysis

Unlike whole-file hashing, which yields a single hash per file, this scanner hashes each 4KB block separately:

Granular Detection - Finds partial matches even in heavily modified files
Region Identification - Shows exactly which file segments are similar
Content Preservation - Stores actual block data for hex/ASCII comparison
Flexibility - Detects similarities regardless of file size differences

Git Repository Integration

find_git_root() organizes files by source:

Parses .git/config to extract remote repository URLs
Converts SSH URLs to HTTPS format for consistency
Groups files by their origin (GitHub repos, private collections)
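
A rough sketch of this lookup is shown below. It is not the scanner's actual find_git_root() code. The helper names read_remote_url and ssh_to_https are hypothetical, only the scp-style git@host:owner/repo form is converted, and stripping the trailing .git suffix is an assumption made here for consistent grouping.

```python
import re
from pathlib import Path
from typing import Optional


def ssh_to_https(url: str) -> str:
    """Convert git@host:owner/repo(.git) to https://host/owner/repo."""
    match = re.match(r"^git@([^:]+):(.+?)(?:\.git)?$", url)
    if match:
        return f"https://{match.group(1)}/{match.group(2)}"
    return url  # other URL forms are returned unchanged


def find_git_root(path: Path) -> Optional[Path]:
    """Walk upward from path until a directory containing .git/config is found."""
    path = path.resolve()
    for candidate in [path, *path.parents]:
        if (candidate / ".git" / "config").is_file():
            return candidate
    return None


def read_remote_url(git_root: Path) -> Optional[str]:
    """Return the first remote URL listed in .git/config, normalized to HTTPS."""
    in_remote = False
    for raw_line in (git_root / ".git" / "config").read_text().splitlines():
        line = raw_line.strip()
        if line.startswith("["):
            # Section header, e.g. [remote "origin"]
            in_remote = line.startswith("[remote ")
        elif in_remote:
            key, _, value = line.partition("=")
            if key.strip() == "url":
                return ssh_to_https(value.strip())
    return None
```

Files for which find_git_root() returns None could then fall into a separate group, such as a private collection.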

Database Compression

All data is stored efficiently:

Uses zlib compression on the entire database
Shortens JSON keys to minimize storage size
Only decompresses block data when needed for display
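
The combination of shortened JSON keys and zlib compression could be sketched like this. The key mapping and the pack/unpack function names are illustrative assumptions. The scanner's real on-disk format is not specified here.

```python
import json
import zlib
from typing import Dict, List

# Hypothetical mapping from descriptive keys to short keys
KEY_MAP = {"path": "p", "md5": "m", "size": "s", "blocks": "b"}
REVERSE_MAP = {short: long for long, short in KEY_MAP.items()}


def pack(records: List[Dict]) -> bytes:
    """Shorten keys, serialize to compact JSON, then zlib-compress."""
    short = [{KEY_MAP.get(k, k): v for k, v in rec.items()} for rec in records]
    payload = json.dumps(short, separators=(",", ":")).encode("utf-8")
    return zlib.compress(payload)


def unpack(blob: bytes) -> List[Dict]:
    """Reverse pack(): decompress, parse JSON, restore descriptive keys."""
    short = json.loads(zlib.decompress(blob).decode("utf-8"))
    return [{REVERSE_MAP.get(k, k): v for k, v in rec.items()} for rec in short]
```

To decompress block data only on demand, one option is to store each block's raw bytes as its own zlib-compressed value, so that loading the metadata does not require inflating every block.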

Data Structures

FileMetadata

Path - Original file location
MD5 - File hash for deduplication
Size - File size in bytes
Blocks - Array of BlockMetadata objects
Date Added - Timestamp of database entry

BlockMetadata

Index - Block position in file
Hash - ssdeep fuzzy hash
Offsets - Start and end byte positions within the file
Data - Compressed raw block content

MatchingRegion

Source/Target Offsets - Region boundaries in both files
Length - Total region size
Similarity - Average similarity score across the blocks in the region
Block Count - Number of blocks in region
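
For orientation, the three structures above could be modeled as Python dataclasses along these lines. The field names and types are assumptions inferred from the descriptions, not the scanner's actual definitions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class BlockMetadata:
    index: int          # block position in file
    hash: str           # ssdeep fuzzy hash of the block
    start_offset: int   # start position in the file
    end_offset: int     # end position in the file
    data: bytes         # compressed raw block content


@dataclass
class FileMetadata:
    path: str           # original file location
    md5: str            # file hash used for deduplication
    size: int           # file size in bytes
    blocks: List[BlockMetadata] = field(default_factory=list)
    date_added: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


@dataclass
class MatchingRegion:
    source_start: int   # region boundaries in the database file
    source_end: int
    target_start: int   # region boundaries in the analyzed file
    target_end: int
    length: int         # total region size
    similarity: float   # average similarity score
    block_count: int    # number of blocks in the region
```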

Use Cases

Payload Analysis - Determine if a sample contains code from known tools
Attribution - Link malware samples to specific tool families or authors
Variant Detection - Find modified versions of existing tools in your collection
Code Reuse Research - Study how code gets reused across different tools and families

File Storage

Database Location: Utils\DoppelgangerDB\FuzzyHash\FuzzyHash.db
