FuzzyHash Scanner
BlackSnufkin edited this page Sep 6, 2025 · 1 revision
The FuzzyHash Scanner detects code reuse and similarities between files by analyzing their content at the block level using ssdeep fuzzy hashing.
The scanner builds a database of known files (tools, malware families, legitimate software) and compares new samples against this database to identify:
- Code Reuse - Detect when payloads contain code from known tools
- Tool Variants - Find modified versions of existing tools
- Similar Techniques - Identify files using similar implementation patterns
- Source Attribution - Link samples to known malware families or tool repositories
create_db_from_folder() builds the signature database:
- File Discovery - Recursively scans folder for files matching configured extensions (exe, dll, bin)
- Source Detection - Uses find_git_root() to identify Git repositories and extract remote URLs
- Block Processing - Calls _create_blocks() to split files into 4KB blocks with ssdeep hashes
- Storage - Compresses and stores all file metadata, block hashes, and raw block data
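The block-processing step above can be sketched as follows. This is a minimal illustration, not the scanner's actual _create_blocks() implementation: the function name create_blocks, the dictionary layout, and placeholder_hash are assumptions made so the example runs with the standard library alone. With the ssdeep Python package installed, ssdeep.hash could be passed as the hasher instead.

```python
import hashlib
from typing import Callable, Dict, List

BLOCK_SIZE = 4096  # 4KB blocks, as described above


def placeholder_hash(block: bytes) -> str:
    """Stand-in hasher (assumption) so the sketch needs no third-party package.

    The scanner itself uses ssdeep fuzzy hashes; with the `ssdeep` package
    installed, pass `ssdeep.hash` as `hasher` instead.
    """
    return hashlib.md5(block).hexdigest()


def create_blocks(data: bytes,
                  block_size: int = BLOCK_SIZE,
                  hasher: Callable[[bytes], str] = placeholder_hash) -> List[Dict]:
    """Split `data` into fixed-size blocks and record index, hash, and offsets."""
    blocks = []
    for index, start in enumerate(range(0, len(data), block_size)):
        chunk = data[start:start + block_size]
        blocks.append({
            "index": index,              # block position in the file
            "hash": hasher(chunk),       # per-block hash
            "start": start,              # inclusive start offset
            "end": start + len(chunk),   # exclusive end offset
        })
    return blocks
```

The final block may be shorter than 4KB when the file size is not a multiple of the block size; the sketch keeps it and records its true end offset.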
analyze_files() compares target files against the database:
- Target Processing - Creates the same 4KB block structure for target files
- Block Matching - Uses _compare_blocks() to find similarities with ssdeep comparison
- Region Detection - Groups consecutive matching blocks into regions
- Results Ranking - Returns top 3 matches per file with similarity percentages
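The region-detection and ranking steps above can be sketched as follows, assuming block matches arrive as (source block index, target block index, similarity) tuples. The names group_regions and top_matches, the tuple layout, and the output keys are hypothetical; the scanner's own data structures may differ.

```python
from typing import Dict, List, Tuple

# (source block index, target block index, similarity 0-100) - assumed layout
Match = Tuple[int, int, int]


def group_regions(matches: List[Match], block_size: int = 4096) -> List[Dict]:
    """Merge matches that advance by one block in both files into regions."""
    runs = []
    prev = None
    # Sorting by diagonal (target - source) keeps each run of consecutive
    # block pairs adjacent, even when other matches interleave with it.
    for src, tgt, score in sorted(matches, key=lambda m: (m[1] - m[0], m[0])):
        same_run = (prev is not None
                    and tgt - src == prev[1] - prev[0]
                    and src == prev[0] + 1)
        if same_run:
            runs[-1]["scores"].append(score)
        else:
            runs.append({"src": src, "tgt": tgt, "scores": [score]})
        prev = (src, tgt)

    regions = []
    for run in runs:
        count = len(run["scores"])
        regions.append({
            "source_offset": run["src"] * block_size,
            "target_offset": run["tgt"] * block_size,
            "length": count * block_size,  # assumes full-size blocks
            "similarity": sum(run["scores"]) / count,
            "block_count": count,
        })
    regions.sort(key=lambda r: (r["source_offset"], r["target_offset"]))
    return regions


def top_matches(file_scores: Dict[str, float], limit: int = 3) -> List[Tuple[str, float]]:
    """Return the `limit` highest-scoring database files, best first."""
    return sorted(file_scores.items(), key=lambda kv: kv[1], reverse=True)[:limit]
```

Grouping by diagonal rather than by raw index order is a design choice of this sketch: it lets a region shifted by a constant offset between the two files still be recognized as one contiguous run.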
Unlike traditional file hashing, this scanner works at the 4KB block level:
- Granular Detection - Finds partial matches even in heavily modified files
- Region Identification - Shows exactly which file segments are similar
- Content Preservation - Stores actual block data for hex/ASCII comparison
- Flexibility - Detects similarities regardless of file size differences
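To illustrate why block-level comparison still finds partial matches, here is a small sketch that compares every source block with every target block. It does not use ssdeep: standin_similarity is a deliberately swapped-in stand-in based on difflib.SequenceMatcher so the example runs with the standard library, and the threshold of 50 is an arbitrary assumption. With the ssdeep package, the blocks' fuzzy hashes would be compared with ssdeep.compare instead.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple


def split(data: bytes, block_size: int) -> List[bytes]:
    """Cut `data` into consecutive blocks of at most `block_size` bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def standin_similarity(a: bytes, b: bytes) -> int:
    """Return a 0-100 score. Stand-in for fuzzy-hash comparison (assumption)."""
    return round(SequenceMatcher(None, a, b, autojunk=False).ratio() * 100)


def compare_blocks(source_blocks: List[bytes],
                   target_blocks: List[bytes],
                   similarity: Callable[[bytes, bytes], int] = standin_similarity,
                   threshold: int = 50) -> List[Tuple[int, int, int]]:
    """Return (source index, target index, score) for pairs at or above threshold."""
    matches = []
    for s_idx, s_block in enumerate(source_blocks):
        for t_idx, t_block in enumerate(target_blocks):
            score = similarity(s_block, t_block)
            if score >= threshold:
                matches.append((s_idx, t_idx, score))
    return matches


# A target whose middle block was replaced still matches on the other two.
source = b"A" * 64 + b"B" * 64 + b"C" * 64
target = b"A" * 64 + b"Z" * 64 + b"C" * 64
matches = compare_blocks(split(source, 64), split(target, 64))
```

A whole-file hash would treat source and target here as simply different, while the per-block view reports which segments still agree.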
find_git_root() organizes files by source:
- Parses .git/config to extract remote repository URLs
- Converts SSH URLs to HTTPS format for consistency
- Groups files by their origin (GitHub repos, private collections)
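The source-detection steps above might look roughly like this. It is a sketch under assumptions, not the scanner's find_git_root(): the helper names find_repo_root, read_remote_url, and ssh_to_https are hypothetical, the config parser handles only the common `[remote "origin"]` layout, and the URL conversion covers only the scp-style and ssh:// forms.

```python
import re
from pathlib import Path
from typing import Optional

# scp-style remote such as git@github.com:user/repo.git
SCP_STYLE = re.compile(r"^(?:[\w.-]+@)?([\w.-]+):(?!//)(.+)$")


def find_repo_root(path: Path) -> Optional[Path]:
    """Walk upward from `path` until a directory containing `.git` is found."""
    path = path.resolve()
    for candidate in [path, *path.parents]:
        if (candidate / ".git").is_dir():
            return candidate
    return None


def read_remote_url(config_text: str, remote: str = "origin") -> Optional[str]:
    """Extract the `url` value of the given remote from .git/config text."""
    in_remote = False
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("["):
            in_remote = line == f'[remote "{remote}"]'
        elif in_remote:
            key, _, value = line.partition("=")
            if key.strip() == "url":
                return value.strip()
    return None


def ssh_to_https(url: str) -> str:
    """Convert an SSH remote URL to HTTPS; other URLs are returned unchanged."""
    url = url.strip()
    if url.startswith("ssh://"):
        rest = url[len("ssh://"):].split("@", 1)[-1]  # drop the user part
        host, _, repo_path = rest.partition("/")
        host = host.split(":", 1)[0]                  # drop an explicit port
        return f"https://{host}/{repo_path}"
    match = SCP_STYLE.match(url)
    if match and "://" not in url:
        return f"https://{match.group(1)}/{match.group(2)}"
    return url
```

For example, ssh_to_https("git@github.com:user/repo.git") yields "https://github.com/user/repo.git", so files cloned over SSH and over HTTPS end up grouped under the same source.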
All data is stored efficiently:
- Uses zlib compression on the entire database
- Shortens JSON keys to minimize storage size
- Only decompresses block data when needed for display
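The storage approach above (short JSON keys plus zlib over the whole database) can be sketched like this. The key mapping in SHORT_KEYS and the function names are hypothetical, since the actual short key names are not listed here; only top-level keys are shortened in this sketch.

```python
import json
import zlib
from typing import Dict, List

# Hypothetical long-to-short key mapping (assumption, not the real schema).
SHORT_KEYS = {"path": "p", "md5": "m", "size": "s", "blocks": "b", "date_added": "d"}
LONG_KEYS = {short: long for long, short in SHORT_KEYS.items()}


def pack_database(entries: List[Dict]) -> bytes:
    """Shorten keys, serialize to compact JSON, and zlib-compress the result."""
    compact = [{SHORT_KEYS.get(k, k): v for k, v in entry.items()} for entry in entries]
    raw = json.dumps(compact, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw)


def unpack_database(blob: bytes) -> List[Dict]:
    """Reverse pack_database: decompress, parse, and restore the long keys."""
    compact = json.loads(zlib.decompress(blob).decode("utf-8"))
    return [{LONG_KEYS.get(k, k): v for k, v in entry.items()} for entry in compact]
```

Short keys matter mostly before compression; zlib already removes much of the repetition, so the two measures overlap, but shorter keys also reduce the size of the JSON that has to be parsed after decompression.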
Each file entry in the database records:
- Path - Original file location
- MD5 - File hash for deduplication
- Size - File size in bytes
- Blocks - Array of BlockMetadata objects
- Date Added - Timestamp of database entry
Each BlockMetadata object records:
- Index - Block position in file
- Hash - ssdeep fuzzy hash
- Offsets - Start and end positions
- Data - Compressed raw block content
Each matching region records:
- Source/Target Offsets - Region boundaries in both files
- Length - Total region size
- Similarity - Average similarity score
- Block Count - Number of blocks in region
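The three record types listed above could be modeled with dataclasses as shown below. Only the name BlockMetadata appears in this page; FileEntry and MatchRegion, along with every field name and type, are assumptions chosen to mirror the listed fields.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BlockMetadata:
    index: int   # block position in the file
    hash: str    # ssdeep fuzzy hash of the block
    start: int   # start offset in the file
    end: int     # end offset in the file
    data: bytes  # compressed raw block content


@dataclass
class FileEntry:  # hypothetical name
    path: str        # original file location
    md5: str         # file hash used for deduplication
    size: int        # file size in bytes
    date_added: str  # timestamp of the database entry
    blocks: List[BlockMetadata] = field(default_factory=list)


@dataclass
class MatchRegion:  # hypothetical name
    source_start: int
    source_end: int
    target_start: int
    target_end: int
    similarity: float  # average similarity across the region's blocks
    block_count: int

    @property
    def length(self) -> int:
        """Total region size, measured on the source side."""
        return self.source_end - self.source_start
```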
Typical use cases:
- Payload Analysis - Determine if a sample contains code from known tools
- Attribution - Link malware samples to specific tool families or authors
- Variant Detection - Find modified versions of existing tools in your collection
- Code Reuse Research - Study how code gets reused across different tools and families
Database Location: Utils\DoppelgangerDB\FuzzyHash\FuzzyHash.db