Dual-Core Security Auditor: Deep Content Inspection (PII) & Metadata Forensics
CutterFish is a modular, open-source security auditing tool that detects sensitive data leaks (PII) across local file systems. It is designed with a Dual-Scan Engine capable of analyzing both visible file content and hidden metadata structures.
⚠️ This project is in active early development (v0.1). The Phase 1 Core Engine is functional. Contributions, issues, and feedback are welcome.
- Key Capabilities
- Getting Started
- Usage
- Architecture
- Data Detection Master List
- Known Limitations
- Roadmap
- Contributing
- License
| Mode | Description | Status |
|---|---|---|
| 🔍 Deep Content Scan | Regex and keyword matching to find PII (DNI, SSN, API keys, passwords...) inside text, logs, source code, PDF and DOCX files. | ✅ Available |
| 🕵️ Metadata Forensics | Extracts hidden data from binary files: GPS in images (EXIF), author history in Office/PDF documents, software traces. | Planned |
| ✂️ PII Redaction | Automatically censors or removes detected sensitive data directly from the source files, with a configurable redaction strategy (mask, delete, replace). | Planned |
| 🌐 Hybrid Mode | Simultaneous content + metadata audit for a 360° security overview. | Planned |
- Java 21+ (download from Adoptium if not installed)
**Linux / Mac / Kali:**

```bash
git clone https://github.com/valerodev/cutterfish.git
cd cutterfish
sudo ./install.sh
```

**Windows:**

```bat
git clone https://github.com/valerodev/cutterfish.git
cd cutterfish
install.bat
```

The installer checks your Java version, copies the JAR to the right location, and registers the `cutterfish` command system-wide. After installation, open a new terminal and run it from anywhere:

```bash
cutterfish
```

If you prefer not to install, you can also run the JAR directly:

```bash
java -jar cutterfish.jar
```
Requires Maven 3.8+:

```bash
mvn clean package
# Output: target/cutterfish.jar
```

After installation, run from anywhere:

```bash
cutterfish
```

You will be prompted for a target directory and a compliance standard:

```
========================================
          🦑 CUTTERFISH v0.1
========================================
Enter target directory path: /path/to/scan
Select compliance standard [EU / US]: EU
```
```
==================================================
SENSITIVE DATA FOUND
==================================================
[HIGH]   API_KEY     | Line 3  | .env         | apikey=xK92mPqZ8nR3...
[HIGH]   PASSWORD    | Line 9  | config.yml   | password=s3cr3tPass! (x2)
[HIGH]   DNI         | Line 5  | contract.pdf | 74030258C
[HIGH]   CREDIT_CARD | Line 7  | report.log   | 4539 1488 0343 6467
[MEDIUM] EMAIL       | Line 12 | users.csv    | admin@example.com (x5)
[MEDIUM] PHONE       | Line 44 | contacts.txt | 644 557 788
==================================================
SCAN RESULTS SUMMARY
==================================================
Files scanned: 134        Duration: 3s
Files with findings: 4    High confidence: 10
Total findings: 18        Medium confidence: 8

FINDINGS BY TYPE:
API_KEY: 1 | PASSWORD: 2 | DNI: 1 | CREDIT_CARD: 1 | EMAIL: 9 | PHONE: 4
```
```
cutterfish/
├── cutterfish.jar
├── install.sh
├── install.bat
├── pom.xml
├── README.md
└── src/
    └── main/
        └── java/
            └── cutterfish/
                ├── Main.java
                ├── features/
                │   ├── scanning/
                │   │   ├── ScanManager.java
                │   │   ├── ScanResults.java
                │   │   ├── ScanException.java
                │   │   ├── FileCrawler.java
                │   │   ├── analysis/
                │   │   │   ├── PiiAnalyzer.java
                │   │   │   ├── EngineFactory.java
                │   │   │   ├── FindsValidator.java
                │   │   │   ├── AnalysisException.java
                │   │   │   └── EngineException.java
                │   │   ├── conversion/
                │   │   │   ├── TextExtractor.java
                │   │   │   ├── AbstractTextExtractor.java
                │   │   │   ├── TextExtractorFactory.java
                │   │   │   ├── PlainTextExtractor.java
                │   │   │   ├── PdfExtractor.java
                │   │   │   ├── DocxExtractor.java
                │   │   │   └── ExtractionException.java
                │   │   └── metadata/
                │   ├── reporting/
                │   │   ├── ConsolePrinter.java
                │   │   └── ReportGenerator.java
                │   └── ui/
                │       └── CommandLineInterface.java
                └── shared/
                    └── models/
                        ├── ScanContext.java
                        └── SensitiveData.java
```
| Class | Role |
|---|---|
| `ScanManager` | Central orchestrator: coordinates crawling, extraction, analysis and reporting |
| `FileCrawler` | Recursive filesystem traversal with fail-loud error propagation |
| `TextExtractorFactory` | Selects the right extractor (plain text / PDF / DOCX) per file type |
| `PiiAnalyzer` | Runs regex patterns and assigns confidence scores to each match |
| `EngineFactory` | Loads the correct pattern set for EU (GDPR) or US (HIPAA/CCPA) |
| `FindsValidator` | Algorithmic validation: DNI mod 23, Luhn + IIN prefix, IBAN mod 97 |
| `ConsolePrinter` | Formats and sorts findings by criticality for terminal output |
| `ScanResults` | Immutable result container with deduplication and occurrence counting |
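To make the orchestration concrete, here is a minimal, self-contained sketch of the scan pipeline the classes above implement (crawl → extract → analyze → report). Only the class roles come from the project; the method names, the `Finding` record, and the toy logic are illustrative assumptions, not CutterFish's real API.

```java
import java.util.List;

// Hypothetical sketch of the ScanManager pipeline. Each static method stands
// in for one collaborator: crawl = FileCrawler, extract = TextExtractorFactory,
// analyze = PiiAnalyzer, and main plays the ScanManager/ConsolePrinter roles.
public class PipelineSketch {
    record Finding(String type, String file, int line, String match) {}

    // FileCrawler role: walk the target directory (stubbed with one file here).
    static List<String> crawl(String root) {
        return List.of(root + "/config.yml");
    }

    // TextExtractorFactory role: pick an extractor and return plain text.
    static String extract(String file) {
        return "password=s3cr3tPass!";
    }

    // PiiAnalyzer role: run patterns over the extracted text.
    static List<Finding> analyze(String file, String text) {
        return text.contains("password=")
                ? List.of(new Finding("PASSWORD", file, 1, text))
                : List.of();
    }

    public static void main(String[] args) {
        // ScanManager role: wire the stages together, then print findings.
        for (String file : crawl("/path/to/scan"))
            for (Finding f : analyze(file, extract(file)))
                System.out.println("[HIGH] " + f.type() + " | " + f.file());
    }
}
```

The real implementation adds error propagation (typed exceptions per layer) and result deduplication via `ScanResults`, which this sketch omits.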
- **SRP** – each class has one reason to change. Extraction, analysis, validation, and presentation are fully decoupled.
- **OCP** – new validators register via `FindsValidator.registerValidator()` without modifying existing code. New extractors implement `TextExtractor` and register in `TextExtractorFactory`.
- **Fail-loud** – `FileCrawler` throws `ScanException` on invalid paths instead of silently returning empty results. Each layer has its own typed exception.
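As a sketch of the register-don't-modify pattern behind `FindsValidator.registerValidator()`, here is a minimal standalone registry. The real signature may differ; the class name `ValidatorRegistry` and the `Predicate<String>` shape are assumptions. The DNI rule itself (8 digits plus a control letter taken from `TRWAGMYFPDXBNJZSQVHLCKE` at index number mod 23) is the standard Spanish algorithm.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical stand-in for FindsValidator's extension point: new data types
// plug in via registerValidator(); existing code is never edited (OCP).
public class ValidatorRegistry {
    private static final Map<String, Predicate<String>> VALIDATORS = new HashMap<>();

    public static void registerValidator(String type, Predicate<String> check) {
        VALIDATORS.put(type, check);
    }

    // Types without a registered validator pass through unvalidated.
    public static boolean isValid(String type, String value) {
        return VALIDATORS.getOrDefault(type, v -> true).test(value);
    }

    public static void main(String[] args) {
        // Spanish DNI: 8 digits + control letter = table[number mod 23].
        registerValidator("DNI", v -> v.matches("\\d{8}[A-Z]")
                && v.charAt(8) == "TRWAGMYFPDXBNJZSQVHLCKE".charAt(
                        Integer.parseInt(v.substring(0, 8)) % 23));
        System.out.println(isValid("DNI", "74030258C")); // valid control letter
    }
}
```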
| Region | Data Type | Priority | Status |
|---|---|---|---|
| EU | DNI / NIE | 🔴 | ✅ Core Engine + Algorithmic validation |
| US | Social Security Number (SSN) | 🔴 | ✅ Core Engine |
| Global | Digital Signatures | 🔴 | ✅ Core Engine |
| EU | NIF / CIF / NUSS | 🟡 | Planned |
| EU | Passport | 🟡 | Planned |
| US | Driver's License | 🟡 | Planned |
| US | Tax ID (TIN / EIN) | 🟡 | Planned |
| Region | Data Type | Priority | Status |
|---|---|---|---|
| Global | API Keys | 🔴 | ✅ Core Engine |
| Global | SSH Private Keys | 🔴 | ✅ Core Engine |
| Global | Passwords (assignment patterns, min. 6 chars) | 🔴 | ✅ Keyword Engine |
| Global | Connection Strings / DB URLs | 🔴 | ✅ Pattern Matcher |
| Global | Session Tokens | 🟡 | Planned |
| Region | Data Type | Priority | Status |
|---|---|---|---|
| Global | Card Number (PAN) | 🔴 | ✅ Core Engine + Luhn + IIN prefix validation |
| Global | CVV / CVC | 🔴 | ✅ Core Engine |
| EU | IBAN | 🟡 | ✅ Core Engine + Mod 97 validation |
| US | Bank Routing Number | 🟡 | ✅ Core Engine |
| EU | SWIFT | 🟡 | Planned |
| Global | Crypto Wallet Addresses | 🟡 | Planned |
| Region | Data Type | Priority | Status |
|---|---|---|---|
| US | Health Insurance / Policy IDs | 🔴 | ✅ Core Engine |
| Global | Biometric Data (binary headers) | 🔴 | Planned |
| EU | Medical History patterns | 🔴 | Planned |
| US | Lab Results | 🟡 | Planned |
| Region | Data Type | Priority | Status |
|---|---|---|---|
| EU | Spanish Phone | 🟡 | ✅ Core Engine + Format validation |
| US | US Phone | 🟡 | ✅ Core Engine |
| Global | Email | 🟡 | ✅ Core Engine + Format validation |
| Global | IP / MAC Address | 🔵 | Planned |
| Global | Geolocation (EXIF) | 🔵 | Planned |
| Global | Physical Address | 🔵 | Planned |
These are known sources of false positives in v0.1 that will be addressed in Phase 1.5:
- **PHONE** – 9-digit sequences in binary logs (e.g. VirtualBox `.log` files) or sequential ID lists can match the Spanish phone pattern. Without surrounding context, these are indistinguishable from real phone numbers at the regex level.
- **CREDIT_CARD** – numbers that pass the Luhn algorithm by statistical chance may still appear, particularly in networking textbooks that use binary sequences as examples. IIN prefix validation (cards must start with 3, 4, 5 or 6) significantly reduces this but does not eliminate it entirely.
- **No whitelist support yet** – there is currently no way to ignore known safe values or specific files. This is the top priority for Phase 1.5.
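The CREDIT_CARD limitation follows directly from the math: the Luhn check discards roughly 9 in 10 random digit strings, so about 10% of arbitrary 16-digit sequences still pass. A minimal sketch of the Luhn + IIN prefix combination described above (class and method names here are illustrative, not the project's `FindsValidator` API):

```java
// Sketch of the two-stage card validation: an IIN prefix filter followed by
// the Luhn checksum. Random digit runs that survive both stages are the
// residual false positives described in the limitations above.
public class CardCheck {
    // Luhn checksum: double every second digit from the right, subtract 9
    // from any doubled digit above 9, and require the total to be 0 mod 10.
    static boolean luhn(String digits) {
        int sum = 0;
        boolean doubleIt = false;
        for (int i = digits.length() - 1; i >= 0; i--) {
            int d = digits.charAt(i) - '0';
            if (doubleIt) { d *= 2; if (d > 9) d -= 9; }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }

    static boolean looksLikeCard(String raw) {
        String digits = raw.replaceAll("[\\s-]", "");  // strip spaces/dashes
        if (!digits.matches("\\d{13,19}")) return false;
        char iin = digits.charAt(0);
        if (iin < '3' || iin > '6') return false;      // major card networks start 3-6
        return luhn(digits);
    }

    public static void main(String[] args) {
        System.out.println(looksLikeCard("4539 1488 0343 6467")); // true
    }
}
```

`4539 1488 0343 6467` (the sample finding above) passes both stages, while a sequence starting with 1 or 2 is rejected by the prefix filter before Luhn ever runs.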
- Regex engine for High Priority PII (EU & US standards)
- Algorithmic validation (Luhn + IIN prefix, DNI mod23, IBAN mod97)
- Synthetic DNI filter (sequential, alternating and repeated-digit patterns)
- PDF and DOCX text extraction (PDFBox 3.x, Apache POI 5.x)
- Duplicate finding consolidation with occurrence counter
- Findings sorted by criticality type, then file and line
- Formatted console output with HIGH / MEDIUM confidence scoring
- Progress indicator for large directories
- Multithreaded file processing
- Whitelist / ignore patterns (values and files)
- Configurable file size limits
- Metadata Forensics: EXIF GPS, Office/PDF author history
- PII Redaction: mask, delete or replace sensitive findings directly in source files
- JSON and PDF report export (`ReportGenerator`)
- Custom patterns via JSON/YAML
- Extended pattern library (NIF, CIF, crypto wallets...)
- REST API for CI/CD pipeline integration
- Quarantine system: isolate high-risk files
- Cloud storage support (S3, GDrive)
Contributions are very welcome. Please open an issue before submitting a PR so we can discuss the change.
```bash
# 1. Fork the repo

# 2. Create your branch
git checkout -b feature/my-new-pattern

# 3. Commit your changes
git commit -m "feat: add crypto wallet detection pattern"

# 4. Push and open a PR
git push origin feature/my-new-pattern
```

New pattern validators should be added via `FindsValidator.registerValidator()` and new extractors by implementing `TextExtractor`: no existing classes need to be modified.
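As a hypothetical illustration of the extractor extension point: the real `TextExtractor` interface lives in `features/scanning/conversion` and its actual methods may differ, so the two-method contract below is an assumption used only to show the intended shape of a contribution.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Assumed contract: report which extensions you handle, return plain text.
interface TextExtractor {
    boolean supports(String extension);
    String extract(Path file) throws IOException;
}

// Example contribution: a Markdown extractor. Markdown is already plain
// text, so extraction is a straight read; binary formats (PDF, DOCX) would
// delegate to a parsing library here instead.
class MarkdownExtractor implements TextExtractor {
    @Override
    public boolean supports(String extension) {
        return extension.equalsIgnoreCase("md");
    }

    @Override
    public String extract(Path file) throws IOException {
        return Files.readString(file);
    }
}
```

The new class would then be registered in `TextExtractorFactory`, leaving the existing extractors untouched.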
Distributed under the Apache 2.0 License. See LICENSE for more information.