Dual-Core Security Auditor: Deep Content Inspection (PII) & Metadata Forensics
CutterFish is a modular, open-source security auditing tool that detects sensitive data leaks (PII) across local file systems. It is designed with a Dual-Scan Engine capable of analyzing both visible file content and hidden metadata structures.
⚠️ This project is in active early development (v0.1). The Phase 1 Core Engine is functional. Contributions, issues, and feedback are welcome.
- Key Capabilities
- Getting Started
- Usage
- Architecture
- Data Detection Master List
- Known Limitations
- Roadmap
- Contributing
- License
| Mode | Description | Status |
|---|---|---|
| 🔍 Deep Content Scan | Regex and keyword matching to find PII (DNI, SSN, API keys, passwords...) inside text, logs, source code, PDF and DOCX files. | ✅ Available |
| 🕵️ Metadata Forensics | Extracts hidden data from binary files: GPS in images (EXIF), author history in Office/PDF documents, software traces. | Planned |
| ✂️ PII Redaction | Automatically censors or removes detected sensitive data directly from the source files, with a configurable redaction strategy (mask, delete, replace). | Planned |
| 🌐 Hybrid Mode | Simultaneous content + metadata audit for a 360° security overview. | Planned |
- Java 21+ (download from Adoptium if not installed)
**Linux / Mac / Kali:**

```bash
git clone https://github.com/valerodev/cutterfish.git
cd cutterfish
sudo ./install.sh
```

**Windows:**

```bat
git clone https://github.com/valerodev/cutterfish.git
cd cutterfish
install.bat
```

The installer checks your Java version, copies the JAR to the right location, and registers the `cutterfish` command system-wide. After installation, open a new terminal and run it from anywhere:

```bash
cutterfish
```

If you prefer not to install, you can also run the JAR directly:

```bash
java -jar cutterfish.jar
```
Requires Maven 3.8+:

```bash
mvn clean package
# Output: target/cutterfish.jar
```

After installation, run from anywhere:

```bash
cutterfish
```

You will be prompted for a target directory and a compliance standard:

```
========================================
          🦑 CUTTERFISH v0.1
========================================
Enter target directory path: /path/to/scan
Select compliance standard [EU / US]: EU
```
```
==================================================
SENSITIVE DATA FOUND
==================================================
[HIGH]   API_KEY     | Line 3  | .env         | apikey=xK92mPqZ8nR3...
[HIGH]   PASSWORD    | Line 9  | config.yml   | password=s3cr3tPass! (x2)
[HIGH]   DNI         | Line 5  | contract.pdf | 74030258C
[HIGH]   CREDIT_CARD | Line 7  | report.log   | 4539 1488 0343 6467
[MEDIUM] EMAIL       | Line 12 | users.csv    | admin@example.com (x5)
[MEDIUM] PHONE       | Line 44 | contacts.txt | 644 557 788
==================================================
SCAN RESULTS SUMMARY
==================================================
Files scanned: 134        Duration: 3s
Files with findings: 4    High confidence: 10
Total findings: 18        Medium confidence: 8

FINDINGS BY TYPE:
API_KEY: 1 | PASSWORD: 2 | DNI: 1 | CREDIT_CARD: 1 | EMAIL: 9 | PHONE: 4
```
```
cutterfish/
├── cutterfish.jar
├── install.sh
├── install.bat
├── pom.xml
├── README.md
└── src/
    └── main/
        └── java/
            └── cutterfish/
                ├── Main.java
                ├── features/
                │   ├── scanning/
                │   │   ├── ScanManager.java
                │   │   ├── ScanResults.java
                │   │   ├── ScanException.java
                │   │   ├── FileCrawler.java
                │   │   ├── analysis/
                │   │   │   ├── PiiAnalyzer.java
                │   │   │   ├── EngineFactory.java
                │   │   │   ├── FindsValidator.java
                │   │   │   ├── AnalysisException.java
                │   │   │   └── EngineException.java
                │   │   ├── conversion/
                │   │   │   ├── TextExtractor.java
                │   │   │   ├── AbstractTextExtractor.java
                │   │   │   ├── TextExtractorFactory.java
                │   │   │   ├── PlainTextExtractor.java
                │   │   │   ├── PdfExtractor.java
                │   │   │   ├── DocxExtractor.java
                │   │   │   └── ExtractionException.java
                │   │   └── metadata/
                │   ├── reporting/
                │   │   ├── ConsolePrinter.java
                │   │   └── ReportGenerator.java
                │   └── ui/
                │       └── CommandLineInterface.java
                └── shared/
                    └── models/
                        ├── ScanContext.java
                        └── SensitiveData.java
```
| Class | Role |
|---|---|
| `ScanManager` | Central orchestrator: coordinates crawling, extraction, analysis and reporting |
| `FileCrawler` | Recursive filesystem traversal with fail-loud error propagation |
| `TextExtractorFactory` | Selects the right extractor (plain text / PDF / DOCX) per file type |
| `PiiAnalyzer` | Runs regex patterns and assigns confidence scores to each match |
| `EngineFactory` | Loads the correct pattern set for EU (GDPR) or US (HIPAA/CCPA) |
| `FindsValidator` | Algorithmic validation: DNI mod 23, Luhn + IIN prefix, IBAN mod 97 |
| `ConsolePrinter` | Formats and sorts findings by criticality for terminal output |
| `ScanResults` | Immutable result container with deduplication and occurrence counting |
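To make the orchestration concrete, here is a minimal, self-contained sketch of the scan pipeline the classes above implement (crawl → extract → analyze → report). Only the class roles come from the project; the method names, the `Finding` record, and the toy logic are illustrative assumptions, not CutterFish's real API.

```java
import java.util.List;

// Hypothetical sketch of the ScanManager pipeline. Each static method stands
// in for one collaborator: crawl = FileCrawler, extract = TextExtractorFactory,
// analyze = PiiAnalyzer, and main plays the ScanManager/ConsolePrinter roles.
public class PipelineSketch {
    record Finding(String type, String file, int line, String match) {}

    // FileCrawler role: walk the target directory (stubbed with one file here).
    static List<String> crawl(String root) {
        return List.of(root + "/config.yml");
    }

    // TextExtractorFactory role: pick an extractor and return plain text.
    static String extract(String file) {
        return "password=s3cr3tPass!";
    }

    // PiiAnalyzer role: run patterns over the extracted text.
    static List<Finding> analyze(String file, String text) {
        return text.contains("password=")
                ? List.of(new Finding("PASSWORD", file, 1, text))
                : List.of();
    }

    public static void main(String[] args) {
        // ScanManager role: wire the stages together, then print findings.
        for (String file : crawl("/path/to/scan"))
            for (Finding f : analyze(file, extract(file)))
                System.out.println("[HIGH] " + f.type() + " | " + f.file());
    }
}
```

The real implementation adds error propagation (typed exceptions per layer) and result deduplication via `ScanResults`, which this sketch omits.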
- **SRP** – each class has one reason to change. Extraction, analysis, validation, and presentation are fully decoupled.
- **OCP** – new validators register via `FindsValidator.registerValidator()` without modifying existing code. New extractors implement `TextExtractor` and register in `TextExtractorFactory`.
- **Fail-loud** – `FileCrawler` throws `ScanException` on invalid paths instead of silently returning empty results. Each layer has its own typed exception.
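As a sketch of the register-don't-modify pattern behind `FindsValidator.registerValidator()`, here is a minimal standalone registry. The real signature may differ; the class name `ValidatorRegistry` and the `Predicate<String>` shape are assumptions. The DNI rule itself (8 digits plus a control letter taken from `TRWAGMYFPDXBNJZSQVHLCKE` at index number mod 23) is the standard Spanish algorithm.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical stand-in for FindsValidator's extension point: new data types
// plug in via registerValidator(); existing code is never edited (OCP).
public class ValidatorRegistry {
    private static final Map<String, Predicate<String>> VALIDATORS = new HashMap<>();

    public static void registerValidator(String type, Predicate<String> check) {
        VALIDATORS.put(type, check);
    }

    // Types without a registered validator pass through unvalidated.
    public static boolean isValid(String type, String value) {
        return VALIDATORS.getOrDefault(type, v -> true).test(value);
    }

    public static void main(String[] args) {
        // Spanish DNI: 8 digits + control letter = table[number mod 23].
        registerValidator("DNI", v -> v.matches("\\d{8}[A-Z]")
                && v.charAt(8) == "TRWAGMYFPDXBNJZSQVHLCKE".charAt(
                        Integer.parseInt(v.substring(0, 8)) % 23));
        System.out.println(isValid("DNI", "74030258C")); // valid control letter
    }
}
```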
| Region | Data Type | Priority | Status |
|---|---|---|---|
| EU | DNI / NIE | 🔴 | ✅ Core Engine + Algorithmic validation |
| US | Social Security Number (SSN) | 🔴 | ✅ Core Engine |
| Global | Digital Signatures | 🔴 | ✅ Core Engine |
| EU | NIF / CIF / NUSS | 🟡 | Planned |
| EU | Passport | 🟡 | Planned |
| US | Driver's License | 🟡 | Planned |
| US | Tax ID (TIN / EIN) | 🟡 | Planned |
| Region | Data Type | Priority | Status |
|---|---|---|---|
| Global | API Keys | 🔴 | ✅ Core Engine |
| Global | SSH Private Keys | 🔴 | ✅ Core Engine |
| Global | Passwords (assignment patterns, min. 6 chars) | 🔴 | ✅ Keyword Engine |
| Global | Connection Strings / DB URLs | 🔴 | ✅ Pattern Matcher |
| Global | Session Tokens | 🟡 | Planned |
| Region | Data Type | Priority | Status |
|---|---|---|---|
| Global | Card Number (PAN) | 🔴 | ✅ Core Engine + Luhn + IIN prefix validation |
| Global | CVV / CVC | 🔴 | ✅ Core Engine |
| EU | IBAN | 🟡 | ✅ Core Engine + Mod 97 validation |
| US | Bank Routing Number | 🟡 | ✅ Core Engine |
| EU | SWIFT | 🟡 | Planned |
| Global | Crypto Wallet Addresses | 🟡 | Planned |
| Region | Data Type | Priority | Status |
|---|---|---|---|
| US | Health Insurance / Policy IDs | 🔴 | ✅ Core Engine |
| Global | Biometric Data (binary headers) | 🔴 | Planned |
| EU | Medical History patterns | 🔴 | Planned |
| US | Lab Results | 🟡 | Planned |
| Region | Data Type | Priority | Status |
|---|---|---|---|
| EU | Spanish Phone | 🟡 | ✅ Core Engine + Format validation |
| US | US Phone | 🟡 | ✅ Core Engine |
| Global | Email | 🟡 | ✅ Core Engine + Format validation |
| Global | IP / MAC Address | 🔵 | Planned |
| Global | Geolocation (EXIF) | 🔵 | Planned |
| Global | Physical Address | 🔵 | Planned |
These are known sources of false positives in v0.1 that will be addressed in Phase 1.5:
- **PHONE** – 9-digit sequences in binary logs (e.g. VirtualBox `.log` files) or sequential ID lists can match the Spanish phone pattern. Without surrounding context, these are indistinguishable from real phone numbers at the regex level.
- **CREDIT_CARD** – numbers that pass the Luhn algorithm by statistical chance may still appear, particularly in networking textbooks that use binary sequences as examples. IIN prefix validation (cards must start with 3, 4, 5 or 6) significantly reduces this but does not eliminate it entirely.
- **No whitelist support yet** – there is currently no way to ignore known safe values or specific files. This is the top priority for Phase 1.5.
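The CREDIT_CARD limitation follows directly from the math: the Luhn check discards roughly 9 in 10 random digit strings, so about 10% of arbitrary 16-digit sequences still pass. A minimal sketch of the Luhn + IIN prefix combination described above (class and method names here are illustrative, not the project's `FindsValidator` API):

```java
// Sketch of the two-stage card validation: an IIN prefix filter followed by
// the Luhn checksum. Random digit runs that survive both stages are the
// residual false positives described in the limitations above.
public class CardCheck {
    // Luhn checksum: double every second digit from the right, subtract 9
    // from any doubled digit above 9, and require the total to be 0 mod 10.
    static boolean luhn(String digits) {
        int sum = 0;
        boolean doubleIt = false;
        for (int i = digits.length() - 1; i >= 0; i--) {
            int d = digits.charAt(i) - '0';
            if (doubleIt) { d *= 2; if (d > 9) d -= 9; }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }

    static boolean looksLikeCard(String raw) {
        String digits = raw.replaceAll("[\\s-]", "");  // strip spaces/dashes
        if (!digits.matches("\\d{13,19}")) return false;
        char iin = digits.charAt(0);
        if (iin < '3' || iin > '6') return false;      // major card networks start 3-6
        return luhn(digits);
    }

    public static void main(String[] args) {
        System.out.println(looksLikeCard("4539 1488 0343 6467")); // true
    }
}
```

`4539 1488 0343 6467` (the sample finding above) passes both stages, while a sequence starting with 1 or 2 is rejected by the prefix filter before Luhn ever runs.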
- Regex engine for High Priority PII (EU & US standards)
- Algorithmic validation (Luhn + IIN prefix, DNI mod23, IBAN mod97)
- Synthetic DNI filter (sequential, alternating and repeated-digit patterns)
- PDF and DOCX text extraction (PDFBox 3.x, Apache POI 5.x)
- Duplicate finding consolidation with occurrence counter
- Findings sorted by criticality type, then file and line
- Formatted console output with HIGH / MEDIUM confidence scoring
- Progress indicator for large directories
- Multithreaded file processing
- Whitelist / ignore patterns (values and files)
- Configurable file size limits
- Metadata Forensics: EXIF GPS, Office/PDF author history
- PII Redaction: mask, delete or replace sensitive findings directly in source files
- JSON and PDF report export (`ReportGenerator`)
- Custom patterns via JSON/YAML
- Extended pattern library (NIF, CIF, crypto wallets...)
- REST API for CI/CD pipeline integration
- Quarantine system: isolate high-risk files
- Cloud storage support (S3, GDrive)
Contributions are very welcome. Please open an issue before submitting a PR so we can discuss the change.
```bash
# 1. Fork the repo

# 2. Create your branch
git checkout -b feature/my-new-pattern

# 3. Commit your changes
git commit -m "feat: add crypto wallet detection pattern"

# 4. Push and open a PR
git push origin feature/my-new-pattern
```

New pattern validators should be added via `FindsValidator.registerValidator()` and new extractors by implementing `TextExtractor`: no existing classes need to be modified.
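As a hypothetical illustration of the extractor extension point: the real `TextExtractor` interface lives in `features/scanning/conversion` and its actual methods may differ, so the two-method contract below is an assumption used only to show the intended shape of a contribution.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Assumed contract: report which extensions you handle, return plain text.
interface TextExtractor {
    boolean supports(String extension);
    String extract(Path file) throws IOException;
}

// Example contribution: a Markdown extractor. Markdown is already plain
// text, so extraction is a straight read; binary formats (PDF, DOCX) would
// delegate to a parsing library here instead.
class MarkdownExtractor implements TextExtractor {
    @Override
    public boolean supports(String extension) {
        return extension.equalsIgnoreCase("md");
    }

    @Override
    public String extract(Path file) throws IOException {
        return Files.readString(file);
    }
}
```

The new class would then be registered in `TextExtractorFactory`, leaving the existing extractors untouched.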
Distributed under the Apache 2.0 License. See LICENSE for more information.