CutterFish 🦑✂️

Dual-Core Security Auditor: Deep Content Inspection (PII) & Metadata Forensics

CutterFish is a modular, open-source security auditing tool that detects sensitive data leaks (PII) across local file systems. It is designed around a Dual-Scan Engine that targets both visible file content and hidden metadata structures; in v0.1 only content scanning is available, and metadata analysis is planned.

⚠️ This project is in active early development (v0.1). The Phase 1 Core Engine is functional. Contributions, issues, and feedback are welcome.


🚀 Key Capabilities

| Mode | Description | Status |
| --- | --- | --- |
| 🔍 Deep Content Scan | Regex and keyword matching to find PII (DNI, SSN, API keys, passwords...) inside text, logs, source code, PDFs and DOCX files. | ✅ Available |
| 🕵️ Metadata Forensics | Extracts hidden data from binary files: GPS in images (EXIF), author history in Office/PDF documents, software traces. | Planned |
| ✂️ PII Redaction | Automatically censors or removes detected sensitive data directly from the source files, with a configurable redaction strategy (mask, delete, replace). | Planned |
| 🌗 Hybrid Mode | Simultaneous content + metadata audit for a 360° security overview. | Planned |

🛠 Getting Started

Prerequisites

Installation

Linux / Mac / Kali:

git clone https://github.com/valerodev/cutterfish.git
cd cutterfish
sudo ./install.sh

Windows:

git clone https://github.com/valerodev/cutterfish.git
cd cutterfish
install.bat

The installer checks your Java version, copies the JAR to the right location, and registers the cutterfish command system-wide. After installation, open a new terminal and run it from anywhere:

cutterfish

If you prefer not to install, you can also run the JAR directly: java -jar cutterfish.jar

Building from source

Requires Maven 3.8+:

mvn clean package
# Output: target/cutterfish.jar

🖥 Usage

After installation, run from anywhere:

cutterfish

You will be prompted for:

========================================
        🦑 CUTTERFISH v0.1
========================================

Enter target directory path: /path/to/scan
Select compliance standard [EU / US]: EU

Example output

==================================================
  SENSITIVE DATA FOUND
==================================================
  [HIGH]   API_KEY     | Line 3  | .env         | apikey=xK92mPqZ8nR3...
  [HIGH]   PASSWORD    | Line 9  | config.yml   | password=s3cr3tPass! (x2)
  [HIGH]   DNI         | Line 5  | contract.pdf | 74030258C
  [HIGH]   CREDIT_CARD | Line 7  | report.log   | 4539 1488 0343 6467
  [MEDIUM] EMAIL       | Line 12 | users.csv    | admin@example.com (x5)
  [MEDIUM] PHONE       | Line 44 | contacts.txt | 644 557 788

==================================================
  SCAN RESULTS SUMMARY
==================================================
  Files scanned:       134     Duration: 3s
  Files with findings: 4       High confidence: 10
  Total findings:      18      Medium confidence: 8

  FINDINGS BY TYPE:
  API_KEY: 1  |  PASSWORD: 2  |  DNI: 1  |  CREDIT_CARD: 1  |  EMAIL: 9  |  PHONE: 4

πŸ— Architecture

cutterfish/
├── cutterfish.jar
├── install.sh
├── install.bat
├── pom.xml
├── README.md
└── src/
    └── main/
        └── java/
            └── cutterfish/
                ├── Main.java
                ├── features/
                │   ├── scanning/
                │   │   ├── ScanManager.java
                │   │   ├── ScanResults.java
                │   │   ├── ScanException.java
                │   │   ├── FileCrawler.java
                │   │   ├── analysis/
                │   │   │   ├── PiiAnalyzer.java
                │   │   │   ├── EngineFactory.java
                │   │   │   ├── FindsValidator.java
                │   │   │   ├── AnalysisException.java
                │   │   │   └── EngineException.java
                │   │   ├── conversion/
                │   │   │   ├── TextExtractor.java
                │   │   │   ├── AbstractTextExtractor.java
                │   │   │   ├── TextExtractorFactory.java
                │   │   │   ├── PlainTextExtractor.java
                │   │   │   ├── PdfExtractor.java
                │   │   │   ├── DocxExtractor.java
                │   │   │   └── ExtractionException.java
                │   │   └── metadata/
                │   ├── reporting/
                │   │   ├── ConsolePrinter.java
                │   │   └── ReportGenerator.java
                │   └── ui/
                │       └── CommandLineInterface.java
                └── shared/
                    └── models/
                        ├── ScanContext.java
                        └── SensitiveData.java

| Class | Role |
| --- | --- |
| ScanManager | Central orchestrator: coordinates crawling, extraction, analysis and reporting |
| FileCrawler | Recursive filesystem traversal with fail-loud error propagation |
| TextExtractorFactory | Selects the right extractor (plain text / PDF / DOCX) per file type |
| PiiAnalyzer | Runs regex patterns and assigns confidence scores to each match |
| EngineFactory | Loads the correct pattern set for EU (GDPR) or US (HIPAA/CCPA) |
| FindsValidator | Algorithmic validation: DNI mod23, Luhn + IIN prefix, IBAN mod97 |
| ConsolePrinter | Formats and sorts findings by criticality for terminal output |
| ScanResults | Immutable result container with deduplication and occurrence counting |
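
As an illustration of the "fail-loud" traversal attributed to FileCrawler above, here is a minimal sketch. It is not the project's code: the names CrawlSketch, CrawlException and listRegularFiles are hypothetical stand-ins, and the real FileCrawler and ScanException may be shaped differently.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch; not the project's FileCrawler.
final class CrawlSketch {
    // Hypothetical stand-in for the project's ScanException.
    static final class CrawlException extends RuntimeException {
        CrawlException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Returns every regular file under root, or throws instead of returning an empty list.
    static List<Path> listRegularFiles(Path root) {
        if (!Files.isDirectory(root)) {
            throw new CrawlException("Not a directory: " + root, null);
        }
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile).collect(Collectors.toList());
        } catch (IOException | UncheckedIOException e) {
            // Files.walk reports errors eagerly (IOException) or lazily (UncheckedIOException).
            throw new CrawlException("Failed while walking " + root, e);
        }
    }

    public static void main(String[] args) {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(listRegularFiles(root).size() + " files found");
    }
}
```

The point of the design is that a mistyped path surfaces as an error rather than as a scan that reports zero findings.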

Design principles

  • SRP: each class has one reason to change. Extraction, analysis, validation, and presentation are fully decoupled.
  • OCP: new validators register via FindsValidator.registerValidator() without modifying existing code. New extractors implement TextExtractor and register in TextExtractorFactory.
  • Fail-loud: FileCrawler throws ScanException on invalid paths instead of silently returning empty results. Each layer has its own typed exception.
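
The open/closed idea behind validator registration can be sketched roughly as below. The signature of FindsValidator.registerValidator() is not documented here, so this uses a hypothetical ValidatorRegistry with an assumed (type name, predicate) shape; the "ZIP_CODE" type is also invented purely for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Hypothetical registry; the real FindsValidator API may differ.
final class ValidatorRegistry {
    private static final Map<String, Predicate<String>> VALIDATORS = new ConcurrentHashMap<>();

    // Adding a validator touches only the map, never the existing validators.
    static void registerValidator(String dataType, Predicate<String> validator) {
        VALIDATORS.put(dataType, validator);
    }

    // Assumption: types without a registered validator are accepted as-is.
    static boolean isValid(String dataType, String value) {
        Predicate<String> validator = VALIDATORS.get(dataType);
        return validator == null || validator.test(value);
    }

    public static void main(String[] args) {
        registerValidator("ZIP_CODE", v -> v.matches("[0-9]{5}")); // invented example type
        System.out.println(isValid("ZIP_CODE", "90210"));
        System.out.println(isValid("ZIP_CODE", "9021"));
    }
}
```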

🎯 Data Detection Master List

πŸ”΄ Phase 1 (current) | 🟑 Phase 1.5 | πŸ”΅ Phase 2

🛡️ 1. Government & Legal Identifiers

| Region | Data Type | Priority | Status |
| --- | --- | --- | --- |
| EU | DNI / NIE | 🔴 | ✅ Core Engine + Algorithmic validation |
| US | Social Security Number (SSN) | 🔴 | ✅ Core Engine |
| Global | Digital Signatures | 🔴 | ✅ Core Engine |
| EU | NIF / CIF / NUSS | 🟡 | Planned |
| EU | Passport | 🟡 | Planned |
| US | Driver's License | 🟡 | Planned |
| US | Tax ID (TIN / EIN) | 🟡 | Planned |
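
The "algorithmic validation" listed for DNI refers to the mod-23 control letter. A minimal sketch of that check follows; the class and method names (DniChecksum, isValidDni) are hypothetical rather than taken from FindsValidator, and NIE numbers (which start with X, Y or Z) are deliberately not handled.

```java
import java.util.regex.Pattern;

// Hypothetical helper; not the project's FindsValidator.
final class DniChecksum {
    // Control-letter table: the letter at index (number mod 23) must match.
    private static final String LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE";
    private static final Pattern DNI = Pattern.compile("[0-9]{8}[A-Za-z]");

    static boolean isValidDni(String candidate) {
        if (candidate == null || !DNI.matcher(candidate).matches()) {
            return false;
        }
        int number = Integer.parseInt(candidate.substring(0, 8));
        char expected = LETTERS.charAt(number % 23);
        return Character.toUpperCase(candidate.charAt(8)) == expected;
    }

    public static void main(String[] args) {
        System.out.println(isValidDni("74030258C")); // value shown in the example output
        System.out.println(isValidDni("74030258D")); // same digits, wrong letter
    }
}
```

A regex match alone accepts any eight digits followed by a letter; the checksum rejects 22 of every 23 random candidates, which is why validated DNI hits can be reported with higher confidence.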

🔑 2. Credentials & Access Secrets

| Region | Data Type | Priority | Status |
| --- | --- | --- | --- |
| Global | API Keys | 🔴 | ✅ Core Engine |
| Global | SSH Private Keys | 🔴 | ✅ Core Engine |
| Global | Passwords (assignment patterns, min. 6 chars) | 🔴 | ✅ Keyword Engine |
| Global | Connection Strings / DB URLs | 🔴 | ✅ Pattern Matcher |
| Global | Session Tokens | 🟡 | Planned |

💰 3. Financial & Wealth Data (PCI-DSS)

| Region | Data Type | Priority | Status |
| --- | --- | --- | --- |
| Global | Card Number (PAN) | 🔴 | ✅ Core Engine + Luhn + IIN prefix validation |
| Global | CVV / CVC | 🔴 | ✅ Core Engine |
| EU | IBAN | 🟡 | ✅ Core Engine + Mod97 validation |
| US | Bank Routing Number | 🟡 | ✅ Core Engine |
| EU | SWIFT | 🟡 | Planned |
| Global | Crypto Wallet Addresses | 🟡 | Planned |
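
For readers unfamiliar with the Mod97 check named in the IBAN row, here is a minimal sketch of the standard procedure: move the first four characters to the end, map letters to 10..35, and require the resulting number to leave remainder 1 when divided by 97. The class name IbanCheck is hypothetical, and the sketch does not enforce per-country lengths.

```java
import java.util.Locale;

// Hypothetical helper; not the project's FindsValidator.
final class IbanCheck {
    static boolean passesMod97(String candidate) {
        if (candidate == null) {
            return false;
        }
        String iban = candidate.replace(" ", "").toUpperCase(Locale.ROOT);
        // Generic shape only: country code, two check digits, 11 to 30 further characters.
        if (!iban.matches("[A-Z]{2}[0-9]{2}[A-Z0-9]{11,30}")) {
            return false;
        }
        String rearranged = iban.substring(4) + iban.substring(0, 4);
        int remainder = 0;
        for (char c : rearranged.toCharArray()) {
            if (c >= '0' && c <= '9') {
                remainder = (remainder * 10 + (c - '0')) % 97;
            } else {
                // Letters expand to two decimal digits: A=10 ... Z=35.
                remainder = (remainder * 100 + (c - 'A' + 10)) % 97;
            }
        }
        return remainder == 1;
    }

    public static void main(String[] args) {
        System.out.println(passesMod97("GB82 WEST 1234 5698 7654 32")); // widely published sample IBAN
        System.out.println(passesMod97("GB83 WEST 1234 5698 7654 32")); // altered check digits
    }
}
```

Computing the remainder digit by digit avoids big-integer arithmetic while giving the same result.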

πŸ₯ 4. Health & Biometrics (ePHI / HIPAA)

| Region | Data Type | Priority | Status |
| --- | --- | --- | --- |
| US | Health Insurance / Policy IDs | 🔴 | ✅ Core Engine |
| Global | Biometric Data (binary headers) | 🔴 | Planned |
| EU | Medical History patterns | 🔴 | Planned |
| US | Lab Results | 🟡 | Planned |

🌐 5. Location, Contact & Network

| Region | Data Type | Priority | Status |
| --- | --- | --- | --- |
| EU | Spanish Phone | 🟡 | ✅ Core Engine + Format validation |
| US | US Phone | 🟡 | ✅ Core Engine |
| Global | Email | 🟡 | ✅ Core Engine + Format validation |
| Global | IP / MAC Address | 🔵 | Planned |
| Global | Geolocation (EXIF) | 🔵 | Planned |
| Global | Physical Address | 🔵 | Planned |

Known Limitations

These are known sources of false positives in v0.1 that will be addressed in Phase 1.5:

  • PHONE: 9-digit sequences in binary logs (e.g. VirtualBox .log files) or sequential ID lists can match the Spanish phone pattern. Without surrounding context, these are indistinguishable from real phone numbers at the regex level.
  • CREDIT_CARD: numbers that pass the Luhn check by chance can still be reported, particularly in networking textbooks that use binary sequences as examples. IIN prefix validation (cards must start with 3, 4, 5 or 6) significantly reduces this, but does not eliminate it entirely.
  • No whitelist support yet: there is currently no way to ignore known safe values or specific files. This is the top priority for Phase 1.5.
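
To make the CREDIT_CARD limitation concrete, the sketch below combines a Luhn check with the leading-digit rule described above (first digit 3, 4, 5 or 6). It is an illustration under assumptions, not the project's validator: the names CardCheck, passesLuhn and looksLikeCard are hypothetical, and the 13 to 19 digit length range is an assumed bound.

```java
// Hypothetical helper; not the project's FindsValidator.
final class CardCheck {
    // Luhn: from the right, double every second digit, subtract 9 if above 9, sum must end in 0.
    static boolean passesLuhn(String digits) {
        int sum = 0;
        boolean doubleIt = false;
        for (int i = digits.length() - 1; i >= 0; i--) {
            int d = digits.charAt(i) - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) {
                    d -= 9;
                }
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }

    static boolean looksLikeCard(String candidate) {
        if (candidate == null) {
            return false;
        }
        String digits = candidate.replaceAll("[ -]", "");
        if (!digits.matches("[0-9]{13,19}")) { // assumed length range
            return false;
        }
        char first = digits.charAt(0);
        boolean knownPrefix = first >= '3' && first <= '6';
        return knownPrefix && passesLuhn(digits);
    }

    public static void main(String[] args) {
        System.out.println(looksLikeCard("4539 1488 0343 6467")); // value shown in the example output
        System.out.println(looksLikeCard("1234567812345670"));    // Luhn-valid, but starts with 1
    }
}
```

Roughly one in ten random digit strings passes Luhn, so the prefix rule narrows the candidates but any sufficiently long list of numbers can still produce accidental matches.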

🛣 Roadmap

✅ Phase 1: Core Engine (current)

  • Regex engine for High Priority PII (EU & US standards)
  • Algorithmic validation (Luhn + IIN prefix, DNI mod23, IBAN mod97)
  • Synthetic DNI filter (sequential, alternating and repeated-digit patterns)
  • PDF and DOCX text extraction (PDFBox 3.x, Apache POI 5.x)
  • Duplicate finding consolidation with occurrence counter
  • Findings sorted by criticality type, then file and line
  • Formatted console output with HIGH / MEDIUM confidence scoring
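
The "(x2)" and "(x5)" suffixes in the example output come from duplicate consolidation. One way such a counter could work is sketched below. The duplicate rule used here (same type, file and value) is an assumption, as are the names FindingConsolidator and Finding; the project's ScanResults and SensitiveData classes may do this differently.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the project's ScanResults.
final class FindingConsolidator {
    // Hypothetical finding shape; the real SensitiveData model may differ.
    static final class Finding {
        final String type;
        final String file;
        final int line;
        final String value;

        Finding(String type, String file, int line, String value) {
            this.type = type;
            this.file = file;
            this.line = line;
            this.value = value;
        }
    }

    // Assumed duplicate rule: same type, file and value. Keeps first-seen order.
    static Map<String, Integer> consolidate(List<Finding> raw) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Finding f : raw) {
            String key = f.type + " | " + f.file + " | " + f.value;
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Finding> raw = Arrays.asList(
                new Finding("EMAIL", "users.csv", 12, "admin@example.com"),
                new Finding("EMAIL", "users.csv", 30, "admin@example.com"),
                new Finding("PHONE", "contacts.txt", 44, "644 557 788"));
        consolidate(raw).forEach((key, n) ->
                System.out.println(key + (n > 1 ? " (x" + n + ")" : "")));
    }
}
```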

Phase 1.5: Hardening

  • Progress indicator for large directories
  • Multithreaded file processing
  • Whitelist / ignore patterns (values and files)
  • Configurable file size limits

Phase 2: Intelligence & Expansion

  • Metadata Forensics: EXIF GPS, Office/PDF author history
  • PII Redaction: mask, delete or replace sensitive findings directly in source files
  • JSON and PDF report export (ReportGenerator)
  • Custom patterns via JSON/YAML
  • Extended pattern library (NIF, CIF, crypto wallets...)

Phase 3: Enterprise & Cloud

  • REST API for CI/CD pipeline integration
  • Quarantine system: isolate high-risk files
  • Cloud storage support (S3, GDrive)

🤝 Contributing

Contributions are very welcome. Please open an issue before submitting a PR so we can discuss the change.

# 1. Fork the repo
# 2. Create your branch
git checkout -b feature/my-new-pattern

# 3. Commit your changes
git commit -m "feat: add crypto wallet detection pattern"

# 4. Push and open a PR
git push origin feature/my-new-pattern

New pattern validators should be added via FindsValidator.registerValidator(), and new extractors by implementing TextExtractor; no existing classes need to be modified.
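
For contributors wondering what "implementing an extractor" might involve, the sketch below shows one plausible shape: a small interface plus a lookup by file extension. Everything in it is hypothetical (the Extractor interface, the register and forFile methods, the "md" extension); the project's TextExtractor and TextExtractorFactory signatures are not shown in this README and may differ.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; not the project's TextExtractorFactory.
final class ExtractorSketch {
    // Hypothetical stand-in for the project's TextExtractor.
    interface Extractor {
        String extract(Path file) throws IOException;
    }

    private static final Map<String, Extractor> BY_EXTENSION = new HashMap<>();

    static void register(String extension, Extractor extractor) {
        BY_EXTENSION.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    // Picks an extractor from the file extension; empty when none is registered.
    static Optional<Extractor> forFile(Path file) {
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        if (dot < 0) {
            return Optional.empty();
        }
        String extension = name.substring(dot + 1).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(BY_EXTENSION.get(extension));
    }

    public static void main(String[] args) throws IOException {
        register("md", file -> new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
        Path tmp = Files.createTempFile("sketch", ".md");
        Files.write(tmp, "contact: admin@example.com".getBytes(StandardCharsets.UTF_8));
        System.out.println(forFile(tmp).get().extract(tmp));
        Files.delete(tmp);
    }
}
```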


📄 License

Distributed under the Apache 2.0 License. See LICENSE for more information.
