Commit de202ec

blog: sensitive data discovery tool (#954)
1 parent 179a974 commit de202ec

7 files changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
---
title: Top Open Source Sensitive Data Discovery Tools in 2025
author: Tianzhou
updated_at: 2025/12/01 09:00:00
feature_image: /content/blog/top-open-source-sensitive-data-discovery-tools/banner.webp
tags: Industry
description: A guide to open source sensitive data discovery tools - from lightweight CLI scanners like PiiCatcher and Hawk-Eye to full data platforms like OpenMetadata.
keypage: true
---

## Introduction

Sensitive data discovery is the first step in protecting PII, PHI, and other regulated information. Before you can mask, encrypt, or restrict access to sensitive data, you need to find it.

Open source tools for this task span a spectrum:

- **NLP libraries** - Building blocks for custom detection pipelines
- **Lightweight CLI scanners** - Quick, targeted scans for developers and CI pipelines
- **Full data platforms** - Comprehensive metadata management with classification as one feature

This article covers four tools across this spectrum, from NLP foundations to enterprise data catalogs with built-in classification.


## spaCy

[spaCy](https://github.com/explosion/spaCy) is an industrial-strength NLP library that powers many sensitive data detection tools.

![spaCy](/content/blog/top-open-source-sensitive-data-discovery-tools/spacy.webp)

spaCy provides named entity recognition (NER) that can identify persons, organizations, locations, and other entity types in text. Both PiiCatcher and OpenMetadata use spaCy under the hood for ML-based PII detection. If you need maximum flexibility, you can build custom detection pipelines directly with spaCy, though the tools below provide ready-to-use solutions.


## PiiCatcher

[PiiCatcher](https://github.com/tokern/piicatcher) is a focused CLI scanner that detects PII in databases and tags findings in data catalogs.

![piicatcher](/content/blog/top-open-source-sensitive-data-discovery-tools/piicatcher.webp)

**Detection approach:** PiiCatcher uses two methods: regex pattern matching against column names, and NLP-based analysis of sample data using spaCy. This dual approach catches both obviously named columns (e.g., `email`, `ssn`) and columns with generic names but sensitive content.

**Data source support:** PostgreSQL, MySQL, SQLite, Redshift, Athena, Snowflake, BigQuery.

**Key strength:** Native integration with data catalogs. PiiCatcher can automatically tag discovered PII in DataHub or Amundsen, bridging the gap between standalone scanning and catalog-based governance.

**Best for:** Teams wanting a lightweight scanner that feeds into their existing data catalog.

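
The dual detection approach described for PiiCatcher can be sketched with the standard library alone. This is a minimal illustration, not PiiCatcher's code: the pattern tables, the function names, and the `min_ratio` threshold are all assumptions, and plain regexes stand in for the spaCy-based analysis of sample values.

```python
import re

# Hypothetical column-name patterns (illustrative, not PiiCatcher's rules).
COLUMN_NAME_PATTERNS = {
    "EMAIL": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "SSN": re.compile(r"(^|_)ssn(_|$)|social[-_]?security", re.IGNORECASE),
    "PHONE": re.compile(r"phone|mobile", re.IGNORECASE),
}

# Hypothetical value patterns; a real scanner would use NLP here as well.
VALUE_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\d{3}-\d{2}-\d{4}"),
}


def classify_by_name(column):
    """Return a PII label if the column name looks sensitive, else None."""
    for label, pattern in COLUMN_NAME_PATTERNS.items():
        if pattern.search(column):
            return label
    return None


def classify_by_values(samples, min_ratio=0.8):
    """Return a label if at least min_ratio of non-empty samples match it."""
    values = [str(v).strip() for v in samples if v is not None and str(v).strip()]
    if not values:
        return None
    for label, pattern in VALUE_PATTERNS.items():
        hits = sum(1 for v in values if pattern.fullmatch(v))
        if hits / len(values) >= min_ratio:
            return label
    return None


def classify_column(column, samples):
    """Check the column name first, then fall back to sampled content."""
    return classify_by_name(column) or classify_by_values(samples)
```

With these assumed patterns, `classify_column("contact", ["a@example.com", "b@example.org"])` is labeled `EMAIL` from its content even though the column name is generic, which is the case the second method exists to catch.
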

## Hawk-Eye

[Hawk-Eye](https://github.com/rohitcoder/hawk-eye) is a broad-spectrum scanner covering databases, cloud storage, and files - including images and videos via OCR.

![hawk-eye](/content/blog/top-open-source-sensitive-data-discovery-tools/hawk-eye.webp)

**Detection approach:** Pattern matching with configurable fingerprints defined in YAML. It also supports OCR for images and documents (350+ file types, including DOCX, PDF, images, and videos).

**Data source support:** MySQL, PostgreSQL, MongoDB, CouchDB, Redis, S3, Google Cloud Storage, Firebase, Slack, Google Drive, local filesystem.

**Key strength:** Breadth of coverage. Unlike database-only scanners, Hawk-Eye finds PII across your entire data footprint - useful when sensitive data leaks into unstructured storage.

**Best for:** Security teams auditing diverse data sources beyond just databases.

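
Fingerprint-based pattern matching of the kind Hawk-Eye uses can be sketched as a table of named regexes applied line by line to extracted text. The sketch below is an assumption-laden illustration: the fingerprint names and patterns are hypothetical, they live in a Python dict rather than Hawk-Eye's YAML file, and `scan_text` is not a Hawk-Eye function.

```python
import re

# Hypothetical fingerprints: name -> regex. Hawk-Eye keeps its own set in
# YAML; these entries are illustrative and not copied from the project.
FINGERPRINTS = {
    "email": r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",
    "aws_access_key_id": r"\bAKIA[0-9A-Z]{16}\b",
    "us_ssn": r"\b\d{3}-\d{2}-\d{4}\b",
}


def compile_fingerprints(fingerprints):
    """Compile each pattern once so repeated scans stay cheap."""
    return {name: re.compile(pattern) for name, pattern in fingerprints.items()}


def scan_text(text, compiled):
    """Return (line_number, fingerprint_name, matched_text) for every hit."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for name, pattern in compiled.items():
            for match in pattern.finditer(line):
                findings.append((line_no, name, match.group(0)))
    return findings
```

The same `scan_text` function could be fed text from a database row, a downloaded object, or OCR output, which is roughly why a fingerprint table generalizes across so many source types. A production scanner would likely redact the matched value before logging it.
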

## OpenMetadata

[OpenMetadata](https://github.com/open-metadata/OpenMetadata) is a unified metadata platform with auto-classification as a core governance feature.

![openmetadata](/content/blog/top-open-source-sensitive-data-discovery-tools/open-metadata.webp)

**Detection approach:** Auto-Classification workflow powered by spaCy with configurable confidence levels (0-100). The system identifies PII and either auto-applies tags or suggests them for review. It runs as a separate workflow from metadata ingestion, so you can tune classification independently.

**Data source support:** 84+ connectors spanning databases, dashboards, messaging, and pipelines.

**Key strength:** Tight integration between classification and governance workflows. Tags flow into data quality rules, access policies, and team collaboration features. The no-code profiler makes classification accessible to non-engineers.

**Best for:** Teams wanting a modern, API-first platform where classification drives downstream governance policies.

**Alternative:** [DataHub](https://github.com/datahub-project/datahub) is another open-source metadata platform, but its auto-classification feature only supports Snowflake and has been [marked as deprecated](https://docs.datahub.com/docs/metadata-ingestion/docs/dev_guides/classification). If you're using DataHub, consider pairing it with PiiCatcher for broader classification coverage.

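
To make the confidence-level idea from OpenMetadata's detection approach concrete, here is a small sketch of how a 0-100 score could route a detection to "auto-apply" or "suggest for review". It is not OpenMetadata's implementation or API: the `Detection` type, the `triage` function, and the default threshold of 80 are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One classifier result (hypothetical shape, not an OpenMetadata type)."""
    column: str
    tag: str
    confidence: int  # 0-100, as in the confidence levels described above


def triage(detections, auto_apply_at=80):
    """Split detections into (auto-applied, suggested-for-review) lists.

    The threshold is an assumed tuning knob: at or above it the tag is
    applied directly, below it the tag is only suggested to a reviewer.
    """
    if not 0 <= auto_apply_at <= 100:
        raise ValueError("auto_apply_at must be between 0 and 100")
    applied, suggested = [], []
    for detection in detections:
        if detection.confidence >= auto_apply_at:
            applied.append(detection)
        else:
            suggested.append(detection)
    return applied, suggested
```

Raising the threshold trades fewer false tags for more manual review, which is the kind of tuning a separate classification workflow makes possible without re-running ingestion.
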

## Comparison

| Tool | Language | Primary Use Case | Detection Method | Data Source Support | Deployment | License |
| --- | --- | --- | --- | --- | --- | --- |
| **spaCy** | Python | NLP library / building block | Named entity recognition (NER), ML models | N/A (text processing only) | pip | MIT |
| **PiiCatcher** | Python | CLI scanner for databases | Regex + NLP (spaCy) | PostgreSQL, MySQL, SQLite, Redshift, Athena, Snowflake, BigQuery | pip, Docker | Apache 2.0 |
| **Hawk-Eye** | Python | Multi-source scanner (DBs, cloud, files) | Pattern matching + OCR | MySQL, PostgreSQL, MongoDB, CouchDB, Redis, S3, GCS, Firebase, Slack, Google Drive, local filesystem | pip, Docker | LGPL 2.1 + Commons Clause |
| **OpenMetadata** | Java / Python | Data platform with governance | Auto-classification workflow (spaCy), confidence thresholds | 84+ connectors | Docker, Kubernetes | Apache 2.0 |

![star-history](/content/blog/top-open-source-sensitive-data-discovery-tools/star-history.webp)

Choose based on your needs:

- **spaCy** - Build custom detection pipelines with maximum flexibility
- **PiiCatcher** - Catalog-integrated database scanning
- **Hawk-Eye** - Broad coverage across databases, cloud storage, and files
- **OpenMetadata** - Classification within a full metadata platform

Start lightweight, then graduate to full platforms as governance requirements grow.


## From Discovery to Protection

Finding sensitive data is only half the challenge - you then need to protect it. [Bytebase provides dynamic data masking](https://docs.bytebase.com/security/data-masking/overview) that can be driven by classification results.

Once you've identified PII columns using the tools above, you can call the Bytebase REST/gRPC API to apply masking policies programmatically. This creates an automated pipeline: scan → classify → mask, ensuring discovered sensitive data is protected without manual intervention.

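
The scan → classify → mask flow can be sketched as three pluggable steps. Everything named here is hypothetical: `run_pipeline`, its three callables, and the tuple shapes are illustration only, and `apply_mask` stands in for whatever Bytebase API request you use. No endpoint or payload is shown because the article does not specify one.

```python
def run_pipeline(scan, classify, apply_mask):
    """Wire scan -> classify -> mask together.

    scan()                         yields (table, column, samples) tuples
    classify(column, samples)      returns a tag such as "EMAIL", or None
    apply_mask(table, column, tag) performs the masking call (for example,
                                   a request to a masking API); it is
                                   injected so this sketch assumes nothing
                                   about that API's shape
    """
    masked = []
    for table, column, samples in scan():
        tag = classify(column, samples)
        if tag is None:
            continue  # nothing sensitive detected, leave the column alone
        apply_mask(table, column, tag)
        masked.append((table, column, tag))
    return masked
```

Injecting the three steps keeps the pipeline testable offline: a scanner such as PiiCatcher or Hawk-Eye can supply `scan` and `classify`, while `apply_mask` can be a stub in CI and a real API client in production.
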