Enhancement: Extract and evaluate sensitivity metadata via stamp detection #1422

flefevre · 2025-04-20T16:02:08Z

Description:

Docling's metadata extraction could be enhanced to detect document stamps (printed or scanned labels such as "Confidentiel Défense" or "Medical Data") and use them to infer a document's sensitivity level.

Motivation:
Some documents include visible stamps or markings that indicate their confidentiality or sensitivity level. Automatically detecting such stamps during metadata extraction would allow for better compliance with data handling policies, such as:

- Preventing further data extraction if a document carries a high-sensitivity label (e.g., "Confidentiel Défense", "Medical Data", "Classified")
- Tagging documents appropriately in downstream processes
- Ensuring ethical and legal handling of documents

Proposed enhancement:

- Implement a module that scans documents (PDFs, images, OCR output, etc.) for common sensitivity-related stamps
- Maintain a configurable list of keywords/phrases that indicate restricted content (e.g., "confidential", "secret", "medical")
- Flag or halt extraction when such terms are detected
- Either integrate with existing metadata fields or create a new sensitivity_level metadata field
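
To make the proposal concrete, here is a minimal sketch of the keyword-matching part only, assuming the document text has already been extracted or OCR'd into a plain string. It does not use any Docling API, and it does not detect stamps visually. Every name in it (SensitivityLevel, DEFAULT_PHRASES, detect_sensitivity, guard_extraction, SensitiveDocumentError) is hypothetical and chosen for illustration, as are the phrase-to-level assignments.

```python
import re
import unicodedata
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict, List


class SensitivityLevel(IntEnum):
    """Hypothetical ordering; higher values mean stricter handling."""
    NONE = 0
    RESTRICTED = 1
    HIGH = 2


# Hypothetical defaults; a real deployment would load these from configuration.
DEFAULT_PHRASES: Dict[str, SensitivityLevel] = {
    "Confidentiel Défense": SensitivityLevel.HIGH,
    "classified": SensitivityLevel.HIGH,
    "secret": SensitivityLevel.HIGH,
    "medical data": SensitivityLevel.RESTRICTED,
    "confidential": SensitivityLevel.RESTRICTED,
}


class SensitiveDocumentError(RuntimeError):
    """Raised when a document meets or exceeds the halt threshold."""


@dataclass
class SensitivityResult:
    sensitivity_level: SensitivityLevel
    matched_phrases: List[str] = field(default_factory=list)


def normalize(text: str) -> str:
    """Lowercase, strip accents, and collapse whitespace so OCR variants match."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", no_accents).strip().lower()


def detect_sensitivity(
    text: str, phrases: Dict[str, SensitivityLevel] = DEFAULT_PHRASES
) -> SensitivityResult:
    """Return the highest level among the configured phrases found in the text."""
    haystack = normalize(text)
    normalized = {normalize(p): level for p, level in phrases.items()}
    # Word boundaries keep "secret" from matching inside "secretary".
    matched = [
        p for p in normalized if re.search(rf"\b{re.escape(p)}\b", haystack)
    ]
    level = max((normalized[p] for p in matched), default=SensitivityLevel.NONE)
    return SensitivityResult(level, sorted(matched))


def guard_extraction(
    text: str,
    halt_at: SensitivityLevel = SensitivityLevel.HIGH,
    phrases: Dict[str, SensitivityLevel] = DEFAULT_PHRASES,
) -> SensitivityResult:
    """Flag the document, and halt by raising if it reaches the threshold."""
    result = detect_sensitivity(text, phrases)
    if result.sensitivity_level >= halt_at:
        raise SensitiveDocumentError(
            f"Extraction halted; matched: {', '.join(result.matched_phrases)}"
        )
    return result
```

A caller could run guard_extraction on the first page's text before converting the rest of the document, and store the returned sensitivity_level alongside the other metadata. Plain phrase matching will miss rotated, faint, or partially occluded stamps, so a production version would likely need a visual detection step in front of it.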

Benefits:

- Improves compliance with data protection regulations (e.g., GDPR, HIPAA, national security laws)
- Prevents unintentional leaks or processing of restricted content
- Adds value for users working in government, healthcare, or legal domains

Happy to discuss implementation ideas.
