It would be useful to enhance Docling's metadata extraction by detecting document stamps (printed or scanned labels such as "Confidentiel Défense" or "Medical Data") and using them to infer the sensitivity level of a document.
Motivation:
Some documents include visible stamps or markings that indicate their confidentiality or sensitivity level. Automatically detecting such stamps during metadata extraction would allow for better compliance with data handling policies, such as:
Preventing further data extraction if a document is marked with a high-sensitivity label (e.g., "Confidentiel Défense", "Medical Data", "Classified")
Tagging documents appropriately in downstream processes
Ensuring ethical or legal handling of documents
Proposed enhancement:
Implement a module that scans for common sensitivity-related stamps in documents (PDFs, images, OCR output, etc.)
Maintain a configurable list of keywords/phrases indicating restricted content (e.g., "confidential", "secret", "medical")
Flag or halt extraction processes when such terms are detected
Possibly integrate with existing metadata fields or create a new sensitivity_level metadata field
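To make the proposal concrete, here is a minimal sketch of the keyword-matching part, assuming the page text has already been extracted or OCR'd into a plain string. It does not use any Docling API: `DEFAULT_STAMPS`, `SensitivityResult`, `detect_sensitivity`, `guard_extraction`, and `ExtractionHalted` are hypothetical names, and the phrases and numeric levels are only illustrative.

```python
import re
import unicodedata
from dataclasses import dataclass, field

# Hypothetical, configurable mapping of stamp phrases to sensitivity levels.
# Higher numbers mean more restricted; phrases and levels are illustrative only.
DEFAULT_STAMPS = {
    "Confidentiel Défense": 3,
    "classified": 3,
    "secret": 3,
    "medical data": 2,
    "confidential": 1,
}


@dataclass
class SensitivityResult:
    sensitivity_level: int = 0  # 0 means no configured phrase was found
    matches: list = field(default_factory=list)


class ExtractionHalted(Exception):
    """Raised when a document is too sensitive to keep processing."""


def normalize(text: str) -> str:
    """Strip accents, collapse whitespace, and casefold so OCR variants match."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"\s+", " ", no_accents).casefold().strip()


def detect_sensitivity(text: str, stamps=DEFAULT_STAMPS) -> SensitivityResult:
    """Return the highest level among the configured phrases found in text."""
    haystack = normalize(text)
    result = SensitivityResult()
    for phrase, level in stamps.items():
        needle = normalize(phrase)
        # Word boundaries keep "secret" from matching inside "secretary".
        if re.search(r"\b" + re.escape(needle) + r"\b", haystack):
            result.matches.append(needle)
            result.sensitivity_level = max(result.sensitivity_level, level)
    return result


def guard_extraction(text: str, halt_at: int = 3, stamps=DEFAULT_STAMPS) -> SensitivityResult:
    """Flag the document, or halt if its level reaches the halt_at threshold."""
    result = detect_sensitivity(text, stamps)
    if result.sensitivity_level >= halt_at:
        raise ExtractionHalted(f"sensitivity markers found: {result.matches}")
    return result
```

A caller could run `guard_extraction(page_text)` on the first page before continuing, and store the returned `sensitivity_level` in the document metadata. Plain keyword matching will produce false positives (for example, a report that merely discusses "classified" material), so a real implementation would probably also need to look at layout cues such as stamp position or font.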
Benefits:
Better compliance with data protection regulations (e.g., GDPR, HIPAA, national security laws)
Prevent unintentional leaks or processing of restricted content
Add value for users working in government, healthcare, or legal domains
Happy to discuss implementation ideas.