add clean_extracted_text method #1651

@Luis-manzur

Description

This method should be called after Doctor has extracted text from the binary content. It cleans up extraction artifacts, formatting issues, and unwanted text that appear in the extracted output. This is typically needed when the extraction process introduces unwanted characters, preserves headers or footers from the original document, or includes metadata that should be removed from the final plain text.
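A rough sketch of what such a hook could look like, for discussion only. The class name `ExampleSite`, the `header_pattern` attribute, and the specific cleanup rules are all assumptions made for illustration; they are not taken from the existing codebase, and each scraper would likely need its own rules.

```python
import re


class ExampleSite:
    """Hypothetical scraper class; the real class hierarchy may differ."""

    # Assumed example of a running header/footer that extraction preserves.
    header_pattern = re.compile(r"^[ \t]*Page \d+ of \d+[ \t]*$", re.MULTILINE)

    def clean_extracted_text(self, text: str) -> str:
        """Return `text` with some common extraction artifacts removed."""
        # Drop form feeds and other control characters, keeping \t, \n, \r.
        text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)
        # Blank out lines that match the assumed header/footer pattern.
        text = self.header_pattern.sub("", text)
        # Strip trailing spaces and tabs from every line.
        text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
        # Collapse runs of three or more newlines into one blank line.
        text = re.sub(r"\n{3,}", "\n\n", text)
        return text.strip()


raw = "Page 1 of 2\nOpinion of the Court\x0c\n\n\n\nPage 2 of 2\nAffirmed.  \n"
cleaned = ExampleSite().clean_extracted_text(raw)
```

Keeping the patterns as class attributes would let individual scrapers override only the rules that differ for their court, while the default implementation could simply return the text unchanged.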

This will help us solve issue #6443 from CL.

Projects

Status

PR'd Issues 🤞
