Implement Bytescout SDK's PreserveFormattingOnTextExtraction feature #166

@margelef

In ByteScout PDF Extractor SDK, the default text extraction option is PreserveFormattingOnTextExtraction = true.
This preserves the approximate layout of the PDF in the extracted text.

The ExtractText feature of MuPDF.NET does not have this option.

As an example, take this PDF:
statement_sample1.pdf

Running it through the default ExtractText in MuPDF produces:
statement_sample1.mupdf.txt

Rows are broken up and the table structure is lost entirely (see the Checks Paid section).

Running it through the default Bytescout conversion produces this:
statement_sample1.bytescout.txt

Bytescout code:

    using TextExtractor te = new()
    {
    ...
    };
    te.LoadDocumentFromFile(pdfPath);
    te.SaveTextToFile(outputPath);

The layout of the PDF is preserved: the Account summary headings and values remain on the same rows, and the Checks Paid section remains in a table layout.
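
To make the requested behaviour concrete, here is a minimal sketch of one way layout-preserving output could be produced from positioned words. It is not ByteScout's algorithm and it does not call any MuPDF.NET API. The `Word` record, the `LayoutText` class, and the `charWidth` and `lineTolerance` values are all hypothetical, chosen only for illustration. The idea is to group words into rows by their vertical position and then pad each word out to a character column derived from its horizontal position.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical input type: one word plus the top-left corner of its
// bounding box, in PDF points. Not a type from any SDK.
public record Word(string Text, double X, double Y);

public static class LayoutText
{
    // Rebuilds an approximate page layout on a fixed-width character grid.
    // charWidth and lineTolerance are assumed tuning values.
    public static string Render(IEnumerable<Word> words,
                                double charWidth = 6.0,
                                double lineTolerance = 3.0)
    {
        var sb = new StringBuilder();
        var row = new List<Word>();
        double rowY = 0;

        foreach (var w in words.OrderBy(w => w.Y).ThenBy(w => w.X))
        {
            // A word whose Y differs too much from the row's first word starts a new row.
            if (row.Count > 0 && Math.Abs(w.Y - rowY) > lineTolerance)
            {
                AppendRow(sb, row, charWidth);
                row.Clear();
            }
            if (row.Count == 0) rowY = w.Y;
            row.Add(w);
        }
        if (row.Count > 0) AppendRow(sb, row, charWidth);
        return sb.ToString();
    }

    private static void AppendRow(StringBuilder sb, List<Word> row, double charWidth)
    {
        var line = new StringBuilder();
        foreach (var w in row.OrderBy(w => w.X))
        {
            // Map the horizontal position to a character column.
            int column = (int)Math.Round(w.X / charWidth);
            // Pad to that column, always keeping at least one space between words.
            int pad = Math.Max(column - line.Length, line.Length > 0 ? 1 : 0);
            line.Append(' ', pad).Append(w.Text);
        }
        sb.Append(line.ToString().TrimEnd()).Append('\n');
    }
}
```

Calling `LayoutText.Render` with the words of a two-column table row (for example a heading at X = 0 and a value at X = 120, both at the same Y) keeps them on one output line, separated by spaces in proportion to their horizontal distance. This is the effect described above for the Account summary and Checks Paid sections.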
