In the ByteScout PDF Extractor SDK, text extraction defaults to PreserveFormattingOnTextExtraction = true.
This preserves the approximate layout of the PDF.
The ExtractText feature of MuPDF.net does not have this option.
As an example, take this PDF:
statement_sample1.pdf
Running it through the default ExtractText in MuPDF produces:
statement_sample1.mupdf.txt
Rows are broken up and tables are lost entirely (see the Checks Paid section).
Running it through the default ByteScout conversion produces:
statement_sample1.bytescout.txt
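ByteScout code:
// Create the extractor; disposed automatically at the end of the enclosing scope.
using TextExtractor te = new()
{
...
};
// Load the PDF, then write the extracted text to a file.
te.LoadDocumentFromFile(pdfPath);
te.SaveTextToFile(outputPath);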
The layout of the PDF is preserved: the Account summary headings and values remain on the same rows, and the Checks Paid section remains in a table layout.
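To make the request concrete, here is a minimal sketch of what a layout-preserving mode could do, assuming word-level bounding boxes are available from the extractor. The Word record, the LayoutSketch class, and the charWidth / lineHeight parameters are all hypothetical names for illustration; this is not how ByteScout or MuPDF implement it, just one simple way to approximate the behaviour:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical word record: the text plus the left (X) and top (Y) edge of its box, in points.
public record Word(string Text, double X, double Y);

public static class LayoutSketch
{
    // Places words on a character grid so that columns stay roughly aligned.
    // charWidth and lineHeight are assumed average glyph metrics in points.
    public static string Render(IEnumerable<Word> words,
                                double charWidth = 6.0,
                                double lineHeight = 12.0)
    {
        var sb = new StringBuilder();

        // Group words into rows by rounding Y to the nearest text line.
        // Blank rows between groups are not reproduced in this sketch.
        var rows = words
            .GroupBy(w => (int)Math.Round(w.Y / lineHeight))
            .OrderBy(g => g.Key);

        foreach (var row in rows)
        {
            var line = new StringBuilder();
            foreach (var w in row.OrderBy(w => w.X))
            {
                // Target column derived from the word's horizontal position.
                int col = Math.Max(0, (int)Math.Round(w.X / charWidth));

                // Never overwrite earlier text: keep at least one space between words.
                if (line.Length > 0 && col <= line.Length)
                {
                    col = line.Length + 1;
                }

                line.Append(' ', col - line.Length); // pad with spaces up to the column
                line.Append(w.Text);
            }
            sb.AppendLine(line.ToString());
        }
        return sb.ToString();
    }
}
```

With made-up input such as "Date" at X = 0 and "Amount" at X = 120 on one row, followed by "01/02" and "45.00" at the same X positions on the next row, both right-hand values start in the same text column, so the table shape survives in plain text.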