PDFtoTextConvertor and PDFOCRtoTextConverter #3441

stefanqxb · 2022-10-20T10:21:58Z


Hi there,
I recently started using the TextConverter to read PDFs into textual data for later use. During processing, I found that the outputs of PDFtoTextConverter and PDFOCRtoTextConverter are quite different, which has an impact on the downstream QA task.

In my case, some of our PDFs are text-only and some contain embedded images. For the ones with embedded images we have to use the PDFOCRtoTextConverter so that we don't lose the information presented in the images. But in a general performance evaluation, we see that the QA model performs better (finding the correct answer with a high score) on the PDFtoTextConverter output than on the output from PDFOCRtoTextConverter.

So I'm wondering: 1. Could the PDFOCR converter be enhanced so that it performs better? 2. Is there a way in Haystack to tell whether a PDF really needs an OCR converter (i.e. whether the PDF contains any non-text parts)? Thanks a lot for the help.

Brs
Bin

Answered by ZanSara


ZanSara · 2022-10-20T16:54:23Z


Hey @stefanqxb! It really depends a lot on your specific documents, what they look like, etc. You can try tweaking the parameters of PDFOCRtoTextConverter, but there's only so much that can be done. If it's not too much overhead, you can try passing the document through both converters: duplicate text is not stored in the document store by default, so you can do the conversion first with PDFtoTextConverter, store the docs, and then do the same with PDFOCRtoTextConverter. If it manages to extract some new text from the images, that text will be stored, and all the duplicate text will be ignored.

Not the best workaround, but you can give it a try and see if it helps!
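
To make the "run both converters and let duplicates drop out" idea concrete, here is a minimal sketch in plain Python. It is not Haystack's implementation: `content_key` and `merge_without_duplicates` are hypothetical helper names, and the passages are assumed to be plain strings already produced by the two converters. Duplicates are detected by hashing the whitespace-normalized text.

```python
import hashlib
from typing import List


def content_key(text: str) -> str:
    """Hash the text after collapsing whitespace runs (assumption: layout
    differences alone should not make two passages count as different)."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def merge_without_duplicates(primary: List[str], secondary: List[str]) -> List[str]:
    """Keep every primary passage, then add only the secondary passages
    whose content has not been seen yet."""
    seen = set()
    merged = []
    for passage in primary + secondary:
        key = content_key(passage)
        if key in seen:
            continue  # same content already stored, skip it
        seen.add(key)
        merged.append(passage)
    return merged
```

Note that this kind of matching only catches passages whose text is identical after normalization. If the OCR output differs from the embedded text by even one character, both versions would be kept, so it is worth checking how many near-duplicates end up in the store.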

3 replies

stefanqxb Oct 21, 2022
Author

Hi Sara,
Thanks for the quick response. That is actually my current workaround :) I take the PDFtoText as first choice and fall back to the PDFOCRtoText otherwise. It works in most cases, so I guess I can live with it for now. But it would be good if the PDFOCRtoTextConverter could align more closely with the PDFtoTextConverter in terms of text layout etc.
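
A minimal sketch of this "text extraction first, OCR as backup" approach, in plain Python. The two extractor callables and the `min_chars` threshold are assumptions for illustration; in practice they would wrap whatever converters are in use.

```python
from typing import Callable


def convert_with_fallback(
    pdf_path: str,
    extract_text: Callable[[str], str],
    extract_with_ocr: Callable[[str], str],
    min_chars: int = 50,  # hypothetical threshold, tune per corpus
) -> str:
    """Return the embedded text if there is enough of it, otherwise OCR output."""
    text = extract_text(pdf_path)
    # Count only non-whitespace characters so blank pages do not pass the check.
    if len("".join(text.split())) >= min_chars:
        return text
    return extract_with_ocr(pdf_path)
```

A threshold like this only detects PDFs with little or no embedded text. A PDF that has plenty of text plus images carrying extra information would still skip OCR, so the image content would be missed.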

Btw, I might have found a small bug in the conversion with PDFOCRtoTextConverter: it sometimes creates an extra space in the answer.context part that does not exist in the original Document.context. I'm not sure whether it's due to a specific PDF or something else, but it may be worth taking a look.
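
One way to narrow down such a mismatch is to check whether the answer context appears verbatim in the document text, and if not, whether it matches once all whitespace is ignored. The sketch below is plain Python on two strings; `locate_context` is a hypothetical helper, not part of Haystack.

```python
def _strip_whitespace(text: str) -> str:
    """Remove every whitespace character from the text."""
    return "".join(text.split())


def locate_context(context: str, document_text: str) -> str:
    """Classify how an answer context relates to the document text."""
    if context in document_text:
        return "exact"
    # Ignoring whitespace can match across word boundaries, so treat this
    # result as a diagnostic hint, not as proof.
    if _strip_whitespace(context) in _strip_whitespace(document_text):
        return "whitespace-only difference"
    return "not found"
```

Running this over the affected answers would show whether the difference is limited to whitespace or whether other characters changed as well.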

Brs
Bin

ZanSara Oct 24, 2022

Thank you for the report! If you have a small example PDF that shows this issue, you can open an issue for this bug 👍

stefanqxb Nov 12, 2022
Author

Hi Sara,
Sorry for the late reply. Unfortunately, due to company regulations, I might not be able to provide you with the example: the issue was observed with one of the company's confidential documents, so I'm not allowed to show it here, and I guess the content and layout of the doc might matter. Just for your info, we only saw this problem with 2 docs out of around 70 in total :)
