Hi there,

In my case, we have some PDFs that are text-only and some with embedded images. For the ones with embedded images we have to use the PDFOCRtoTextConverter so we don't lose the information presented in the images. But in a general performance evaluation, we see that the QA model performs better (finding the correct answer with a high score) on the PDFtoTextConverter output than on the output from PDFOCRtoTextConverter. So I'm wondering:

1. Could the PDFOCR converter be enhanced so that it performs better?
2. Is there a way in Haystack to tell whether a PDF really needs an OCR converter (i.e. whether the PDF contains any non-text parts)?

Thanks a lot for the help.

Brs
Hey @stefanqxb! It really depends on your specific documents, what they look like, etc. You can try tweaking the parameters of PDFOCRToTextConverter, but there's only so much that can be done. If it's not too much overhead, you can try passing the documents through both converters: duplicate text is not stored in the document store by default, so you can do the conversion first with PDFtoTextConverter, store the docs, and then do the same with PDFOCRtoTextConverter. If it manages to extract some new text from the images, that text will be stored, and all the duplicate text will be ignored.

Not the best workaround, but you can give it a try and see if it helps!
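To make the two-pass idea a bit more concrete, here is a rough sketch in plain Python. It is not the Haystack API: `TinyStore`, its `write` method, and the sample strings are all made up for illustration, and it assumes duplicates are detected by hashing whitespace-normalized text, which may differ from what a real document store does.

```python
import hashlib


class TinyStore:
    """Hypothetical in-memory store that skips duplicate text (not Haystack)."""

    def __init__(self):
        self.docs = {}  # content hash -> normalized text

    def write(self, texts):
        """Store each text unless an identical one is already present.

        Returns the number of texts that were actually added.
        """
        added = 0
        for text in texts:
            # Collapse whitespace so trivial spacing differences still match.
            normalized = " ".join(text.split())
            if not normalized:
                continue
            key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
            if key not in self.docs:
                self.docs[key] = normalized
                added += 1
        return added


store = TinyStore()

# Pass 1: text pulled from the PDF's text layer (made-up sample data).
text_layer = ["Invoice 42", "Total: 10 EUR"]
first = store.write(text_layer)

# Pass 2: OCR output of the same file; only the image-only text is new.
ocr_output = ["Invoice 42", "Total:  10 EUR", "Stamp: PAID"]
second = store.write(ocr_output)

print(first, second, len(store.docs))
```

One thing this makes visible: the trick only helps when the OCR text matches the text-layer text exactly after normalization. If OCR misreads even one character in a passage, that passage gets stored twice, so it's worth checking how much really gets deduplicated on your own files.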