Hey @stefanqxb! It really depends on your specific documents, what they look like, and so on. You can try tweaking the parameters of PDFToTextOCRConverter, but there is only so much that can be done. If the overhead is acceptable, you can pass the documents through both converters: duplicate text is not stored in the document store by default, so you can run the conversion with PDFToTextConverter first, store the docs, and then do the same with PDFToTextOCRConverter. If the OCR pass manages to extract new text from the images, that text will be stored, and all the duplicate text will be ignored.

Not the best workaround, but you can give it a try and see if it helps!
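
To make the idea concrete, here is a minimal, self-contained sketch of the two-pass approach. It does not call the real converters or document store: `ToyDocumentStore`, `text_layer_output`, and `ocr_output` are hypothetical stand-ins invented for illustration, and the duplicate check (hashing the exact text) is only an assumption about how a store might recognize repeated content.

```python
import hashlib


class ToyDocumentStore:
    """Hypothetical in-memory store that skips text it has already stored."""

    def __init__(self):
        self._docs = {}

    def write_documents(self, texts):
        """Store each text once; return how many new documents were added."""
        added = 0
        for text in texts:
            # Assumption: a document is identified by a hash of its exact text.
            doc_id = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if doc_id in self._docs:
                continue  # duplicate text, ignore it
            self._docs[doc_id] = text
            added += 1
        return added

    def all_texts(self):
        return list(self._docs.values())


# Stand-in for the output of the plain text-layer conversion (first pass).
text_layer_output = [
    "Introduction to the report.",
    "Quarterly results are summarized below.",
]

# Stand-in for the output of the OCR conversion (second pass): the same
# text, plus one passage that only exists inside an image.
ocr_output = [
    "Introduction to the report.",
    "Quarterly results are summarized below.",
    "Figure 1: revenue by region (text found only in an image).",
]

store = ToyDocumentStore()
added_first = store.write_documents(text_layer_output)
added_second = store.write_documents(ocr_output)

print(added_first, added_second, len(store.all_texts()))
```

In this sketch the second pass only adds the passage that the first pass missed. Note that the matching here is exact, so OCR output that differs from the text layer by even one character would be stored as a new document rather than ignored.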

Answer selected by stefanqxb