Detect Weird texts which is not properly extractable #2924
Unanswered
Soumadip-Saha asked this question in Looking for help
Replies: 1 comment
Your garbage example indeed is not made at all for text extraction.
text = page.get_text()
text.count(chr(0xfffd))
53
You could of course try to OCR the page in such cases, but I noticed the watermark across the page, so these results may be unpleasant as well. Example page:
tp = page.get_textpage_ocr(dpi=150, full=True)
print(page.get_text(textpage=tp))
Independent Final Report
Page1 of 66
Process Parameters of 4400 L-Scale IVIG
Manufacturing Process
TABLE OF CONTENTS
1.0
EXECUTIVE SUMMARY.........cccscessssssrsssssesesesssssesnsesesseeesnseseansetesnsnsessessanscaeacsettansneceeasessearanaeeeensel,
5.0
PROCESS FLOW DIAGRAM........c.sssssssssssssssssesssssessenenssessersserersceransnensasssensesransttenseseeseasesseenanseeeesseD)
6.0
CRITICALITY EVALUATION OF QUALITY ATTRIBUTEG........scssssssssssssrsesssesesessssseensesenseeterseeeeetsO
7.0
PROCESS DESCRIPTION. ......s.scsssssssssssssessseresssssessessessesserseererscersnssensasesenseeransettasseteentesesseereraceeeeTO |
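The `chr(0xfffd)` count shown in this reply could be turned into a per-page filter along the following lines. This is only a sketch: the helper names (`garbage_ratio`, `is_garbage`, `extractable_pages`) and the 5% threshold are assumptions made for illustration, not part of PyMuPDF, and the threshold would need tuning against your own documents.

```python
REPLACEMENT_CHAR = chr(0xFFFD)  # U+FFFD, the character counted in the reply above


def garbage_ratio(text: str) -> float:
    """Return the share of non-whitespace characters that are U+FFFD."""
    visible = [c for c in text if not c.isspace()]
    if not visible:
        return 0.0  # no text at all (e.g. an image-only page)
    return sum(c == REPLACEMENT_CHAR for c in visible) / len(visible)


def is_garbage(text: str, threshold: float = 0.05) -> bool:
    """Hypothetical rule: flag text whose U+FFFD share exceeds the threshold."""
    return garbage_ratio(text) > threshold


def extractable_pages(pdf_path: str, threshold: float = 0.05):
    """Yield (page_number, text) for pages that pass the check."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    with fitz.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text()
            if not is_garbage(text, threshold):
                yield page.number, text
```

A ratio is used instead of the raw count so that long and short pages are judged on the same scale. Note that a page with no extractable text at all gets a ratio of 0.0 and is therefore not flagged; if such pages also need to be skipped or sent to OCR, an additional check on empty text would be required.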
Hi @JorjMcKie. I am having trouble identifying text that cannot be extracted properly. Here is a sample PDF; I have thousands of similar PDFs. Some of them contain proper text that can be extracted, and some do not. I have tried to extract the text using this method
and the output was this
For other documents this process extracts the text correctly, and I would like to skip the pages that produce similar garbage output, based on some logic. Can you please help me with that? I have attached two sample PDFs: one garbage PDF and one proper PDF. I am a newbie here, so it would help me a lot. Thanks a lot in advance.