Issue with Text Extraction from PDFs Using spacypdfreader #12975
I'm currently working on a project that involves extracting text from PDF documents using the spacypdfreader library with spaCy. However, the extracted text appears jumbled, with some words out of order or incorrectly formatted.
code:
output:
actual pdf text: Distributor shall have no rights to the
Replies: 2 comments
Hi, spacypdfreader is a third-party package that's not maintained by us, so you might want to check out their repo / forums instead: https://github.com/SamEdwardes/spaCyPDFreader
I think this is a general problem for PDF processing, since the text isn't necessarily stored in reading order in the underlying PDF. Possibly using a different PDF parser will improve the results?
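To illustrate the reading-order point, here is a minimal, library-free sketch. It assumes a parser has already produced text fragments with page coordinates; the `Fragment` type, the `reading_order` helper, and the coordinates are all hypothetical and are not part of spacypdfreader or spaCy. It only shows why joining fragments in stored order can jumble the text, and how sorting by position can recover a simple single-column line.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """A piece of text with a position on the page (hypothetical structure)."""
    text: str
    x: float  # left edge of the fragment
    y: float  # baseline height, measured from the bottom of the page


def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into lines by baseline, then read top-to-bottom, left-to-right."""
    lines = []  # each entry: [baseline of first fragment on the line, [fragments]]
    # Larger y means higher on the page, so visit fragments from the top down.
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= line_tolerance:
            lines[-1][1].append(frag)  # close enough to the current line's baseline
        else:
            lines.append([frag.y, [frag]])  # start a new line
    return "\n".join(
        " ".join(f.text for f in sorted(frags, key=lambda f: f.x))
        for _, frags in lines
    )


# Made-up fragments, listed in an order a PDF might plausibly store them.
stored = [
    Fragment("no rights", 120.0, 700.0),
    Fragment("Distributor", 20.0, 700.5),
    Fragment("to the", 180.0, 699.8),
    Fragment("shall have", 70.0, 700.0),
]

naive_text = " ".join(f.text for f in stored)  # stored order, words jumbled
ordered_text = reading_order(stored)           # position-sorted order

print(naive_text)
print(ordered_text)
```

This heuristic only handles simple single-column layouts; multi-column pages, tables, and rotated text need real layout analysis, which is where trying a different PDF parser may help.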
Hi,
Thanks for the reply, I will certainly try using another parser.
regards,
Ishita
(Sent by email in reply to Adriane Boyd's message of Wednesday, September 13, 2023, 11:50 AM.)