Issue with Text Extraction from PDFs Using spacypdfreader #12975

IshitaChauhanShortHillsAI · 2023-09-12T10:01:54Z

I'm working on a project that extracts text from PDF documents using the spacypdfreader library with spaCy. However, the extracted text comes out jumbled, with some words out of order or incorrectly formatted.

code:

import spacy
from spacypdfreader.spacypdfreader import pdf_reader
import json
import re

nlp = spacy.load("en_core_web_sm")
doc = pdf_reader("data/Informatica LLC v. ACIT.pdf", nlp)

# Get all of the text from a specific PDF page (here, page 6).
target_page_number = 6
page_text = " ".join(
    token.text for token in doc if token._.page_number == target_page_number
)

# Split the raw page text into paragraphs first, using ". \n" and "\n\n" as
# delimiters. Collapsing whitespace before this step would remove the newlines
# that the split depends on.
paragraphs = re.split(r"\. \n|\n\n", page_text)

print(doc, "\n", paragraphs)

# Collapse runs of whitespace inside each paragraph and store it with its index
formatted_paragraphs = []
for i, paragraph in enumerate(paragraphs, start=1):
    formatted_paragraph = re.sub(r"\s+", " ", paragraph).strip()
    formatted_paragraphs.append({"index": i, "content": formatted_paragraph})

# Save the paragraphs as a JSON file
output_json_file = f"new_formatted_paragraphs_{target_page_number}.json"
with open(output_json_file, "w", encoding="utf-8") as json_file:
    json.dump(formatted_paragraphs, json_file, ensure_ascii=False, indent=4)

print(f"Saved {len(formatted_paragraphs)} paragraphs to {output_json_file}")

output:
["... Distributor shall have no rights to the \n software other than the rights expressly set forth in the \n agreement . Distributor shall not modify or copy any part of the \n software or documentation . Distributor may not use sub- \n the software and \n distributors \n documentation without the prior consent of actuate . What is \n charged is the licence fee to be paid by the distributor of the \n software as enumerated in the agreement . Further , clause 6.01 of \n the agreement dealing with title states that the distributor \n acknowledges that actuate and its suppliers retain all rights , title ",
" further distribution of ",
" for ",
"\x0c",
]

actual pdf text:

Distributor shall have no rights to the
software other than the rights expressly set forth in the
agreement. Distributor shall not modify or copy any part of the
software or documentation. Distributor may not use sub-
distributors for further distribution of the software and
documentation without the prior consent of actuate. What is
charged is the licence fee to be paid by the distributor of the
software as enumerated in the agreement. Further, clause 6.01 of
the agreement dealing with title states that the distributor
acknowledges that actuate and its suppliers retain all rights, title

Answered by adrianeboyd

adrianeboyd · 2023-09-13T06:19:52Z

Hi, spacypdfreader is a third-party package that's not maintained by us, so you might want to check out their repo / forums instead: https://github.com/SamEdwardes/spaCyPDFreader

I think this is a general problem for PDF processing, since the text isn't necessarily stored in reading order in the underlying PDF. Possibly using a different PDF parser will improve the results?
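
To illustrate the point about reading order, here is a minimal sketch of how fragments that are stored out of order can be rearranged once their positions on the page are known. Everything in it is an assumption made for illustration: the `Fragment` class, the `reading_order` function, and the coordinates are hypothetical and do not come from spacypdfreader or any particular parser. It also assumes a single-column page where `y` grows downward; real parsers expose positions in their own ways.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text with the position of its top-left corner (hypothetical)."""
    text: str
    x: float  # distance from the left edge of the page
    y: float  # distance from the top edge (assumed: larger y is lower on the page)


def reading_order(fragments: list[Fragment], line_tolerance: float = 2.0) -> list[str]:
    """Group fragments into lines by height, then order each line left to right."""
    lines: list[list[Fragment]] = []
    for frag in sorted(fragments, key=lambda f: f.y):
        # Keep adding to the current line while the fragment sits at (almost)
        # the same height as the first fragment of that line.
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return [" ".join(f.text for f in sorted(line, key=lambda f: f.x)) for line in lines]


# Fragments in storage order, mimicking the jumbled output in the question.
# The coordinates are invented for this example.
stored = [
    Fragment("software or documentation. Distributor may not use sub-", x=72, y=100),
    Fragment("the software and", x=330, y=114),
    Fragment("documentation without the prior consent of actuate. What is", x=72, y=128),
    Fragment("distributors", x=72, y=114),
    Fragment("further distribution of", x=190, y=114),
    Fragment("for", x=160, y=114),
]

for line in reading_order(stored):
    print(line)
```

With these made-up positions, the second printed line reads "distributors for further distribution of the software and", which matches the PDF text quoted in the question. A parser that performs some form of layout analysis does this kind of work internally, which is why switching parsers can change the result.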

IshitaChauhanShortHillsAI · 2023-09-13T06:27:26Z

Author

Hi, thanks for the reply. I will certainly try using another parser. Regards, Ishita
