I need to find and match two or three keywords in a PDF page and extract that page from the PDF document #2267
Unanswered
sucanthudu asked this question in Q&A
Replies: 2 comments 1 reply
-
This is a "Discussions" item, not an issue, so let us first convert it.
0 replies
-
I am not sure I completely understand your desired page selection logic.

import fitz

kw_list = ['Profit', 'Loss', 'Income', 'Expense', 'Savings']
doc = fitz.open("input.pdf")  # input PDF
out = fitz.open()  # output PDF with selected pages
for page in doc:  # iterate through input pages
    text = page.get_text()  # extract plain page text
    found = False  # switch indicating that any of the keywords is on the page
    for kw in kw_list:
        if kw in text:  # kw exists in the text
            found = True
    if found:
        out.insert_pdf(doc, from_page=page.number, to_page=page.number)
# if the output PDF contains any page, save it
if out.page_count > 0:
    out.save("found.pdf")
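
The snippet in this reply keeps a page as soon as any single keyword occurs on it, whereas the question asks for pages on which every keyword of a given set occurs together (for example both 'Profit' and 'Loss'). A minimal sketch of that stricter rule follows. The names `matching_groups`, `extract_matching_pages` and `keyword_groups`, as well as the file names, are hypothetical and not part of PyMuPDF. Matching is assumed to be a plain, case-sensitive substring test on the extracted page text.

```python
def matching_groups(text, keyword_groups):
    """Return the keyword groups whose keywords ALL occur in the text."""
    return [group for group in keyword_groups
            if all(kw in text for kw in group)]


def extract_matching_pages(src_path, dst_path, keyword_groups):
    """Copy every page that satisfies at least one keyword group.

    Returns the number of pages copied to the output PDF.
    """
    import fitz  # PyMuPDF; imported here so matching_groups works without it

    doc = fitz.open(src_path)   # input PDF
    out = fitz.open()           # empty output PDF
    for page in doc:
        if matching_groups(page.get_text(), keyword_groups):
            out.insert_pdf(doc, from_page=page.number, to_page=page.number)
    copied = out.page_count
    if copied > 0:              # an empty PDF cannot be saved
        out.save(dst_path)
    out.close()
    doc.close()
    return copied


# Hypothetical keyword sets taken from the question.
keyword_groups = [("Profit", "Loss"), ("Income", "Expense", "Savings")]
```

Calling `extract_matching_pages("input.pdf", "found.pdf", keyword_groups)` would then copy only the pages that contain a complete set. Keeping the matching rule in its own function makes it easy to swap in a case-insensitive or whole-word comparison if that is closer to what is needed.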
1 reply
-
I need to find and match two or three keywords in a PDF page and extract that page from the PDF document. Please kindly help me solve this.
keyword_list = ['Profit', 'Loss', 'Income', 'Expense', 'Savings']
For example: PDF page 1 will contain Profit and Loss. Using these two keywords, 'Profit' and 'Loss', I need to extract page 1.
PDF page 2 will contain Income, Expense and Savings. Using all three keywords, 'Income', 'Expense' and 'Savings', I need to extract page 2.
Like this, I have a bag-of-words pattern for each page, and I need to extract pages based on that pattern. Please help me solve this.
Please suggest.
pdf_document = fitz.open(pdf_file_path)
keyword_list_set = 'Profit' and 'Loss', 'Income' and 'Expense' and 'Savings'
pages = []
for this_page in range(len(pdf_document)):
    page = pdf_document.loadPage(this_page)
    if page.searchFor(keyword_list_set):
        pages.append(this_page)
pdf = PdfFileReader(pdf_file_path)
pdfWriter = PdfFileWriter()
for page_num in pages:
    pdfWriter.addPage(pdf.getPage(page_num))
with open(pdf_output_path, 'wb') as f:
    pdfWriter.write(f)
    f.close()