Generating new PDF from another PDF with text only #1527
-
Hello, I'm looking to extract only the text from a PDF and save it as a new PDF. I found some relevant methods in the package. Here's a sample working code snippet:

import fitz
doc = fitz.open("pdfs/RPHT24.pdf")
page = doc[0]
textBlocks = page.get_text_blocks(0)
# open output PDF
newdoc = fitz.open()
# output page with same dimensions as input
newpage = newdoc.new_page(width=page.rect.width, height=page.rect.height)
blue = (0, 0, 1)
for textBlock in textBlocks:
    # block tuple layout: (x0, y0, x1, y1, text, ...)
    r = fitz.Rect(textBlock[0], textBlock[1], textBlock[2], textBlock[3])
    newpage.insert_text(fitz.Point(textBlock[0], textBlock[1]), textBlock[4], color=blue)
newdoc.save("x.pdf")

Are there any other code samples that help render the text with full formatting and better positioning? Thanks!
-
Sure! To see some code that actively uses this together with "TextWriter", look at this script. It is part of the font replacement utilities.
-
import fitz
doc = fitz.open("pdfs/RPHT24.pdf")
page = doc[0]
# open output PDF
newDocument = fitz.open()
# output page with same dimensions as input
newPage = newDocument.new_page(width=page.rect.width, height=page.rect.height)
font = fitz.Font("tibo")
tw = fitz.TextWriter(page.rect, color=(1, 0, 0))
# Get text blocks.
blocks = page.get_text("dict")["blocks"]
for textBlock in blocks:
    if "lines" not in textBlock:
        continue
    lines = textBlock["lines"]
    for line in lines:
        for span in line["spans"]:
            textBox = span['bbox']
            text = span['text']
            # tw.append(span["origin"], text, font=font, fontsize=span["size"])
            tw.fill_textbox(  # fill in above text
                textBox,  # keep text inside this
                text,  # the text
                align=fitz.TEXT_ALIGN_LEFT,  # alignment
                warn=True,  # keep going if too much text
                fontsize=span["size"],
                font=font,
            )
            outcolor = fitz.sRGB_to_pdf(span["color"])  # recover (r,g,b)
tw.write_text(page, color=outcolor)
newDocument.save("output/textBoundaries.pdf")

The purpose of this script is to generate a new PDF with only the text. I built the above script based on the samples and examples you created in the PyMuPDF utility libraries. However, for some reason it is not working as expected. I tried each of the two code chunks below for writing the text into the PDF; neither worked.

tw.append(span["origin"], text, font=font, fontsize=span["size"])

tw.fill_textbox(  # fill in above text
    textBox,  # keep text inside this
    text,  # the text
    align=fitz.TEXT_ALIGN_LEFT,  # alignment
    warn=True,  # keep going if too much text
    fontsize=span["size"],
    font=font,
)

I have yet to find anything in the documentation. @JorjMcKie, can you help pinpoint what could be wrong here?
-
Looking at your code snippet: you want to write on a new, separate page, right?
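The snippet above creates newPage but then calls tw.write_text(page, ...), so the text lands on the source page and the saved document stays blank. A minimal sketch of writing to the new page instead is below. It is not the font replacement script itself: the helper names group_spans_by_color and copy_text_only are hypothetical, a single replacement font ("helv") is assumed for all text, and one TextWriter is used per color because the color is set at write_text time.

```python
from collections import defaultdict


def group_spans_by_color(page_dict):
    """Map each sRGB color integer to the (origin, text, size) spans using it.

    page_dict is the structure returned by page.get_text("dict").
    """
    groups = defaultdict(list)
    for block in page_dict["blocks"]:
        # Image blocks carry no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            for span in line["spans"]:
                groups[span["color"]].append(
                    (span["origin"], span["text"], span["size"])
                )
    return dict(groups)


def copy_text_only(src_path, dst_path, fontname="helv"):
    """Hypothetical helper: rebuild every page of src_path with text only."""
    import fitz  # PyMuPDF, imported lazily so the helper above stays dependency-free

    font = fitz.Font(fontname)  # assumption: one font replaces all source fonts
    src = fitz.open(src_path)
    out = fitz.open()
    for page in src:
        new_page = out.new_page(width=page.rect.width, height=page.rect.height)
        groups = group_spans_by_color(page.get_text("dict"))
        for srgb, spans in groups.items():
            tw = fitz.TextWriter(new_page.rect)
            for origin, text, size in spans:
                # span["origin"] is the baseline start point of the span
                tw.append(origin, text, font=font, fontsize=size)
            # Write to the NEW page, not the source page.
            tw.write_text(new_page, color=fitz.sRGB_to_pdf(srgb))
    out.save(dst_path)
```

Because the replacement font has different glyph widths than the original fonts, spans may still overlap or leave gaps; this sketch only addresses which page receives the text.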
-
Just looked again at FontForge: you can use that for font conversion, too.
-
I have reviewed again why spans are misplaced on some occasions but not on others, and found another small wrinkle that is causing this.
-
PyMuPDF-1.19.5-cp38-cp38-win_amd64.zip
-
You can now do your tests on Windows or Linux.
-
Have you checked whether this error message is actually true, e.g. via some other ZIP program? 7zip?
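If no other ZIP program is at hand, Python's standard zipfile module can run a similar integrity check. This is only a sketch: check_zip is a hypothetical helper name, and a passing CRC check says the archive is readable, not that the wheel inside will install.

```python
import zipfile


def check_zip(path):
    """Return (ok, detail) after testing the archive at path."""
    if not zipfile.is_zipfile(path):
        return False, "not a ZIP archive"
    try:
        with zipfile.ZipFile(path) as zf:
            # testzip() reads every member and returns the first bad name, or None
            bad = zf.testzip()
    except zipfile.BadZipFile as exc:
        return False, str(exc)
    if bad is not None:
        return False, f"corrupt member: {bad}"
    return True, "all members passed the CRC check"
```

Calling check_zip on the downloaded file returns a flag plus a short reason, which can be compared with the error message in question.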
-
I have reviewed again why spans are misplaced on some occasions but not on others, and found another small wrinkle that is causing this.
After implementing the change, your two example files PHT23.pdf / PHT22.pdf are now both processed correctly just using the "dict" option.
I will create a set of pre-wheels and will let you know when they are done.