Generating new PDF from another PDF with text only #1527

qwertynik · 2022-01-11T14:23:17Z

qwertynik
Jan 11, 2022

Hello,

Looking to extract only the text from a PDF and save it as a new PDF. Found some relevant methods in the package, e.g. insert_text and TextWriter, but was unable to use them.

Here's a sample working code snippet:

import fitz

doc = fitz.open("pdfs/RPHT24.pdf")
page = doc[0]
textBlocks = page.get_text_blocks(0)

# open output PDF
newdoc = fitz.open()
# output page with same dimensions as input
newpage = newdoc.new_page(width=page.rect.width, height=page.rect.height)

blue = (0, 0, 1)

for textBlock in textBlocks:
    r = fitz.Rect(textBlock[0], textBlock[1], textBlock[2], textBlock[3])
    newpage.insert_text(fitz.Point(textBlock[0], textBlock[1]), textBlock[4], color=blue)

newdoc.save("x.pdf")

Are there any other code samples that help in rendering the text with full formatting and better positioning?

Thanks!

Answered by JorjMcKie

Jan 20, 2022

I have reviewed again why spans are misplaced in some occasions, but not in others, and found another small wrinkle that is causing this.
After implementing the change, your two example files PHT23.pdf / PHT22.pdf are now both processed correctly just using the "dict" option.
I will create a set of pre-wheels and will let you know when they are done.

JorjMcKie · 2022-01-11T15:28:26Z

JorjMcKie
Jan 11, 2022
Maintainer

Sure!
Do not use page.get_text("blocks") - this is a high-speed method not targeted for your purpose.
Instead use page.get_text("dict"). This is a dictionary of nested dictionaries containing full info for each text piece (called a "span") including fontname, size, color, position and text orientation.
Study the dict structure here.

To see some code that actively uses this together with "TextWriter", look at this script. It is part of the font replacement utilities.
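
As a rough illustration of the nesting described above (blocks contain lines, lines contain spans), the following sketch flattens such a dictionary into a sequence of spans. It only assumes a dictionary shaped like the output of page.get_text("dict"); the helper name iter_spans and the hand-made sample data are hypothetical and exist purely for illustration.

```python
def iter_spans(page_dict):
    """Yield every text span found in a get_text("dict")-style dictionary."""
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they contribute nothing.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                yield span


# Hand-made sample that mimics the documented layout (not real extraction output).
sample = {
    "blocks": [
        {"type": 1, "bbox": (0, 0, 10, 10)},  # image block, no "lines"
        {
            "type": 0,
            "lines": [
                {
                    "dir": (1.0, 0.0),
                    "spans": [
                        {"text": "Hello", "font": "Helvetica", "size": 11.0,
                         "color": 0, "origin": (72.0, 100.0)},
                        {"text": "world", "font": "Helvetica-Bold", "size": 11.0,
                         "color": 255, "origin": (105.0, 100.0)},
                    ],
                }
            ],
        },
    ]
}

for span in iter_spans(sample):
    print(span["font"], span["size"], span["origin"], repr(span["text"]))
```

With a real page, the same loop would presumably be fed page.get_text("dict") instead of the sample.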

3 replies

qwertynik Jan 11, 2022
Author

Thanks for your prompt response @JorjMcKie

Had earlier come across page.get_text("dict"), however did not use it due to its apparently 'complex' structure. Looked at this script and also attempted to use it to see the output once. But font-related info for replacement needs to be provided.

Meanwhile, it would be helpful if you could point to a relatively simpler script.

JorjMcKie Jan 11, 2022
Maintainer

I think your intention by its very nature implies the better part of the perceived complexity.
If you want to regenerate the original's text on some other PDF page you have to look at all those things:

make available all the original fonts to TextWriter
insert text at correct positions, using correct size, orientation and color
if some original font is not supported by TextWriter (and this does happen!) use some appropriate fallbacks
mixing different text colors requires using different TextWriter objects
similar is true if text is not horizontal
etc., etc.

So I believe you must walk through all that mud if you want to do this kind of thing.
Try to adapt that script by assuming that every font is being replaced by itself or similar.
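
One item from the list above, "mixing different text colors requires using different TextWriter objects", can be sketched without the library: group the extracted spans by color first, then create one writer per group. The helper names srgb_to_rgb and group_spans_by_color are hypothetical; srgb_to_rgb is a plain-Python stand-in for the conversion that fitz.sRGB_to_pdf performs, and the sample spans are made up.

```python
from collections import defaultdict


def srgb_to_rgb(srgb):
    """Split an integer sRGB value (0xRRGGBB) into (r, g, b) floats in 0..1."""
    r = (srgb >> 16) & 0xFF
    g = (srgb >> 8) & 0xFF
    b = srgb & 0xFF
    return (r / 255.0, g / 255.0, b / 255.0)


def group_spans_by_color(spans):
    """Map each distinct color to the list of spans that use it."""
    groups = defaultdict(list)
    for span in spans:
        groups[srgb_to_rgb(span["color"])].append(span)
    return dict(groups)


# Made-up spans carrying only the keys needed for this illustration.
spans = [
    {"text": "black one", "color": 0x000000},
    {"text": "blue", "color": 0x0000FF},
    {"text": "black two", "color": 0x000000},
]
groups = group_spans_by_color(spans)
for color, members in groups.items():
    # In a real script, one TextWriter would be created per color here.
    print(color, [s["text"] for s in members])
```

The same grouping key could be extended with the line direction if non-horizontal text also needs its own writer.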

qwertynik Jan 12, 2022
Author

Certainly @JorjMcKie, there is a lot that needs to be done to ensure that the text is rendered correctly.
Will go through the example.

So I believe you must walk through all that mud if you want to do this kind of thing.

It is most likely this mud which will turn out to be gold😀

Try to adapt that script by assuming that every font is being replaced by itself or similar.

Yes, had planned to do so.

qwertynik · 2022-01-17T14:40:26Z

qwertynik
Jan 17, 2022
Author

import fitz

doc = fitz.open("pdfs/RPHT24.pdf")
page = doc[0]

# open output PDF
newDocument = fitz.open()
# output page with same dimensions as input
newPage = newDocument.new_page(width=page.rect.width, height=page.rect.height)

font = fitz.Font("tibo") 
tw = fitz.TextWriter(page.rect, color=(1, 0, 0))

# Get text blocks.
blocks = page.get_text("dict")["blocks"]
for textBlock in blocks:
    if "lines" not in textBlock:
        continue

    lines = textBlock["lines"]

    for line in lines:
        for span in line["spans"]:
            textBox = span['bbox']
            text = span['text']

            # tw.append(span["origin"], text, font=font, fontsize=span["size"])
            tw.fill_textbox(  # fill in above text
                textBox,  # keep text inside this
                text,  # the text
                align=fitz.TEXT_ALIGN_LEFT,  # alignment
                warn=True,  # keep going if too much text
                fontsize=span["size"],
                font=font,
            )
            outcolor = fitz.sRGB_to_pdf(span["color"])  # recover (r,g,b)
            tw.write_text(page, color=outcolor)

newDocument.save("output/textBoundaries.pdf")

The purpose of this script is to generate a new PDF with only the text. I built the above script based on the samples and examples you created in the PyMuPDF utility libraries. However, for some reason, it is not working as expected.

Used the two code chunks below for writing the text to the PDF; neither worked.

            tw.append(span["origin"], text, font=font, fontsize=span["size"])

            tw.fill_textbox(  # fill in above text
                textBox,  # keep text inside this
                text,  # the text
                align=fitz.TEXT_ALIGN_LEFT,  # alignment
                warn=True,  # keep going if too much text
                fontsize=span["size"],
                font=font,
            )

Yet to find anything in the documentation. @JorjMcKie can you point and help in resolving what could be wrong here?

5 replies

JorjMcKie Jan 17, 2022
Maintainer

you did not mention what exactly went wrong

JorjMcKie Jan 17, 2022
Maintainer

another point:
if you want to replicate the input's text on some other page, using the textbox is not the best idea, because you cannot (easily) change font, fontsize, color, etc.
I would - in that situation - use the normal append method. That's challenge enough, because you need a separate TextWriter at least for each color, and potentially also for different writing angles and what not.
But if you use get_text("dict") as input, you will find all the information needed to be successful.

qwertynik Jan 18, 2022
Author

@JorjMcKie

Sorry for the incomplete information. The issue is that no text is rendered in the PDF. When using fill_textbox (which, as you say, is not the best idea for this case), the warning "Warning: Only fitting 0 of 1 lines." is emitted. Anyway, I wouldn't use this method now.

Any ideas on how to get the append method working?

JorjMcKie Jan 18, 2022
Maintainer

The warning means what it actually says: given the rectangle, the fontsize, the amount of text, ... then only x lines will fit where y lines have been generated that should be written. So enlarge the rectangle, decrease the fontsize, reduce the text amount, etc.
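
The relationship described above can be made concrete with a very rough model. This is not PyMuPDF's actual fitting algorithm: the function lines_that_fit is hypothetical and the line-height multiplier of 1.2 is an assumption. It only shows why a rectangle that is barely taller than the font size, such as a span's own bbox, may end up holding zero lines.

```python
import math


def lines_that_fit(rect_height, fontsize, line_factor=1.2):
    """Estimate how many text lines fit into a rectangle of the given height.

    line_factor is an assumed line-height multiplier, not a PyMuPDF constant.
    """
    line_height = fontsize * line_factor
    return math.floor(rect_height / line_height)


# A box only slightly taller than the font size versus a roomier one.
tight = lines_that_fit(12.0, 11.0)
roomy = lines_that_fit(40.0, 11.0)
print(tight, roomy)
```

Under this model, enlarging the rectangle or reducing the font size are the two levers that raise the count, which matches the advice given in the reply.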

Let me ask a silly question: if no text is rendered ... you did not forget to execute TextWriter.write_text(...) in the end, did you?

qwertynik Jan 18, 2022
Author

The warning means what it actually says: given the rectangle, the fontsize, the amount of text

Yes, had understood the warning. However, wanted to see how append can be used.

TextWriter.write_text(...) - No, I did execute this. It is being executed for every span @JorjMcKie.

JorjMcKie · 2022-01-18T07:23:42Z

JorjMcKie
Jan 18, 2022
Maintainer

when looking at your code snippet: you want to write on a new, separate page, right?
then your tw.write_text() should go to the newPage or shouldn't it?

22 replies

qwertynik Jan 19, 2022
Author

Our discussion about character sequence for right-to-left spans was irrelevant: all spans simply contains their characters, and that's it. How they must be read is outside this.

Hmm, ok. Good that we know of it now.

Then I forgot to mention that PyMuPDF has the global option that causes subset fontnames to be returned. Which allows extraction of respective fontbuffers

Attempted running the script on one of the PDFs; however, the rendering of the text is broken.

Fonts extracted from the PDF: ['TXPGHF+Arial-BoldMT', 'TXPGHF+ArialMT']

This works often times - but not always, e.g. not for Type 3 fonts, for which the whole approach with TextWriter won't work at all

Type 3 fonts are anyway severely outdated from what I remember.

The rewriter script and the script here both work as expected.

qwertynik Jan 19, 2022
Author

@JorjMcKie Further experimented with text rendering styles and found some instances where the rendering is off.

Input PDF: PHT23.pdf
Output PDF: x.pdf

While the usage of such a rendering style would be rare, there are chances of such occurrences - probably MuPDF does not support this.

qwertynik Jan 19, 2022
Author

@JorjMcKie

Regarding Bounding Boxes:

The bounding boxes for spans parsed when using rawdict are different from those parsed when using dict for the Arabic text. Not sure why - either it's a bug, or incorrect expectations.

When using rawdict

When using dict (broken text rendering can be ignored here - will discuss this in another comment):

Script used:

import fitz

fileName = "RPHT24"
fileName = "PHT22"
fileName = "PHT23"
doc = fitz.open("pdfs/%s.pdf" % fileName)
page = doc[0]
ndoc = fitz.open()
npage = ndoc.new_page(width=page.rect.width, height=page.rect.height)
extra_flags = fitz.TEXT_PRESERVE_LIGATURES | fitz.TEXT_PRESERVE_WHITESPACE
blocks = page.get_text("rawdict")["blocks"]
helv = fitz.Font("helv")
arial = fitz.Font(fontfile="C:/Windows/Fonts/arial.ttf")
for b in blocks:
    if "lines" not in b:
        continue

    for l in b["lines"]:
        cos, sin = l["dir"]
        matrix = fitz.Matrix(cos, -sin, sin, cos, 0, 0)
        for s in l["spans"]:
            textBox = s['bbox']
            fsize = s["size"]
            fname = s["font"]
            for c in s["chars"]:
                if fname.lower().startswith("arial"):
                    font = arial
                else:
                    font = helv
                ch = c["c"]
                origin = fitz.Point(c["origin"])
                tw = fitz.TextWriter(page.rect)
                tw.append(origin, ch, font=font, fontsize=fsize)
                tw.write_text(npage, morph=(origin, matrix))

        shape = npage.new_shape()
        shape.draw_rect(textBox)
        shape.finish(
            fill=None,  # fill color
            color=(0, 0, 1),  # line color
        )
        shape.commit()

npage.clean_contents()  # recommended
ndoc.subset_fonts()  # recommended
ndoc.ez_save("x.pdf", garbage=4)  # garbage=4 recommended!!

JorjMcKie Jan 19, 2022
Maintainer

Those are good examples! Helped me to identify a logic error in creating the matrix for the TextWriter.
My incorrect assumption was that every text orientation can be expressed by a rotation - and this did not cover the flipping contained in PHT23.pdf.
This is now taken care of here:
rewriter.zip
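
Why a flip cannot be expressed by a rotation can be checked with plain arithmetic: the matrix built from a line's (cos, sin) direction, as in the script posted earlier, always has determinant +1, whereas a mirroring has determinant -1. The sketch below is only an illustration of that fact and is not the code contained in rewriter.zip; the helper names are hypothetical.

```python
import math


def determinant(a, b, c, d):
    """Determinant of the 2x2 linear part of a matrix (a, b, c, d, e, f)."""
    return a * d - b * c


def rotation_from_dir(cos, sin):
    """Linear part built from a line direction, as Matrix(cos, -sin, sin, cos, 0, 0)."""
    return (cos, -sin, sin, cos)


# A pure rotation by 30 degrees: determinant is cos^2 + sin^2 = 1.
angle = math.radians(30)
rot = rotation_from_dir(math.cos(angle), math.sin(angle))

# A horizontal mirror (x -> -x): determinant is -1, so no rotation can produce it.
flip = (-1.0, 0.0, 0.0, 1.0)

print(round(determinant(*rot), 6), determinant(*flip))
```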

qwertynik Jan 20, 2022
Author

Tested this @JorjMcKie. So the script now supports text rendering for all cases. Of course, in cases where font buffers do not work, the font size can be computed using the resize function in this script.

JorjMcKie · 2022-01-18T07:26:16Z

JorjMcKie
Jan 18, 2022
Maintainer

just looked again at FontForge: you can use that for font conversion, too.
So take a TTF / OTF font and output it as CID, etc.

1 reply

qwertynik Jan 19, 2022
Author

@JorjMcKie The CID related font error when processing the PDF using pdf2htmlex was resolved when directly loading the TTF font from the file system - reappears when using the fonts directly from the library indexed in the Base14_fontnames at fitz/fitz.py:1900.

This is not directly related to PyMuPDF - posting here in case it can point to something in the library that could be improved.

JorjMcKie · 2022-01-20T11:14:07Z

JorjMcKie
Jan 20, 2022
Maintainer

I have reviewed again why spans are misplaced in some occasions, but not in others, and found another small wrinkle that is causing this.
After implementing the change, your two example files PHT23.pdf / PHT22.pdf are now both processed correctly just using the "dict" option.
I will create a set of pre-wheels and will let you know when they are done.

9 replies

qwertynik Jan 21, 2022
Author

Ran this script on 1.19.5 on PHT23.pdf @JorjMcKie. The script looks cleaner.

Some text is rendered incorrectly. On inspecting, realized that the usage of origin could be the source of this issue.

JorjMcKie Jan 21, 2022
Maintainer

oops - sorry gave you an old py38 wheel:
PyMuPDF-1.19.5-cp38-cp38-win_amd64.zip
Have to regenerate the Linux wheel also - wait

qwertynik Jan 21, 2022
Author

No problem. Will use this in some time and post feedback here ✏️

qwertynik Jan 21, 2022
Author

Tried with the above wheel - similar issue persists. Checking if it is due to a local issue.

qwertynik Jan 21, 2022
Author

Worked as expected. Had to restart the PyCharm IDE.

JorjMcKie · 2022-01-21T08:21:43Z

JorjMcKie
Jan 21, 2022
Maintainer

PyMuPDF-1.19.5-cp38-cp38-win_amd64.zip
Windows: First unzip, then do py -3.8 -m pip install --force-reinstall PyMuPDF-1.19.5-cp38-cp38-win_amd64.whl.
Linux and Mac OSX pre-wheels can be found here for download. However, there is a problem with your Python version: 3.6 dropped out of support last December. Wheels for unsupported versions are normally no longer created, but for this time, I can do it once more - needs an hour or so.

0 replies

JorjMcKie · 2022-01-21T11:02:47Z

JorjMcKie
Jan 21, 2022
Maintainer

You can now do your tests on Windows or Linux

5 replies

qwertynik Jan 21, 2022
Author

Is this the linux wheel for Python 3.6?

JorjMcKie Jan 21, 2022
Maintainer

No, you seem to have Python 3.6 in the Linux of your WSL. If necessary, confirm via wsl -- python3 -V.
If so, choose PyMuPDF-1.19.5-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

JorjMcKie Jan 21, 2022
Maintainer

Also, do consider upgrading your Linux Python: as I mentioned, version 3.6 has been out of maintenance since December 2021.
This means it will receive no more maintenance updates, and an increasing number of Python packages will drop support for it.

Like PyMuPDF: I will not publish a 3.6 version for my v1.19.5 on PyPI (planned for end of January). Anyone needing this version for 3.6 would need to build it from sources themselves.
I will also drop any restrictions on using source language features unsupported in versions prior to 3.7. When this happens, however, I will announce it explicitly - there currently are no plans in that direction. Candidates for things in that area are the rapidly changing typing module: annotation expressions like List[Tuple] are now possible, etc.

qwertynik Jan 21, 2022
Author

Sure, this makes sense. No matter how avoidable an update appears, it is always best to do it at the earliest.

qwertynik Jan 28, 2022
Author

@JorjMcKie The setup on Windows worked seamlessly. However, on Linux the following error is thrown:

ERROR: Exception:
Traceback (most recent call last):
  File "/home/nik/.local/lib/python3.6/site-packages/pip/_internal/cli/base_command.py", line 164, in exc_logging_wrapper
    status = run_func(*args)
  File "/home/nik/.local/lib/python3.6/site-packages/pip/_internal/cli/req_command.py", line 205, in wrapper
    return func(self, options, args)
  File "/home/nik/.local/lib/python3.6/site-packages/pip/_internal/commands/install.py", line 339, in run
    reqs, check_supported_wheels=not options.target_dir
  File "/home/nik/.local/lib/python3.6/site-packages/pip/_internal/resolution/resolvelib/resolver.py", line 73, in resolve
    collected = self.factory.collect_root_requirements(root_reqs)
  File "/home/nik/.local/lib/python3.6/site-packages/pip/_internal/resolution/resolvelib/factory.py", line 470, in collect_root_requirements
    requested_extras=(),
  File "/home/nik/.local/lib/python3.6/site-packages/pip/_internal/resolution/resolvelib/factory.py", line 435, in _make_requirement_from_install_req
    version=None,
  File "/home/nik/.local/lib/python3.6/site-packages/pip/_internal/resolution/resolvelib/factory.py", line 206, in _make_candidate_from_link
    version=version,
  File "/home/nik/.local/lib/python3.6/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 287, in __init__
    version=version,
  File "/home/nik/.local/lib/python3.6/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 156, in __init__
    self.dist = self._prepare()
  File "/home/nik/.local/lib/python3.6/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 225, in _prepare
    dist = self._prepare_distribution()
  File "/home/nik/.local/lib/python3.6/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 292, in _prepare_distribution
    return preparer.prepare_linked_requirement(self._ireq, parallel_builds=True)
  File "/home/nik/.local/lib/python3.6/site-packages/pip/_internal/operations/prepare.py", line 482, in prepare_linked_requirement
    return self._prepare_linked_requirement(req, parallel_builds)
  File "/home/nik/.local/lib/python3.6/site-packages/pip/_internal/operations/prepare.py", line 550, in _prepare_linked_requirement
    self.build_isolation,
  File "/home/nik/.local/lib/python3.6/site-packages/pip/_internal/operations/prepare.py", line 59, in _get_prepared_distribution
    return abstract_dist.get_metadata_distribution()
  File "/home/nik/.local/lib/python3.6/site-packages/pip/_internal/distributions/wheel.py", line 26, in get_metadata_distribution
    return get_wheel_distribution(wheel, canonicalize_name(self.req.name))
  File "/home/nik/.local/lib/python3.6/site-packages/pip/_internal/metadata/__init__.py", line 51, in get_wheel_distribution
    return Distribution.from_wheel(wheel, canonical_name)
  File "/home/nik/.local/lib/python3.6/site-packages/pip/_internal/metadata/pkg_resources.py", line 37, in from_wheel
    with wheel.as_zipfile() as zf:
  File "/home/nik/.local/lib/python3.6/site-packages/pip/_internal/metadata/base.py", line 321, in as_zipfile
    return zipfile.ZipFile(self.location, allowZip64=True)
  File "/usr/lib/python3.6/zipfile.py", line 1131, in __init__
    self._RealGetContents()
  File "/usr/lib/python3.6/zipfile.py", line 1198, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

Looks like it is pointing towards an error in the whl file. Can you help resolve the error?

JorjMcKie · 2022-01-28T11:11:05Z

JorjMcKie
Jan 28, 2022
Maintainer

Have you checked whether this error message is actually true, e.g. via some other ZIP program? 7zip?
I did, and found no errors.
Also the wheel has been created alongside the bunch of the Python version wheels.
Then, depending on the PIP version installed with your zombie Python, it might not accept the file format. Try to upgrade it ... and hope this is still possible for this Python.
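
The check suggested above can also be done from Python itself, since a wheel is an ordinary ZIP archive. The following is a minimal sketch using only the standard library; the function name check_wheel_archive and the two throwaway demo files are hypothetical, and a real test would pass the path of the downloaded .whl file instead.

```python
import os
import tempfile
import zipfile


def check_wheel_archive(path):
    """Return (ok, message) describing whether path is a readable ZIP archive."""
    if not zipfile.is_zipfile(path):
        return False, "not a ZIP archive"
    with zipfile.ZipFile(path) as zf:
        bad_member = zf.testzip()  # name of the first corrupt member, or None
    if bad_member is not None:
        return False, "corrupt member: " + bad_member
    return True, "archive looks intact"


# Demonstration with two throwaway files instead of a real wheel.
tmpdir = tempfile.mkdtemp()
good = os.path.join(tmpdir, "good.whl")
bad = os.path.join(tmpdir, "bad.whl")
with zipfile.ZipFile(good, "w") as zf:
    zf.writestr("demo/__init__.py", "")
with open(bad, "w") as fh:
    fh.write("<html>not a wheel</html>")

print(check_wheel_archive(good))
print(check_wheel_archive(bad))
```

If the downloaded file fails this check while the published one does not, the download itself (for example a saved web page instead of the binary) would be a plausible suspect, though nothing in this thread confirms that.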

12 replies

qwertynik Jan 28, 2022
Author

Command used:
python3.7 -m pip install PyMuPDF-1.19.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

Output:

PyMuPDF-1.19.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl is not a supported wheel on this platform.

Python: Python 3.7.12

OS:

Distributor ID: Ubuntu
Description:    Ubuntu 18.04.6 LTS
Release:        18.04
Codename:       bionic

Any ideas on why there is a failure? @JorjMcKie. Platforms appear to be correct.

JorjMcKie Jan 28, 2022
Maintainer

WTF is going on there?!
It's not a 32bit Python, or is it? Check with python3.7 -c "import sys;print(sys.maxsize > 2**32)". Must be True.
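
Besides the 32/64-bit check, it may help to look at the tags encoded in the wheel's file name, because pip compares them with the tags it considers compatible. The sketch below splits a file name according to the standard wheel naming convention; parse_wheel_filename is a hypothetical helper, not a pip or PyMuPDF function.

```python
def parse_wheel_filename(filename):
    """Split a wheel file name into its naming-convention fields."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file name: " + filename)
    parts = filename[:-4].split("-")
    # Five fields normally, six when an optional build tag is present.
    if len(parts) not in (5, 6):
        raise ValueError("unexpected number of fields: " + filename)
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        # Several platform tags may be joined with dots.
        "platforms": platform_tag.split("."),
    }


info = parse_wheel_filename(
    "PyMuPDF-1.19.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
)
print(info)
```

On reasonably recent pip versions, python3.7 -m pip debug --verbose lists the tags pip accepts. An older pip may not recognize newer manylinux tags at all, which is one possible explanation for the message and is not confirmed in this thread.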

qwertynik Jan 28, 2022
Author

It is not @JorjMcKie

nik@DESKTOP-B3CS5F7:~/Downloads$ python3.7 -c "import sys;print(sys.maxsize > 2**32)"
True

JorjMcKie Jan 28, 2022
Maintainer

I am out of good ideas for now 😒

qwertynik Jan 28, 2022
Author

Hmm, would be great if you could let me know when something lights up.

qwertynik · 2022-03-12T10:23:05Z

qwertynik
Mar 12, 2022
Author

@JorjMcKie

Attempted to generate a PDF with only text using two approaches:

PDF when using TTF files directly (image blurred in this case for confidentiality purposes):

PDF when using fonts extracted from the PDF:

Any ideas on why the PDF's font is broken when generated using the fonts in the PDF?

1 reply

qwertynik Mar 14, 2022
Author

@JorjMcKie Any ideas on why the above is happening?
