-
I've noticed the same issue. When we use a code snippet from the documentation to strip the header and footer, it seems to work properly:

from PyPDF2 import PdfReader  # the same for pypdf

reader = PdfReader("GeoBase_NHNC1_Data_Model_UML_EN.pdf")
page = reader.pages[3]
parts = []

def visitor_body(text, cm, tm, fontDict, fontSize):
    # tm[5] is the y translation of the text matrix
    y = tm[5]
    if 50 < y < 720:  # page.artbox = RectangleObject([0, 0, 612, 792])
        parts.append(text)

page.extract_text(visitor_text=visitor_body)
text_body = "".join(parts)
print(text_body)

Now, let's focus on the last page and try to reproduce the layout of the vector graphics on that page:

import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
from PyPDF2 import PdfReader

reader = PdfReader("GeoBase_NHNC1_Data_Model_UML_EN.pdf")
page = reader.pages[18]
coo = []

def visitor_before(op, args, cm, tm):
    # "re" is the PDF operator that appends a rectangle: x y width height
    if op == b"re":
        coo.append([args[i].as_numeric() for i in range(4)])

def visitor_text(text, cm, tm, fontDict, fontSize):
    # the text itself is not needed here
    pass

page.extract_text(visitor_operand_before=visitor_before, visitor_text=visitor_text)

# draw the collected rectangles on a canvas the size of the page (612x792)
fig, ax = plt.subplots()
fig.set_dpi(300)
ax.set_aspect(1)
ax.set_xlim(0, 612)
ax.set_ylim(0, 792)
for c in coo:
    ax.add_patch(Rectangle((c[0], c[1]), c[2], c[3], fill=False, lw=0.2))
plt.show()

What we expect and what we get is:

[Images: expected layout vs. extracted layout]

Please note that the rectangles related to the header and footer are positioned much better.
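One thing that might be worth trying with the rectangle experiment above is to map each rectangle through the `cm` value passed to `visitor_before` before drawing it. The following is only a minimal sketch: the helper names `transform_point` and `rect_in_page_space` are hypothetical (they are not part of PyPDF2 or pypdf), `cm` is assumed to be a six-element list in PDF order `[a, b, c, d, e, f]`, and the matrix is assumed to contain only scaling, flipping and translation (no rotation or skew). Whether this fully fixes the layout for this particular file has not been verified.

```python
def transform_point(cm, x, y):
    """Map a point from user space to page space with a PDF matrix [a, b, c, d, e, f]."""
    a, b, c, d, e, f = cm
    return a * x + c * y + e, b * x + d * y + f


def rect_in_page_space(cm, x, y, w, h):
    """Transform the operands of an 're' operator (x, y, width, height).

    Assumption: cm has no rotation or skew, so the result is still an
    axis-aligned rectangle. Returns (left, bottom, width, height).
    """
    x0, y0 = transform_point(cm, x, y)
    x1, y1 = transform_point(cm, x + w, y + h)
    left, bottom = min(x0, x1), min(y0, y1)
    return left, bottom, abs(x1 - x0), abs(y1 - y0)


if __name__ == "__main__":
    # Hypothetical matrix that flips the y axis on a 792 pt high page.
    flip = [1, 0, 0, -1, 0, 792]
    print(rect_in_page_space(flip, 10, 20, 100, 50))
```

In the visitor, this would mean storing `rect_in_page_space(cm, *numbers)` instead of the raw four operands, where `numbers` is the list built from `args`.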
-
I'm using pypdf 3.7.0 to extract text from a pdf file. I need the text's location to do subsequent operations, so I extract the text along with its x and y coordinates from the text matrix. However, while there is no issue with the x coordinate, there is something wrong with the y coordinates.
I checked the page size to make sure the file wasn't scaled, and it is as expected (612x792).
I think one way to solve this issue is to combine the transformation matrix (cm) with the text matrix (tm), but I haven't figured out how to do that.
Note: The reason I suspect the transformation matrix (cm) is that for other PDF files its value is [1, 0, 0, 1, 0, 0], whereas for this file the values of cm keep changing (especially the last two elements of the matrix).
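Combining cm with tm, as considered above, could look roughly like the sketch below. It assumes both matrices are six-element lists in PDF order `[a, b, c, d, e, f]`, and that the page-space position comes from applying tm first and cm second (the row-vector convention of the PDF specification). The helper names `multiply` and `text_origin_on_page` are hypothetical and not part of pypdf, and the sample matrices are made up for illustration.

```python
def multiply(m, n):
    """Return the PDF matrix equivalent to applying m first, then n.

    Each 6-element list [a, b, c, d, e, f] stands for the 3x3 matrix
    [[a, b, 0], [c, d, 0], [e, f, 1]].
    """
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return [
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    ]


def text_origin_on_page(cm, tm):
    """Return the (x, y) of the text origin after applying tm, then cm."""
    combined = multiply(tm, cm)
    return combined[4], combined[5]


if __name__ == "__main__":
    # Made-up values: cm shifts everything up by 100 pt.
    cm = [1, 0, 0, 1, 0, 100]
    tm = [1, 0, 0, 1, 72, 600]
    print(text_origin_on_page(cm, tm))
```

Inside a `visitor_text` callback, `text_origin_on_page(cm, tm)` would then replace the direct use of `tm[4]` and `tm[5]`. When cm is the identity `[1, 0, 0, 1, 0, 0]`, the result equals `(tm[4], tm[5])`, which would be consistent with other files behaving correctly.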
Link to the pdf file: https://drive.google.com/file/d/10KMQVAJPB2hQSOOT6OrnF0RGg82k6i31/view?usp=sharing
Below is a code example of the first page. (The issue happens with all the pages)
I printed the result of some of the transformation and text matrices: