Does not recognize dashes (lines) #1586
-
@richramalho
Describe the bug (mandatory)
I have this test.pdf and I want to extract the text from it, but I need the dashes (lines) after each subtext to appear in the output so that I can separate these texts. Is there any way to do this? Thanks for the library, it works very well. Sorry for my English, I am not yet fluent in the language.
Replies: 6 comments 1 reply
-
Thanks for the appreciation! But what is the bug?
-
There isn't really a bug, but I couldn't create any other type of issue (only bug or feature). I just want to know whether there is a solution for my case.
-
Ah, ok. You should have created a "Discussions" item. I will convert this then.
-
You mean the line after each "DECRETO"?
-
No, I mean these marked lines: in the extraction they are simply ignored, and I didn't see any way to turn them into some character.
-
Yes, that is what I was referring to!
These are drawings. They will not be contained in any page.get_text() output. The method page.get_drawings() extracts them, together with other drawing items like rectangles, curves and so on. You could select these items like this:
>>> import fitz  # PyMuPDF
>>> doc = fitz.open("test.pdf")
>>> page = doc[0]
>>> paths = page.get_drawings()  # get all drawing items
>>> limit = page.rect.width / 2  # only select shorter items
>>> height = 2  # only select lower items
>>> for p in paths:
...     if p["rect"].height <= height and p["rect"].width < limit:
...         page.draw_rect(p["rect"], color=(1, 0, 0))  # outline the item in red
...
Point(53.858299255371094, 500.12969970703125)
Point(53.858299255371094, 689.8663940429688)
Point(319.2283020019531, 374.6197204589844)
Point(319.2283020019531, 545.298583984375)
Point(319.2283020019531, 725.7590942382812)
>>> doc.ez_save("x.pdf")
In the saved x.pdf, each selected line is outlined in red.
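Coming back to the original question, here is a minimal sketch of how the selected drawings could be turned into a separator string in the extracted text. It is only one possible approach, not a built-in PyMuPDF feature. The helper names (is_separator, merge_in_reading_order, extract_with_separators) and the "-----" marker are hypothetical. The thresholds are the ones used above, and the code assumes a two-column page, as the x-coordinates in the output above suggest.

```python
def is_separator(rect, max_height, max_width):
    """Return True if rect = (x0, y0, x1, y1) is a short, flat drawing."""
    x0, y0, x1, y1 = rect
    return (y1 - y0) <= max_height and (x1 - x0) < max_width


def merge_in_reading_order(blocks, separators, page_width, marker="-----"):
    """Merge text blocks and separator rectangles into one list of strings.

    blocks: iterable of (x0, y0, x1, y1, text)
    separators: iterable of (x0, y0, x1, y1)
    Assumption: two columns, split at the middle of the page.
    """
    mid = page_width / 2
    items = []
    for x0, y0, x1, y1, text in blocks:
        items.append((0 if x0 < mid else 1, y0, text.strip()))
    for x0, y0, x1, y1 in separators:
        items.append((0 if x0 < mid else 1, y0, marker))
    # Left column first, then top to bottom inside each column.
    items.sort(key=lambda item: (item[0], item[1]))
    return [text for _, _, text in items]


def extract_with_separators(pdf_path, page_number=0, marker="-----"):
    """Extract page text, inserting marker wherever a separator line is drawn."""
    import fitz  # PyMuPDF, imported here so the helpers above work without it

    doc = fitz.open(pdf_path)
    page = doc[page_number]
    width = page.rect.width
    separators = [
        tuple(p["rect"])
        for p in page.get_drawings()
        if is_separator(tuple(p["rect"]), max_height=2, max_width=width / 2)
    ]
    # "blocks" items are (x0, y0, x1, y1, text, block_no, block_type);
    # block_type 0 means a text block.
    blocks = [b[:5] for b in page.get_text("blocks") if b[6] == 0]
    return merge_in_reading_order(blocks, separators, width, marker)
```

Calling "\n".join(extract_with_separators("test.pdf")) would then give the page text with a marker line wherever a short drawing was found. If a text block spans one of the lines, the marker ends up before or after the whole block, so a finer granularity such as page.get_text("words") may be needed in that case.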