Can I force mapping of pages to lines to be the same using get_text("words") and get_text("rawdict")? #1375
Replies: 8 comments 3 replies
-
There actually should be no deviation between the line number within a word item
-
For the test file you gave me, there is no such deviation.
-
Aha! Found an error: the "words" in this context are defined as strings surrounded by spaces. If a line contains only spaces, then I obviously do not increase the line count ...
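The effect described here can be reproduced without PyMuPDF on a hand-built dictionary shaped like the output of page.get_text("rawdict"). This is only a sketch under assumptions: it assumes that, before the fix, a line consisting solely of spaces yields no words and therefore gets no line number of its own in the "words" output. The helper names (make_line, line_text, lines_with_words) and the sample data are hypothetical, and real char entries carry more keys (for example bbox and origin) than the "c" key used here.

```python
def make_line(text):
    """Build a minimal rawdict-style line: one span, one char dict per character."""
    return {"spans": [{"chars": [{"c": ch} for ch in text]}]}


def line_text(line):
    """Join the characters of all spans in one rawdict-style line."""
    return "".join(ch["c"] for span in line["spans"] for ch in span["chars"])


def lines_with_words(page_dict):
    """Return (block index, line index) for every text line that contains
    at least one non-space character, in document order.

    Assumption: only these lines would have been counted by the old
    "words" numbering.
    """
    kept = []
    for b_idx, block in enumerate(page_dict["blocks"]):
        if block.get("type", 0) != 0:  # skip non-text (image) blocks
            continue
        for l_idx, line in enumerate(block["lines"]):
            if line_text(line).strip():  # whitespace-only lines are skipped
                kept.append((b_idx, l_idx))
    return kept


# Hypothetical page: the middle line contains only spaces.
sample_page = {
    "blocks": [
        {
            "type": 0,
            "lines": [make_line("Hello world"), make_line("   "), make_line("Next line")],
        }
    ]
}

kept_lines = lines_with_words(sample_page)
print(kept_lines)
```

Under that assumption, position i in the returned list is the rawdict line that corresponds to the i-th line counted from words, so an old line-number blacklist could be translated by indexing into this list.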
-
Okay, found a few minor issues:
After fixing these issues things are working as expected ...
-
Thanks very much!! Are these fixes isolated enough that I could apply them myself now, by creating local copies of the relevant functions? If not, any idea when the next release might be due?
-
The changes are in both Python and C code, so you will need a new wheel. I can send you a preliminary one when I am done. What was your platform again?
-
The wheel is here. Look at "Artifacts" at the bottom. There is a ZIP file which should contain your wheel.
-
Thanks: got it. Verified bug #1364 is fixed (thanks!). I (embarrassedly) also noticed some remaining differences due to my having applied
and with
and similarly for every page transition. Not sure why With
Without
So it seems pretty clear that Mike
-
Hi: I've noticed that the breakdown of a document into lines (the mapping of each page to an array of lines) is slightly different when I use get_text("words") vs get_text("rawdict"): in my current document the first several hundred lines are identical, but further down the document they get out of sync. Is there any way to force the line breakdown to be the same irrespective of whether I specify "words" or "rawdict"? (The reason is: I've created a number of blacklists of line numbers (using get_text("words")) which I now want to exclude from processing. This worked fine when the processing also used get_text("words"), but now I've had to switch to get_text("rawdict") (to get access to span-level metadata), so as it stands my line-number-based blacklists are suddenly unusable because they (eventually) don't match the lines coming out of get_text("rawdict").) (I realize that going forward I should use "rawdict" to extract line numbers to create new blacklists, but I don't want to have to re-create my existing blacklists (there are a lot of them!)).
Thanks!
Mike
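One way to locate where the two numberings drift apart is to rebuild the lines from the word items and compare them with the rawdict line texts. The following is a minimal sketch on hand-made sample data, not PyMuPDF's own logic: it assumes each word item has the layout (x0, y0, x1, y1, word, block_no, line_no, word_no) and that words arrive in document order. The function names and the sample values are hypothetical.

```python
from itertools import groupby


def lines_from_words(words):
    """Group consecutive word items by (block_no, line_no) and join their text."""
    return [
        " ".join(w[4] for w in group)
        for _, group in groupby(words, key=lambda w: (w[5], w[6]))
    ]


def first_mismatch(lines_a, lines_b):
    """Return the first index at which the two line lists differ, or None."""
    for i, (a, b) in enumerate(zip(lines_a, lines_b)):
        if a != b:
            return i
    if len(lines_a) != len(lines_b):
        return min(len(lines_a), len(lines_b))
    return None


# Hypothetical word items: (x0, y0, x1, y1, word, block_no, line_no, word_no)
sample_words = [
    (72, 72, 100, 84, "Hello", 0, 0, 0),
    (104, 72, 140, 84, "world", 0, 0, 1),
    (72, 100, 100, 112, "Next", 0, 1, 0),
    (104, 100, 130, 112, "line", 0, 1, 1),
]

# Hypothetical rawdict line texts; the second line holds only spaces.
sample_raw_lines = ["Hello world", "   ", "Next line"]

word_lines = lines_from_words(sample_words)
# Collapse whitespace so both sides are compared in the same normalized form.
raw_lines = [" ".join(text.split()) for text in sample_raw_lines]

mismatch_at = first_mismatch(word_lines, raw_lines)
print(word_lines, mismatch_at)
```

On a real page the inputs would come from page.get_text("words") and from joining the "c" values of each line in page.get_text("rawdict"). The reported index is where a line-number blacklist built from one output stops matching the other.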