Identify background color for character #1201
Replies: 7 comments 5 replies
-
Interesting problem! I have seen this effect myself occasionally: office program exports to PDF do not create annotations. Part of the PDF spec is the 'Redaction' annotation type, which serves as a rectangular marker for page areas to be wiped out (permanently, without leaving traces of the original content). The next step would be to "apply" those redactions, which actually performs that permanent removal.
If I am on the same page with you: PyMuPDF has the technology to do all of the above:
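A minimal sketch of the two steps (mark an area with a redaction annotation, then apply the redactions), assuming a recent PyMuPDF imported as `fitz`. The helper names `redact_areas` and `redact_text_in_file`, the black fill, and the search-by-text driver are illustrative assumptions; `add_redact_annot`, `apply_redactions`, `search_for` and `save` are PyMuPDF methods.

```python
def redact_areas(page, rects, fill=(0, 0, 0)):
    """Mark each rectangle for redaction, then permanently remove the content.

    `page` is expected to be a fitz.Page and `rects` an iterable of
    rect-like objects. Returns the number of areas that were redacted.
    """
    rects = list(rects)
    for rect in rects:
        # Step 1: add a 'Redaction' annotation as a marker for the area.
        page.add_redact_annot(rect, fill=fill)
    if rects:
        # Step 2: apply all redaction annotations on this page, which
        # removes the content underneath them.
        page.apply_redactions()
    return len(rects)


def redact_text_in_file(src, dst, needle):
    """Hypothetical driver: redact every occurrence of `needle` in `src`."""
    import fitz  # PyMuPDF; imported here so redact_areas has no hard dependency

    doc = fitz.open(src)
    total = 0
    for page in doc:
        # search_for returns the rectangles surrounding each hit
        total += redact_areas(page, page.search_for(needle))
    # garbage=4 also drops objects that are no longer referenced
    doc.save(dst, garbage=4, deflate=True)
    doc.close()
    return total
```

Something like `redact_text_in_file("in.pdf", "out.pdf", "secret")` would then write a redacted copy; both file names are placeholders.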
How safe / secure is this approach?
-
No worries at all! Didn't I write, you should not hesitate asking for help ... Your example is indeed weird. It is a rectangle overlaid with another one (a so-called "clip" rectangle) with a special spec that wipes out what is underneath. It would probably also make sense to:
-
Very good advice, thanks! By the way, since we’re corresponding, I thought I’d pass along a PDF that I found that yields an error when I do a get_drawings() on its one page. It’s not a problem for my program; I just catch the exception and skip the file (such files seem to be quite rare in my dataset). But I thought I’d pass it along, in case you’re interested. Here’s the error I get:
Traceback (most recent call last):
  File "Q:\e\redact\showdrawings.py", line 11, in <module>
    printdrawings ("testc.pdf")
  File "Q:\e\redact\showdrawings.py", line 7, in printdrawings
    for drawing in page.get_drawings():
  File "C:\Users\Eugene\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\fitz\fitz.py", line 6006, in get_drawings
    path["rect"] = Rect(x0, y0, x1, y1)
UnboundLocalError: local variable 'x0' referenced before assignment
This happens when running this program:
import fitz

def printdrawings (file):
    print (">>>", file)
    doc = fitz.open(file)
    for page in doc:
        for drawing in page.get_drawings():
            print (drawing)
    doc.close()

printdrawings ("testc.pdf")
Eugene
From: Jorj X. McKie ***@***.***>
Sent: Thursday, August 12, 2021 8:37 AM
To: pymupdf/PyMuPDF ***@***.***>
Cc: Volokh, Eugene ***@***.***>; Author ***@***.***>
Subject: Re: [pymupdf/PyMuPDF] Identify background color for character (#1201)
No worries at all! Didn't I write, you should not hesitate asking for help ...
Your example is indeed weird. It is a rectangle overlaid with another one (a so-called "clip" rectangle) with a special spec that wipes out what is underneath.
Don't ask me why the original rect was specified at all in the first place.
Anyway, "clip" rectangles are currently not returned by page.get_drawings(). And if they were, it would be quite a hassle to conclude from their specs that another rect's color doesn't count ...
So currently I can only recommend using heuristics to exclude rectangles that cannot make sense. In this special case, the rectangle has a much greater height than width and covers several lines, as opposed to what you typically have, as in testb.pdf: a rectangle covering some part of one line, with a height:width ratio much smaller than 1.
It would probably also make sense to:
1. restrict the set of paths in page.get_drawings() to those which are rectangles in the first place:
   * path["items"] has length 1
   * the draw command in that only item is "re"
2. disregard paths with min(path["rect"].width, path["rect"].height) <= threshold, e.g. threshold = 3
3. potentially disregard rectangles with a height:width ratio (much) larger than 1
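The three heuristics above could be sketched roughly as follows. This is only an illustration: the function names, the default `max_height_to_width=1.0`, the extra "must have a fill color" check, and the center-point lookup in `background_color` are assumptions, and the ratio test assumes the intent is to drop rectangles that are much taller than wide. The path keys `"items"`, `"rect"` and `"fill"` are those of the dictionaries returned by `page.get_drawings()`.

```python
def is_plain_rectangle(path):
    """True if the path consists of exactly one 're' (rectangle) draw command."""
    items = path.get("items", [])
    return len(items) == 1 and items[0][0] == "re"


def plausible_highlights(paths, threshold=3, max_height_to_width=1.0):
    """Reduce get_drawings() output to filled rectangles that could sit behind text."""
    keep = []
    for path in paths:
        if path.get("fill") is None:      # assumption: a highlight must be filled
            continue
        if not is_plain_rectangle(path):  # heuristic 1: single "re" item only
            continue
        rect = path["rect"]
        if min(rect.width, rect.height) <= threshold:  # heuristic 2: too thin
            continue
        if rect.height > max_height_to_width * rect.width:  # heuristic 3: too tall
            continue
        keep.append(path)
    return keep


def background_color(char_bbox, highlights):
    """Fill color of the last rectangle containing the center of char_bbox, or None.

    char_bbox is an (x0, y0, x1, y1) tuple, e.g. a character's "bbox" entry
    from page.get_text("rawdict").
    """
    x0, y0, x1, y1 = char_bbox
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    color = None
    for path in highlights:
        r = path["rect"]
        if r.x0 <= cx <= r.x1 and r.y0 <= cy <= r.y1:
            color = path["fill"]  # assumes later paths paint over earlier ones
    return color
```

With a `fitz.Page` at hand, `plausible_highlights(page.get_drawings())` would give the candidate rectangles, and `background_color` can then be called once per character without rendering any pixmaps.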
-
Aha, thanks very much!
From: Jorj X. McKie ***@***.***>
Sent: Thursday, August 12, 2021 10:27 AM
To: pymupdf/PyMuPDF ***@***.***>
Cc: Volokh, Eugene ***@***.***>; Author ***@***.***>
Subject: Re: [pymupdf/PyMuPDF] Identify background color for character (#1201)
This is a bug in PyMuPDF, detected just a few days ago.
It only happens with documents containing paths without any draw commands.
Those paths make no sense of course, but my logic should be immune against it.
I am working on a new version, where this is fixed and at the same time that method page.get_drawings() will be very much faster ...
-
Since you’re rolling out a new version, I thought I might note another glitch – when I run this Python program on the attached file,
import fitz
doc = fitz.open("testd.pdf")
for page in doc:
    pass
doc.close()
I get the following:
mupdf: kid not found in parent's kids array
mupdf: kid not found in parent's kids array
mupdf: kid not found in parent's kids array
mupdf: kid not found in parent's kids array
mupdf: kid not found in parent's kids array
mupdf: kid not found in parent's kids array
mupdf: kid not found in parent's kids array
mupdf: kid not found in parent's kids array
mupdf: kid not found in parent's kids array
mupdf: kid not found in parent's kids array
mupdf: kid not found in parent's kids array
mupdf: kid not found in parent's kids array
mupdf: kid not found in parent's kids array
mupdf: kid not found in parent's kids array
mupdf: kid not found in parent's kids array
mupdf: kid not found in parent's kids array
mupdf: kid not found in parent's kids array
mupdf: kid not found in parent's kids array
mupdf: kid not found in parent's kids array
mupdf: kid not found in parent's kids array
I also get this on a few other files from the dataset of several thousand that I’ve downloaded. Thanks,
Eugene
-
Running this through:
import fitz
doc = fitz.open("teste.pdf")
for page in doc:
    pass
doc.close()
yields:
mupdf: cannot find page 0 in page tree
-
Very helpful, thanks!
From: Jorj X. McKie ***@***.***>
Sent: Friday, August 13, 2021 1:39 AM
To: pymupdf/PyMuPDF ***@***.***>
Cc: Volokh, Eugene ***@***.***>; Author ***@***.***>
Subject: Re: [pymupdf/PyMuPDF] Identify background color for character (#1201)
Messages like mupdf: cannot find page 0 in page tree and mupdf: kid not found in parent's kids array indicate PDFs with a damaged internal file structure.
I know they look frightening, but MuPDF is more often than not able to overcome these things by e.g. re-scanning (parts of) the file and then rebuilding damaged information.
Apart from sloppy or buggy PDF creation processes, incomplete downloads are often the cause of these problems.
There is little to nothing you can do otherwise about it.
You may want to save such PDFs with the maximum garbage collection level 4 and hope for the best. You could also create a completely new PDF and then insert all the old file's pages:
import fitz
old = fitz.open("damaged.pdf")
new = fitz.open()  # new, empty PDF
new.insert_pdf(old)  # copy all pages of the old file
new.set_metadata(old.metadata)  # copy over old metadata
new.set_toc(old.get_toc(True))  # copy over old table of contents
The new PDF would look like the old one. But it is a page-wise process: any old PDF content not referenced by its pages will therefore not be copied. I have explicitly included the copying of two examples of such content ... but there may be more.
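The other option mentioned, re-saving with garbage collection level 4, might be sketched like this. `save_cleaned` and `resave_file` are hypothetical helper names; collecting the MuPDF messages via `fitz.TOOLS.mupdf_warnings()` is an optional extra that is not part of the advice above, and it assumes those messages end up in PyMuPDF's warnings store.

```python
def save_cleaned(doc, dst):
    """Save an open document with the maximum garbage collection level."""
    doc.save(dst, garbage=4)


def resave_file(src, dst):
    """Hypothetical driver: open `src`, touch every page, save a cleaned copy.

    Returns the messages MuPDF recorded while reading the file, one per line.
    """
    import fitz  # PyMuPDF; imported here so save_cleaned has no hard dependency

    fitz.TOOLS.reset_mupdf_warnings()  # start with an empty message store
    doc = fitz.open(src)
    for page in doc:
        pass  # loading each page lets MuPDF detect and repair page tree issues
    messages = fitz.TOOLS.mupdf_warnings().splitlines()
    save_cleaned(doc, dst)
    doc.close()
    return messages
```

An empty return value would suggest the file loaded cleanly; a non-empty one lists what MuPDF had to work around.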
Best would probably be just looking away 😎.
-
When someone highlights a word in Word, and then creates a PDF, that highlighting doesn't appear to be an annotation, but presumably it's stored somewhere in the PDF, as an indication of the background color that corresponds to a particular character. My question: How can one extract that background color information, presumably on a character-by-character basis? (One possible answer is to create a pixmap from the bbox for each character and iterate through the pixels, averaging the colors together; but that's very slow, and seems like a needlessly complex workaround for what I expect could be done much more directly.)
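For reference, the pixmap workaround described in the parenthesis might look roughly like the sketch below, assuming PyMuPDF's `page.get_text("rawdict")` for character boxes and `page.get_pixmap(clip=...)` for rendering. The function names are made up for illustration, and the average necessarily includes the glyph's own pixels, not just the background.

```python
def average_color(samples, n):
    """Average the first three components of packed pixel data.

    `samples` is a bytes-like object with `n` bytes per pixel (n >= 3),
    as provided by Pixmap.samples and Pixmap.n for an RGB pixmap.
    Returns a tuple of three floats, or None if there are no pixels.
    """
    count = len(samples) // n
    if count == 0:
        return None
    totals = [0, 0, 0]
    for i in range(0, count * n, n):
        for c in range(3):
            totals[c] += samples[i + c]
    return tuple(t / count for t in totals)


def char_average_colors(page):
    """Yield (character, average RGB) for every character on a fitz.Page."""
    for block in page.get_text("rawdict")["blocks"]:
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line["spans"]:
                for char in span["chars"]:
                    # One small rendering per character: this is the slow part.
                    pix = page.get_pixmap(clip=char["bbox"])
                    yield char["c"], average_color(pix.samples, pix.n)
```

The per-character rendering is what makes this approach slow on large document sets.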
The background: My son (who's in high school) and I are trying to create a free program that courts can use to spot and fix insecure redactions of text -- situations where a lawyer redacted sensitive material not by using a proper redaction tool, but by highlighting the text in black. That makes it look redacted on the screen, but the text remains in the file (both in Word and in PDF generated from Word), and can easily be copied and pasted from the file. The result can be a massive loss of privacy (and indeed sometimes risk to life, for instance if the redacted material is the name of a witness to a gang crime or some such).
The right solution, of course, is to have lawyers and judges properly redact things in the first place. But I've seen many documents where that can't be done (I'm a law professor at UCLA, and I study such things), so we'd like to write a tool that courts can use to take care of that. And we'd like it to be at least moderately efficient, since the plan would be to scan many documents to find the redaction fails.
Many thanks,
Eugene