Identify background color for character #1201

eugenevolokh · 2021-08-10T20:02:02Z

eugenevolokh
Aug 10, 2021

When someone highlights a word in Word, and then creates a PDF, that highlighting doesn't appear to be an annotation, but presumably it's stored somewhere in the PDF, as an indication of the background color that corresponds to a particular character. My question: How can one extract that background color information, presumably on a character-by-character basis? (One possible answer is to create a pixmap from the bbox for each character and iterate through the pixels, averaging the colors together; but that's very slow, and seems like a needlessly complex workaround for what I expect could be done much more directly.)

The background: My son (who's in high school) and I are trying to create a free program that courts can use to spot and fix insecure redactions of text -- situations where a lawyer redacted sensitive material not by using a proper redaction tool, but by highlighting the text in black. That makes it look redacted on the screen, but the text remains in the file (both in Word and in PDF generated from Word), and can easily be copied and pasted from the file. The result can be a massive loss of privacy (and indeed sometimes risk to life, for instance if the redacted material is the name of a witness to a gang crime or some such).

The right solution, of course, is to have lawyers and judges properly redact things in the first place. But I've seen many documents where that can't be done (I'm a law professor at UCLA, and I study such things), so we'd like to write a tool that courts can use to take care of that. And we'd like it to be at least moderately efficient, since the plan would be to scan many documents to find the redaction fails.

Many thanks,

Eugene

JorjMcKie · 2021-08-10T20:46:16Z

JorjMcKie
Aug 10, 2021
Maintainer

Interesting problem! I have seen this effect myself occasionally: office program exports to PDF do not create annotations.
Which is quite ok, because annotations conceptually are like dust on a picture: it can be wiped away without requiring full PDF permissions.

Part of the PDF spec is the 'Redaction' annotation type, which serves as a rectangular marker for page areas to be wiped out (permanently, without leaving traces of the original content). The next step would be to "apply" those redactions, which actually performs that permanent removal.
This 2-step process can be interrupted, so that a reviewer may approve step 2 first, etc.
Your case seems to be:

1. Locating page rectangles with a specific color (i.e. black?)
2. Locating text, in units of so-called "spans", or words, or even just single characters, that is located inside some of those black rectangles
3. Creating a redaction annotation for each of those matches
4. Applying all redaction annotations on that page.

If I am on the same page with you: PyMuPDF has the technology to do all of the above:

1. page.get_drawings() retrieves everything that is neither text nor a regular image. Technically, it is a list of Python dictionaries.
2. Make a selection from that list, consisting of the rectangles of a certain color (e.g. black).
3. Make a list of words on the page and check which ones are contained in one of the black rectangles. There are several text extraction alternatives: page.get_text("words") for words, or page.get_text("dict") and page.get_text("rawdict") for text spans and single characters, respectively.
4. For each match, make a redaction annotation.
5. Apply the redactions.
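
The steps above could be sketched roughly as follows. This is only a minimal sketch, not the maintainer's code: the helper names (is_dark, contains, covered_words, redact_hidden_text), the 0.2 darkness limit and the 1-point tolerance are assumptions made for illustration. It also assumes that fills arrive as gray or RGB components between 0 and 1, and that containment (rather than partial overlap) is the right test for your documents.

```python
def is_dark(fill, limit=0.2):
    """True if every component of a gray or RGB fill is at or below `limit`."""
    return fill is not None and all(c <= limit for c in fill)


def contains(outer, inner, tol=1.0):
    """True if `inner` lies inside `outer` (both x0, y0, x1, y1), within `tol` points."""
    return (inner[0] >= outer[0] - tol and inner[1] >= outer[1] - tol
            and inner[2] <= outer[2] + tol and inner[3] <= outer[3] + tol)


def covered_words(drawings, words, limit=0.2):
    """Return the word tuples that sit inside any dark filled drawing."""
    dark = [tuple(d["rect"]) for d in drawings if is_dark(d.get("fill"), limit)]
    return [w for w in words if any(contains(r, w[:4]) for r in dark)]


def redact_hidden_text(in_path, out_path):
    """Run steps 1-5 on a real file; this part needs PyMuPDF installed."""
    import fitz  # imported here so the helpers above work without PyMuPDF

    doc = fitz.open(in_path)
    for page in doc:
        # Each word is (x0, y0, x1, y1, text, block_no, line_no, word_no).
        hits = covered_words(page.get_drawings(), page.get_text("words"))
        for w in hits:
            page.add_redact_annot(fitz.Rect(w[:4]))
        if hits:
            page.apply_redactions()  # permanently removes the marked text
    doc.save(out_path)
    doc.close()
```

Keeping the selection logic in plain functions means it can be tried on hand-written rectangles and word tuples before it is pointed at real court filings.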

How safe / secure is this approach?

- I cannot determine whether a black rectangle is "above" the text or "below" it. So there is a risk of undesired removal of text. However, as text color can also be extracted, additional checks help minimize that risk.
- "Black" may not mean really black: depending on how those rectangles have been drawn, you may have to extend the search to some dark "shade of gray". This depends on your experience with your documents.
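
One possible form of such an additional check is to compare the text color with the fill color: light text on a dark box is probably meant to be read, while dark text on a dark box is probably a failed redaction. The sketch below assumes the span "color" value from page.get_text("dict") is an sRGB integer (0xRRGGBB) and that the fill is a gray or RGB sequence of floats; the function names and the 0.3 contrast limit are hypothetical and would need tuning against real documents.

```python
def srgb_int_to_rgb(color):
    """Split an integer 0xRRGGBB into three floats between 0 and 1."""
    return (((color >> 16) & 255) / 255,
            ((color >> 8) & 255) / 255,
            (color & 255) / 255)


def brightness(components):
    """Average the components of a gray (1 value) or RGB (3 values) color."""
    return sum(components) / len(components)


def probably_hidden(span_color, fill, max_contrast=0.3):
    """True if text and fill are too close in brightness to tell apart."""
    text = brightness(srgb_int_to_rgb(span_color))
    return abs(text - brightness(fill)) <= max_contrast
```

Because the comparison is on brightness rather than exact equality, the same test also covers boxes drawn in a dark shade of gray instead of pure black.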

3 replies

eugenevolokh Aug 11, 2021
Author

Tremendously helpful, many thanks! Working on implementing this now.

JorjMcKie Aug 11, 2021
Maintainer

Please do ask if help is needed.

eugenevolokh Aug 12, 2021
Author

Very sorry to trouble you again, but something about the drawings leaves me confused. Most of the time, I can figure out the significance of the drawings in the files I'm reading (court filings). There are rectangles with very narrow width or height -- those are vertical lines or underlines. There are rectangles of roughly 14 pixels height -- those are redactions. All good!

But sometimes I see a rectangle that looks huge, but doesn't correspond to any visible black marks on the page, e.g.,
{'color': None, 'fill': [0.0], 'width': 1.0, 'lineJoin': 0, 'lineCap': (0, 0, 0), 'dashes': '[] 0', 'closePath': False, 'even_odd': False, 'rect': Rect(21.600000381469727, 72.0, 57.599998474121094, 720.0009765625), 'items': [('re', Rect(21.600000381469727, 72.0, 57.599998474121094, 720.0009765625))], 'opacity': 1.0}

Looks like it should be something big, occupying much of the left-hand inch (21.6 pixels to 57.6 pixels) throughout most of the page (72 pixels to 720 pixels, which is to say 1 inch to 10 inches). And yet all I see there is text on a normal white background. Now if it had something indicating that the drawing was colored white, or if it had some transparency flag set, then I'd understand; but all the flags I see when I print a drawing object look the same as for a block of redacted text. It's as if there's some element of the drawing object that print isn't showing me -- but why would that be?

You should be able to see that in the first attached file, fourth drawing; compare to the redaction drawings in the second attached file. When I wrote this test program:

import fitz

def printdrawings (file):
    print (">>>", file)
    doc = fitz.open(file)
    for page in doc:
        for drawing in page.get_drawings():
            print (drawing)
    doc.close()

printdrawings ("testa.pdf")
printdrawings ("testb.pdf")

I got this result; I tried to bold below the mystery object in testa.pdf, and for comparison a black redaction box in testb.pdf

testa.pdf
{'color': [1.0], 'fill': None, 'width': 0.7200000286102295, 'lineJoin': 1, 'lineCap': (0, 0, 0), 'dashes': '[] 0', 'closePath': False, 'even_odd': False, 'rect': Rect(68.4000015258789, 0.0, 68.4000015258789, 792.0020141601562), 'items': [('l', Point(68.4000015258789, 0.0), Point(68.4000015258789, 792.0020141601562))], 'opacity': 1.0}
{'color': [1.0], 'fill': None, 'width': 0.7200000286102295, 'lineJoin': 1, 'lineCap': (0, 0, 0), 'dashes': '[] 0', 'closePath': False, 'even_odd': False, 'rect': Rect(64.80000305175781, 0.0, 64.80000305175781, 792.0020141601562), 'items': [('l', Point(64.80000305175781, 0.0), Point(64.80000305175781, 792.0020141601562))], 'opacity': 1.0}
{'color': [1.0], 'fill': None, 'width': 0.7200000286102295, 'lineJoin': 1, 'lineCap': (0, 0, 0), 'dashes': '[] 0', 'closePath': False, 'even_odd': False, 'rect': Rect(579.0009765625, 0.0, 579.0009765625, 792.0020141601562), 'items': [('l', Point(579.0009765625, 0.0), Point(579.0009765625, 792.0020141601562))], 'opacity': 1.0}
{'color': None, 'fill': [0.0], 'width': 1.0, 'lineJoin': 0, 'lineCap': (0, 0, 0), 'dashes': '[] 0', 'closePath': False, 'even_odd': False, 'rect': Rect(21.600000381469727, 72.0, 57.599998474121094, 720.0009765625), 'items': [('re', Rect(21.600000381469727, 72.0, 57.599998474121094, 720.0009765625))], 'opacity': 1.0}
{'color': None, 'fill': [1.0], 'width': 1.0, 'lineJoin': 0, 'lineCap': (0, 0, 0), 'dashes': '[] 0', 'closePath': False, 'even_odd': False, 'rect': Rect(71.30400085449219, 464.7109680175781, 310.2440185546875, 465.1910095214844), 'items': [('re', Rect(71.30400085449219, 464.7109680175781, 310.2440185546875, 465.1910095214844))], 'opacity': 1.0}
{'color': None, 'fill': [1.0], 'width': 1.0, 'lineJoin': 0, 'lineCap': (0, 0, 0), 'dashes': '[] 0', 'closePath': False, 'even_odd': False, 'rect': Rect(310.2510070800781, 324.531005859375, 310.7309875488281, 465.1910095214844), 'items': [('re', Rect(310.2510070800781, 324.531005859375, 310.7309875488281, 465.1910095214844))], 'opacity': 1.0}
{'color': None, 'fill': [1.0], 'width': 1.0, 'lineJoin': 0, 'lineCap': (0, 0, 0), 'dashes': '[] 0', 'closePath': False, 'even_odd': False, 'rect': Rect(310.2510070800781, 464.7109680175781, 310.7309875488281, 465.1910095214844), 'items': [('re', Rect(310.2510070800781, 464.7109680175781, 310.7309875488281, 465.1910095214844))], 'opacity': 1.0}
testb.pdf
{'color': None, 'fill': [0.0], 'width': 1.0, 'lineJoin': 0, 'lineCap': (0, 0, 0), 'dashes': '[] 0', 'closePath': False, 'even_odd': False, 'rect': Rect(108.0, 120.489990234375, 540.0, 121.67999267578125), 'items': [('re', Rect(108.0, 120.489990234375, 540.0, 121.67999267578125))], 'opacity': 1.0}
{'color': None, 'fill': [0.0], 'width': 1.0, 'lineJoin': 0, 'lineCap': (0, 0, 0), 'dashes': '[] 0', 'closePath': False, 'even_odd': False, 'rect': Rect(108.0, 134.28997802734375, 149.63998413085938, 135.47998046875), 'items': [('re', Rect(108.0, 134.28997802734375, 149.63998413085938, 135.47998046875))], 'opacity': 1.0}
{'color': None, 'fill': [0.0], 'width': 1.0, 'lineJoin': 0, 'lineCap': (0, 0, 0), 'dashes': '[] 0', 'closePath': False, 'even_odd': False, 'rect': Rect(141.22999572753906, 232.20001220703125, 166.54998779296875, 246.0), 'items': [('re', Rect(141.22999572753906, 232.20001220703125, 166.54998779296875, 246.0))], 'opacity': 1.0}
{'color': None, 'fill': [0.0], 'width': 1.0, 'lineJoin': 0, 'lineCap': (0, 0, 0), 'dashes': '[] 0', 'closePath': False, 'even_odd': False, 'rect': Rect(273.3500061035156, 315.0, 536.8599853515625, 328.79998779296875), 'items': [('re', Rect(273.3500061035156, 315.0, 536.8599853515625, 328.79998779296875))], 'opacity': 1.0}
{'color': None, 'fill': [0.0], 'width': 1.0, 'lineJoin': 0, 'lineCap': (0, 0, 0), 'dashes': '[] 0', 'closePath': False, 'even_odd': False, 'rect': Rect(412.54998779296875, 480.6099853515625, 437.8699951171875, 494.39996337890625), 'items': [('re', Rect(412.54998779296875, 480.6099853515625, 437.8699951171875, 494.39996337890625))], 'opacity': 1.0}

testa.pdf
testb.pdf

I am befuddled. Many thanks,

Eugene

JorjMcKie · 2021-08-12T15:36:57Z

JorjMcKie
Aug 12, 2021
Maintainer

No worries at all! Didn't I write, you should not hesitate asking for help ...

Your example is indeed weird. It is a rectangle overlaid with another one (a so-called "clip" rectangle) with a special spec that wipes out what is underneath.
Don't ask me why the original rect was specified at all in the first place.
Anyway, "clip" rectangles are currently not returned by page.get_drawings(). And if they were, it would be quite a hassle to conclude from their specs that another rect's color doesn't count ...
So I currently can only recommend using heuristics to exclude rectangles that cannot make sense. In this special case, the rectangle has a much greater height than width and covers several lines, as opposed to what you typically have in testb.pdf: a rectangle covering some part of one line, with a width:height ratio larger than 1.

It would probably also make sense to:

1. restrict the set of paths in page.get_drawings() to those which are rectangles in the first place:
   - path["items"] has length 1
   - the draw command in that only item is "re"
2. disregard paths with min(path["rect"].width, path["rect"].height) <= threshold, e.g. threshold = 3
3. potentially disregard rectangles with a ratio width : height (much) smaller than 1
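
These three checks could be combined into one filter, sketched below. The function name and the min_aspect parameter are hypothetical. The ratio test is read here as discarding rectangles that are much taller than wide, since that is how the mystery rectangle in testa.pdf differs from the redaction boxes in testb.pdf; treat that reading as an assumption.

```python
def is_candidate_redaction(path, threshold=3.0, min_aspect=1.0):
    """Apply the three heuristics to one entry of page.get_drawings()."""
    items = path.get("items", [])
    # 1. Keep only paths made of exactly one "re" (rectangle) command.
    if len(items) != 1 or items[0][0] != "re":
        return False
    x0, y0, x1, y1 = path["rect"]
    width, height = abs(x1 - x0), abs(y1 - y0)
    # 2. Drop hairlines such as underlines and vertical rules.
    if min(width, height) <= threshold:
        return False
    # 3. Drop tall, narrow boxes that span several lines of text.
    return width / height >= min_aspect
```

Tried on values from the listings above, the tall rectangle in testa.pdf and the 1.2-point-high underline in testb.pdf are rejected, while the 25 x 14 box in testb.pdf is kept.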

0 replies

eugenevolokh · 2021-08-12T16:54:10Z

eugenevolokh
Aug 12, 2021
Author

Very good advice, thanks! By the way, since we’re corresponding, I thought I’d pass along a PDF that I found that yields an error when I do a get_drawings() on its one page. It’s not a problem for my program; I just catch the exception and skip the file (such files seem to be quite rare in my dataset). But I thought I’d pass it along, in case you’re interested. Here’s the error I get:

Traceback (most recent call last):
  File "Q:\e\redact\showdrawings.py", line 11, in <module>
    printdrawings ("testc.pdf")
  File "Q:\e\redact\showdrawings.py", line 7, in printdrawings
    for drawing in page.get_drawings():
  File "C:\Users\Eugene\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\fitz\fitz.py", line 6006, in get_drawings
    path["rect"] = Rect(x0, y0, x1, y1)
UnboundLocalError: local variable 'x0' referenced before assignment

This happens when running this program:

import fitz

def printdrawings (file):
    print (">>>", file)
    doc = fitz.open(file)
    for page in doc:
        for drawing in page.get_drawings():
            print (drawing)
    doc.close()

printdrawings ("testc.pdf")

Eugene

1 reply

JorjMcKie Aug 12, 2021
Maintainer

This is a bug in PyMuPDF, detected just a few days ago.
It only happens with documents containing paths without any draw commands.
Those paths make no sense of course, but my logic should be immune against it.

I am working on a new version, where this is fixed and at the same time that method page.get_drawings() will be very much faster ...
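
Until a fixed version is installed, the catch-and-skip workaround mentioned in the question could look something like the sketch below. The function name is hypothetical, and catching every Exception is a deliberately broad assumption so that one bad page does not stop a batch run over thousands of files.

```python
def drawings_or_none(page):
    """Return page.get_drawings(), or None if the call raises."""
    try:
        return page.get_drawings()
    except Exception as exc:  # e.g. the UnboundLocalError reported above
        print(f"skipping page: {exc!r}")
        return None
```

A caller would then test the result for None and move on to the next page or file.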

eugenevolokh · 2021-08-12T17:34:39Z

eugenevolokh
Aug 12, 2021
Author

Aha, thanks very much!

0 replies

eugenevolokh · 2021-08-12T21:40:02Z

eugenevolokh
Aug 12, 2021
Author

Since you’re rolling out a new version, I thought I might note another glitch – when I run this Python program on the attached file,

import fitz
doc = fitz.open("testd.pdf")
for page in doc:
    pass
doc.close()

I get the following:

mupdf: kid not found in parent's kids array
[the same message repeats many times]

I also get this on a few other files from the dataset of several thousand that I’ve downloaded.

Thanks,

Eugene

0 replies

eugenevolokh · 2021-08-13T00:25:27Z

eugenevolokh
Aug 13, 2021
Author

Running this through:

import fitz
doc = fitz.open("teste.pdf")
for page in doc:
    pass
doc.close()

yields:

mupdf: cannot find page 0 in page tree

1 reply

JorjMcKie Aug 13, 2021
Maintainer

Messages like mupdf: cannot find page 0 in page tree and mupdf: kid not found in parent's kids array indicate PDFs with a damaged internal file structure.
I know they look frightening, but MuPDF is more often than not able to overcome these things by e.g. re-scanning (parts of) the file and then rebuilding damaged information.
Apart from sloppy or buggy PDF creation processes, incomplete downloads are often the cause of these problems.
There is little to nothing you can do otherwise about it.
You may want to save such PDFs with the maximum garbage collection level 4 and hope for the best. You could also create a completely new PDF and then insert all the old file's pages:

import fitz

old = fitz.open("damaged.pdf")
new = fitz.open()  # new, empty PDF
new.insert_pdf(old)  # copy all pages of the old file
new.set_metadata(old.metadata)  # copy over old metadata
new.set_toc(old.get_toc(True))  # copy over old table of contents

The new PDF would look like the old one. But it is a page-wise process: any old PDF content not referenced by its pages will therefore not be copied. I have explicitly included copying two examples of such content ... but there may be more.

Best would probably be just looking away 😎.

eugenevolokh · 2021-08-13T15:45:17Z

eugenevolokh
Aug 13, 2021
Author

Very helpful, thanks!

0 replies
