Some text removed from pdf file after apply redaction #2459
ashifaliclientpoint asked this question in Q&A (Unanswered)
Replies: 0 comments
Hello,
I am using PyMuPDF (1.19.6) to search for a string in a PDF file and then redact it. After the redaction is applied, some text that was not part of the match is also missing from the output PDF.
Example:
In DRAFT_Executive.pdf, the string [ i:b:o ] appears on page 5, right after "A Transparent Partnership."
After redaction, the text on the next line, "of integrity, and our success", is removed as well. I have attached two files, one with redaction and one without, so you can compare them and reproduce the issue.
Please reproduce this issue and suggest a solution.
I am also attaching the original file and the converted file.
DRAFT_Executive.pdf
highlighted_file.pdf
highlighted_file-without-redaction.pdf
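To make the tag format concrete, here is a minimal sketch that checks a pattern for tags such as [ i:b:o ] using only the standard library. The sample text is hypothetical and merely modelled on the page 5 passage described above; it is not extracted from the attached PDF. The literal square brackets are escaped because they are regex metacharacters.

```python
import re

# Literal "[" and "]" must be escaped; the three groups capture the tag parts.
TAG_PATTERN = re.compile(r"\[\s*([scdit]):([a-z]):([or])\s*\]", re.IGNORECASE)

# Hypothetical text modelled on the example in this post (not read from the PDF).
sample = "A Transparent Partnership. [ i:b:o ]\nof integrity, and our success"

# Collect the full matched tags and their captured parts.
tags = [m.group() for m in TAG_PATTERN.finditer(sample)]
parts = [m.groups() for m in TAG_PATTERN.finditer(sample)]

print(tags)
print(parts)
```

On this sample the pattern matches only the bracketed tag, so at the text level the following line is not part of the match.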
Steps to reproduce
import re
import fitz
import sys, json

file_path = "DRAFT_Executive.pdf"
# Literal square brackets must be escaped in the pattern.
pattern = r'\[\s*([scdit]):([a-z]):([or])\s*\]'  # Replace with your desired regex pattern
doc = fitz.open(file_path)
resultOutput = []
tagsPerPage = {}
addedTags = set()

# Pass 1: find every tag on every page and look up its rectangles.
for page in doc:
    text = page.get_text()
    tagsPerPage[page.number] = []
    matches = re.finditer(pattern, text, re.IGNORECASE | re.MULTILINE | re.DOTALL)
    if matches:
        for match in matches:
            start, end = match.span()
            coordinates = page.search_for(match.group())
            tempDict = {}
            firstCoordStr = ""
            singleTagArr = []
            needleStarted = 0
            for rect in coordinates:
                x1, y2, x2, y1 = rect

# Pass 2: convert the stored y values to bottom-left-origin coordinates.
if tagsPerPage:
    for page in doc:
        if tagsPerPage[page.number]:
            for item in tagsPerPage[page.number]:
                currPage = page.number + 1
                if item['page'] == currPage:
                    y1 = page.rect.height - item['y1']
                    y2 = page.rect.height - item['y2']

doc.save("highlighted_file.pdf", garbage=3, deflate=True)
doc.close()
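The snippet above does not show the redaction step itself, so the following is only a sketch of how it could look, not the code that produced the attached files. It rests on an unconfirmed assumption: that the rectangle returned by `search_for` is tall enough to touch the neighbouring line, so applying the redaction also removes characters there. Under that assumption, shrinking each rectangle vertically before calling `add_redact_annot` may keep the adjacent text. The helper names `shrink_vertically` and `redact_text` and the `ratio` default are hypothetical.

```python
def shrink_vertically(rect, ratio=0.15):
    """Return (x0, y0, x1, y1) with top and bottom moved inward by ratio * height."""
    x0, y0, x1, y1 = rect  # PyMuPDF rectangles unpack as x0, y0, x1, y1
    dy = (y1 - y0) * ratio
    return (x0, y0 + dy, x1, y1 - dy)


def redact_text(page, needle, ratio=0.15):
    """Redact every occurrence of needle on one page; return the number of hits."""
    hits = page.search_for(needle)
    for rect in hits:
        # Mark a slightly smaller area so neighbouring lines are less likely to overlap.
        page.add_redact_annot(shrink_vertically(rect, ratio))
    if hits:
        page.apply_redactions()
    return len(hits)


# Possible usage with PyMuPDF installed (file names taken from this post):
#   import fitz
#   doc = fitz.open("DRAFT_Executive.pdf")
#   for page in doc:
#       redact_text(page, "[ i:b:o ]")
#   doc.save("highlighted_file.pdf", garbage=3, deflate=True)
#   doc.close()
```

If the neighbouring line still disappears with a larger `ratio`, the overlap assumption is probably wrong for this file and the cause lies elsewhere.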
Configuration
OS: Ubuntu
Python: 3.8
PyMuPDF: 1.19.6