Want to change the colour of a Hindi vowel diacritic #3175

aleem75321 · 2024-02-18T07:27:47Z

aleem75321
Feb 18, 2024

Hi,

I want to change the colour of a Hindi vowel diacritic. I read Discussion #1532 and its approach mostly works for me. The only issue is that it does not remove the old diacritic: the new one is simply drawn on top of the old one.

My target is the vowel diacritic "ं". I only want to change its colour.
I tried both the dict and rawdict methods; neither worked.

I also tried searching first and then changing the colour, which did not work either. The search returned all diacritics on the page.
doc=fitz.open("test_pages/11022024_mtm_mp_03_1_col_r1.pdf")
page=doc[0]

search=page.search_for("ं")
print(search)

output:
[Rect(42.94731521606445, 127.79308319091797, 42.94731521606445, 141.1910858154297), Rect(501.72882080078125, 124.0313720703125, 501.72882080078125, 147.93136596679688), Rect(172.7761993408203, 306.5633544921875, 172.7761993408203, 371.8208312988281),....]

But extracting text from those rectangles failed and returned empty results:
for rect in search:  # loop because there are many results
    blocks = page.get_text("dict", clip=rect)["blocks"]
    print(blocks)
[]
[]
[]
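
Editor's note: the first two rectangles in the printed search output have identical x0 and x1 values, so they have zero width, which may be why clipping to them returns no blocks. Below is a minimal sketch of one possible workaround, assuming that a zero-width clip is indeed the cause. The helper name widen_rect and the 2-point default are hypothetical and not part of PyMuPDF.

```python
def widen_rect(rect, min_width=2.0):
    """Return (x0, y0, x1, y1), widened symmetrically to at least min_width."""
    x0, y0, x1, y1 = rect
    if x1 - x0 < min_width:
        # Grow the rectangle equally to the left and right of its center.
        center = (x0 + x1) / 2
        x0 = center - min_width / 2
        x1 = center + min_width / 2
    return (x0, y0, x1, y1)


# A zero-width hit, like the ones in the search output, becomes 2 points wide.
print(widen_rect((10, 0, 10, 5)))
```

With PyMuPDF, the widened tuple could then be used as the clip, for example page.get_text("dict", clip=fitz.Rect(widen_rect(hit))). A wider clip may also pick up neighbouring characters, so the extracted spans would still need filtering.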

Method 2: without searching. This finds the diacritic and writes it in the new colour, but it does not remove the old one.

import fitz
import pandas as pd
from pathlib import Path
import numpy as np

doc = fitz.open("test_pages/11022024_mtm_mp_03_1_col_r1.pdf")
page = doc[0]
black = fitz.pdfcolor["black"]
blocks = page.get_text("rawdict", flags=11)["blocks"]
for block in blocks:
    for lines in block["lines"]:
        for span in lines["spans"]:
            for char in span["chars"]:
                if char["c"] == "ं" and span["color"] == 15539236:
                    x0, y0, x1, y1 = span["bbox"]  # Extract exact bounding box coordinates
                    rect = fitz.Rect(x0, y0, x1, y1)
                    print(char)
                    # Remove original text
                    # re-insert same text - different color
                    font = fitz.Font(fontname=span['font'], fontfile=f"{span['font']}.ttf")  # this must be known somehow - or simply try some font else
                    page.add_redact_annot(rect, fontname=font, align=char["origin"])
                    tw = fitz.TextWriter(page.rect, color=black)
                    tw.append(char["origin"], text=char["c"], font=font, fontsize=span['size'])
                    tw.write_text(page)

# Apply redactions after all text replacements
page.apply_redactions()

# Saving Option
out_filename = Path('out') / "11022024_mtm_mp_03_2_col_r2.pdf"

# save to a new PDF
doc.ez_save(out_filename)

original PDF
11022024_mtm_mp_01_1_col_r1.pdf

after PDF
11022024_mtm_mp_03_2_col_r2.pdf

JorjMcKie · 2024-02-18T08:14:45Z

JorjMcKie
Feb 18, 2024
Maintainer

We have noticed this already and submitted a bug report to MuPDF, please see here.

1 reply

aleem75321 Feb 18, 2024
Author

Thanks for your quick response. My issue is different. I just want to change the colour of this diacritic. I can do that, but I am unable to remove the old one. As you can see in the screenshot below, the black diacritic is showing, but the old red one is still visible behind it.

JorjMcKie · 2024-02-18T08:28:47Z

JorjMcKie
Feb 18, 2024
Maintainer

Your problem does go back to that issue, I think:
Redaction logic does not completely remove text in some complex script systems like Devanagari. Artifacts occasionally remain, so the redaction rectangle is not completely blanked out.

8 replies

aleem75321 Feb 18, 2024
Author

Sounds good. Can you give me some simple code for filling the redaction rectangle with white?

JorjMcKie Feb 18, 2024
Maintainer

Before writing the new text, do page.draw_rect(rect, fill=fitz.pdfcolor["white"]). Actually, doing page.add_redact_annot(rect, fill=fitz.pdfcolor["white"]) should also work.
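
Editor's note: a minimal sketch of the first suggestion above, written against any object that offers PyMuPDF's page.draw_rect(rect, color=..., fill=...) method. The helper name blank_out is hypothetical. Passing the same white value for the border and the fill is an editorial choice, made so that no dark outline is left around the box.

```python
WHITE = (1, 1, 1)  # RGB white


def blank_out(page, rects, color=WHITE):
    """Cover each rectangle on the page with an opaque filled box."""
    for rect in rects:
        # Border and fill share one colour so no outline stays visible.
        page.draw_rect(rect, color=color, fill=color)
    return len(rects)
```

A typical call would be blank_out(page, [rect]) just before the replacement text is written. Drawing a box only hides the old glyph. It stays in the page's content, so search and text extraction will still find it.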

aleem75321 Feb 20, 2024
Author

@JorjMcKie First, a big thanks for your response.

As you suggested, I tried insert_htmlbox, but I ran into a problem with the bbox: there is a huge difference between the position of the inserted text and the position of the real text, even though I used the same font that is embedded in the PDF.

As you can see, I need to change the colour of "ं" in the headline and sub-headline, not in the body, so I filter by font size.

doc = fitz.open("test_pages/CCI_PAGE.pdf")
page = doc[0]
arch = fitz.Archive(r"C:\Git_Repository\dot_replacement")
black = fitz.pdfcolor["red"]
blocks = page.get_text("rawdict", flags=fitz.TEXTFLAGS_TEXT)["blocks"]
for b in blocks:
    for l in b["lines"]:
        for s in l["spans"]:
            if s["size"] > 12 and s['color'] == 2236191:
                for c in s["chars"]:
                    if c["c"] == "ं":
                        # Extract exact bounding box coordinates
                        origin = fitz.Point(c["origin"])
                        bbox = fitz.Rect(c["bbox"])
                        # print(c)
                        page.clean_contents()
                        # page.wrap_contents()
                        # Remove original text
                        # re-insert same text - different color
                        font = fitz.Font(fontname=s['font'], fontfile=f"{s['font']}.ttf")  # this must be known somehow - or simply try some font else
                        page.add_redact_annot(c["bbox"])
                        x0, y0, x1, y1 = c["bbox"]
                        x1 = x0 + 10  # x0 and x1 come back identical, which leads to an error, hence this small change to the coordinates
                        css = f"""@font-face {{font-family: {s['font']}; src: url({s['font']}.ttf);}}* {{font-family: {s['font']};font-size:5px;color:red;}}"""
                        page.insert_htmlbox((x0, y0, x1, y1), c['c'], css=css, archive=arch)
                        # tw = fitz.TextWriter(page.rect)
                        # tw.append(origin, text=c["c"], font=font,fontsize=s['size'],language="mr")
                        # tw.write_text(page,color=black)

# Apply redactions after all text replacements
page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)

# Saving Option
out_filename = Path('out') / "CCI_PAGE.pdf"

# save to a new PDF
doc.ez_save(out_filename, garbage=4, deflate=True)

CCI_page.pdf

JorjMcKie Feb 20, 2024
Maintainer

Yeah, I feared that we were talking past each other.
The only thing that might help is: (1) delete the complete old text via redaction, (2) if artifacts remain (as discussed earlier), draw a white rectangle over the redact area, (3) rewrite the complete text (!) with the desired different colors. This should be possible on a by-character level thanks to HTML.
There is no way to change the color of a single character by rewriting just it. An impossible task, especially for these East-Asian fonts!
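
Editor's note: a minimal sketch of how the HTML for step (3) could be built, colouring one character inside a complete run of text. The helper name colorize_char is hypothetical. Whether MuPDF's HTML layout keeps a combining mark correctly attached to its base letter when the mark sits in its own span is not confirmed anywhere in this thread, so treat that as an assumption to verify on real output.

```python
import html


def colorize_char(text, target, color):
    """Wrap every occurrence of `target` in a coloured span; escape the rest."""
    parts = []
    for ch in text:
        escaped = html.escape(ch)
        if ch == target:
            parts.append(f'<span style="color:{color}">{escaped}</span>')
        else:
            parts.append(escaped)
    return "".join(parts)


# Build markup for a whole word, with only the diacritic recoloured.
print(colorize_char("अंक", "ं", "black"))
```

The resulting string could be passed as the text argument of page.insert_htmlbox(rect, markup, css=css, archive=arch), with rect covering the whole redacted line rather than a single character.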

aleem75321 Feb 20, 2024
Author

Sir, I am unable to use redaction because the white rectangle is not being drawn over the redact area. If that worked, my issue would be resolved.

Can you help me draw the white rectangle?

JorjMcKie · 2024-02-20T18:35:04Z

JorjMcKie
Feb 20, 2024
Maintainer

Here you are:
test.zip

2 replies

aleem75321 Feb 20, 2024
Author

Sir, many thanks for the quick reply.

I need to draw a rectangle at the position I marked in the image, in between the characters.

aleem75321 Feb 20, 2024
Author

I think I can experiment with the ascender and descender, as you suggested.

aleem75321 · 2024-04-03T05:00:28Z

aleem75321
Apr 3, 2024
Author

@JorjMcKie Hi sir. In our previous discussion you mentioned that you had already reported a bug: https://bugs.ghostscript.com/show_bug.cgi?id=707560. I have been tracking that bug. It is now resolved, but I am still unable to redact correctly.

7 replies

JorjMcKie Apr 4, 2024
Maintainer

I wish you would provide data usable for reproducing the problem. A picture alone tells me close to nothing.

I also do not understand your comment: do you mean you wanted to redact the yellow-circled stuff, but the pink one also got removed?

aleem75321 Apr 4, 2024
Author

I'm very sorry for my previous comment. Please accept my apologies.

Test files: I have added 2 test files.
test.pdf

test1.pdf

Issue: when I delete a word, some other words are also deleted along with it.
The yellow circle in the photo marks the word I needed to delete. The purple circle marks a word that was deleted but should not have been.

Code:-

# Iterate over each page of the document
for page in doc:
    # Find all instances of "dot" on the current page
    instances = page.search_for("ं")

    # Redact each instance of "dot" on the current page
    for inst in instances:
        text = page.get_textpage(clip=inst)
        # print(text.extractWORDS())
        page.add_redact_annot(inst, fill='red')

    # Apply the redactions to the current page
    page.apply_redactions()

# Save the modified document
doc.save('out/redacted_document.pdf')
doc.close()

JorjMcKie Apr 4, 2024
Maintainer

That's OK, never mind.

For mostly legal reasons, redactions will remove any character that intersects the redact rectangle. This is what is happening here.
It is a completely different problem than before.
You have a number of choices:

(1) Before searching, execute fitz.TOOLS.set_small_glyph_heights(True). This will reduce the height of a character bbox to (roughly) the visible part.
(2) If that is insufficient, you can fall back to a horizontal stripe of the search hit rectangle, e.g. 20% of the original height, located around the middle line.
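
Editor's note: a minimal sketch of the horizontal-stripe fallback described above. The helper names middle_stripe and redact_stripes are hypothetical. The page argument is assumed to be a PyMuPDF page, and only its search_for, add_redact_annot and apply_redactions methods are used.

```python
def middle_stripe(rect, fraction=0.2):
    """Return a horizontal stripe of `rect`, centered on its middle line."""
    x0, y0, x1, y1 = rect
    mid_y = (y0 + y1) / 2
    half = (y1 - y0) * fraction / 2  # half of the stripe height
    return (x0, mid_y - half, x1, mid_y + half)


def redact_stripes(page, needle, fraction=0.2):
    """Redact a thin stripe of every search hit for `needle` on the page."""
    hits = page.search_for(needle)
    for hit in hits:
        # A shorter rectangle intersects fewer neighbouring character boxes.
        page.add_redact_annot(middle_stripe(hit, fraction))
    page.apply_redactions()
    return len(hits)
```

For example, redact_stripes(page, "ं") would redact a stripe that is 20% of each hit's height. Because redaction removes every character whose bbox intersects the stripe, the fraction and the vertical position may still need tuning per document.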

aleem75321 Apr 7, 2024
Author

fitz.TOOLS.set_small_glyph_heights(True) does not work for me. Can you show me an example of point 2: how do I get 20% of the original height, located around the middle line?

aleem75321 Apr 7, 2024
Author

@JorjMcKie As you know, I am trying to change the colour of the dots in the headline and sub-headline. I use the rawdict method, which gives accurate information, and then write a new dot in the desired colour. As you can see in the image, some dots do not overlap properly, so I tried to redact them, but as you know I face issues with that as well.

My task is only to change the colour of the dots in the PDF, by any means.

If you have any suggestions, please share them.

#Code
import fitz
from pathlib import Path

file_path = Path(r"test_pages/test1.pdf")
doc = fitz.open(file_path)
page = doc[0]
# Set colour for output PDF
black = fitz.pdfcolor["red"]
page.clean_contents()

# Extracting text from PDF
fitz.TOOLS.set_small_glyph_heights(True)
blocks = page.get_text("rawdict", flags=fitz.TEXTFLAGS_TEXT, sort=True)["blocks"]
count = 0
for b in blocks:
    for l in b["lines"]:
        for s in l["spans"]:
            if s["size"] > 15 and s['color'] == 2236191:
                for c in s["chars"]:
                    if c["c"] == "ं":
                        count = count + 1

                        # Extract exact bounding box coordinates
                        a = s["ascender"]
                        d = s["descender"]

                        x1, y1, x2, y2 = c["bbox"]

                        # page.add_redact_annot(c['bbox'],fill='red')

                        o = fitz.Point(c["origin"])  # its y-value is the baseline
                        # Calculating font size
                        font_size = ((o.y - s["size"] * d / (a - d)) + a) - (o.y - s["size"] * d / (a - d) - s["size"] + d)
                        font_size = round(font_size, 0)

                        # re-insert same text - different color
                        font = fitz.Font(fontname=s['font'], fontfile=f"{s['font']}.ttf")  # this must be known somehow - or simply try some font else

                        tw = fitz.TextWriter(page.rect)
                        tw.append(c["origin"], text=c["c"], font=font, fontsize=font_size, language="mr")
                        tw.write_text(page, color=black)

# Apply redactions after all text replacements
page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)

# Saving backup file for future use
out_fpath = "OUT/" + file_path.stem + ".pdf"
doc.save(out_fpath, garbage=3, deflate=True)

#Images

#PDF
test1.pdf
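
Editor's note: the font_size expression in the code above can be simplified by hand. Both halves contain the same term o.y - size * d / (a - d), which cancels, leaving size + a - d. With a positive ascender and a negative descender the result is larger than the span's own size, which may contribute to the rewritten dots not lining up with the originals. Passing s["size"] unchanged is one thing to try. The sketch below only checks that arithmetic. The function names and the sample numbers are made up for illustration.

```python
def posted_font_size(origin_y, size, ascender, descender):
    """The font-size expression from the post, written out step by step."""
    a, d = ascender, descender
    base = origin_y - size * d / (a - d)  # term shared by both halves
    return (base + a) - (base - size + d)


def simplified_font_size(size, ascender, descender):
    """Algebraically equivalent form: the shared term cancels out."""
    return size + ascender - descender


# Made-up sample values, chosen to be exact in binary floating point.
print(posted_font_size(100.0, 16.0, 0.75, -0.25))
print(simplified_font_size(16.0, 0.75, -0.25))
```

Both calls give the same number for these inputs, and it is one point larger than the span size of 16 because a - d equals 1 here.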
