Looking for font supporting Nepali: IMPLEMENTED #398

arjunpaudyal · 2019-11-10T02:58:01Z

arjunpaudyal
Nov 10, 2019

a = "बाह्यसम्पर्कतन्तु, आर्थिककेन्द्रत्वेन वर्तते" # sanskrit
aa = u"बाह्यसम्पर्कतन्तु, आर्थिककेन्द्रत्वेन वर्तते" # sanskrit
ab = "बाह्यसम्पर्कतन्तु, आर्थिककेन्द्रत्वेन वर्तते".encode('utf8') # sanskrit
b = "normal ascii" # normal ascii
c = "sòmé lâtîn!" # latin-1
e = "Euro sign €" # ascii and 1 unicode char
f = "测试" # Chinese
g = "Фёдор Михайлович Достоевский" # Russian

import fitz
doc = fitz.open()
doc.insertPage(-1) # insert first page with default settings
p = doc.loadPage(-1)

x0,y0 = 30, 50
fsize = 6

p.insertText(fitz.Point(x0, y0+10), a, fontsize = fsize) # tried aa, ab, does not work

p.insertText(fitz.Point(x0, y0+20), b, fontsize = fsize)
p.insertText(fitz.Point(x0, y0+30), c, fontsize = fsize) # this is the only thing that works

p.insertText(fitz.Point(x0, y0+50), e, fontsize = fsize)
p.insertText(fitz.Point(x0, y0+60), f, fontsize = fsize)
p.insertText(fitz.Point(x0, y0+70), g, fontsize = fsize)

m = {"author": b, "producer": c, "subject": e, "title": a, "creator": g, "keywords": a} # Code from documentation example, works on Metadata and TOC, No Issues

doc.setMetadata(m)

toc = [[1, a, 1],
[1, b, 1],
[1, c, 1],
[1, e, 1],
[1, f, 1],
[1, g, 1],
]

doc.setToC(toc)

doc.save("TTTT.pdf")

I expected the variables a, b, c, e, f, g to be written to the page, but they are not. Your code for setting the metadata and the TOC works as expected.

Please help me.

My OS is Windows 10, with Python 3.6 and PyMuPDF 1.16.


arjunpaudyal · 2019-11-10T02:59:51Z

arjunpaudyal
Nov 10, 2019
Author

TTTT.pdf
This is the output of the above code. I expected the variables to be written to the PDF. Please let me know how to do it.

0 replies

JorjMcKie · 2019-11-10T09:14:40Z

JorjMcKie
Nov 10, 2019
Maintainer

In page.insertText(...) and page.insertTextbox(...) use a font which supports the characters used in the text. The Base14 fonts ("Helvetica", "Times-Roman", "Courier") only support character codes less than 256.
On Windows you probably have installed a lot of fonts for other character sets like for Chinese, Japanese, Korean or Sanskrit. You must find out yourself which one you need.
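
A quick way to see which strings run into the limit described above is to check their code points against 256. This is only a minimal sketch: the helper name `fits_base14` is made up for illustration and is not part of PyMuPDF.

```python
def fits_base14(text):
    """Return True if every character has a code point below 256."""
    return all(ord(ch) < 256 for ch in text)


samples = {
    "ascii": "normal ascii",
    "latin-1": "sòmé lâtîn!",
    "nepali": "नि:शुल्क",
    "russian": "Фёдор",
}

for label, text in samples.items():
    # False means a font with a wider character range is needed
    print(label, fits_base14(text))
```

Strings for which the check fails need a font file that covers their script, passed to the text insertion methods.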

Here is an example snippet which puts some text in Hindi in a PDF:

devanagari-text.zip

0 replies

arjunpaudyal · 2019-11-10T18:22:08Z

arjunpaudyal
Nov 10, 2019
Author

Thank you JorjMcKie.

It solved some of the problems but raised new ones. The text is not rendered properly in the generated PDF. In the attached image, I want the PDF to show the line marked "Expected", not the second line.

Oddly, when I copy the text from the generated PDF (your attachment) into a UTF-8 capable text editor (Windows Notepad and Notepad++), it appears as expected. I also extracted the text from the PDF with text = doc.getPageText(0) (doc being our document with the inserted Devanagari Unicode text), and Python IDLE prints it as expected.

Thank you.

0 replies

JorjMcKie · 2019-11-10T18:40:02Z

JorjMcKie
Nov 10, 2019
Maintainer

Please post the script you are working with here.
Well, maybe the font I used has an issue. It was only meant to demonstrate how things work.
The bottom line is that you need to find a font which works for you.

0 replies

JorjMcKie · 2019-11-10T19:45:10Z

JorjMcKie
Nov 10, 2019
Maintainer

The fact that page.getText() correctly prints the original text shows that it is correctly encoded in the PDF.
But the translation to glyphs (the graphical appearance of each character) is incorrect. So we are still using the wrong font. The font is responsible for exactly this: translating each character code (ord(char)) into a graphical representation, called the "glyph".
The same thing happens if you insert "Hello, world!" using the ZapfDingbats font, for example.

0 replies

arjunpaudyal · 2019-11-10T20:51:45Z

arjunpaudyal
Nov 10, 2019
Author

SampleWithFonts.zip

I tested with multiple fonts, some with and some without a Devanagari block. Despite the page rendering issues, Unicode text in the TOC works as expected, and so does the metadata. The attached file is nearly 2 MB because it includes all the fonts.

import fitz

fontfiles = [
    ("Devnagari", "./includes/DEVAB_.TTF", 150, 100),       # No Unicode Devnagari block
    ("Mangal", "./includes/mangal.ttf", 150, 130),          # Yes
    ("Sanskrit", "./includes/Sanskr.ttf", 150, 160),        # yes
    ("Utsaah", "./includes/utsaah.ttf", 150, 190),          # Yes
    ("Himalaya", "./includes/himalaya.ttf", 150, 220),      # No Devnagari block
    ("Aparajita", "./includes/aparaj.ttf", 150, 250),       # Yes
    ]

doc = fitz.open()  # new PDF
fsize = 10  # fontsize

engText = "Thank you JoriMcKie for free knowledge."
nepText = "नि:शुल्क ज्ञानको लागी JoriMcKie लाई धन्यबाद |"

page = doc.newPage()

page.insertText(fitz.Point(150, 70), engText)  # uses the default Base14 font

for finfo in fontfiles:
    fname, ffile, x0, y0 = finfo   # unpack font name, font file and position
    fsize = 10                     # define font size
    point = fitz.Point(x0, y0)     # define insertion point

    # insert the font on the page
    page.insertFont(fontname=fname, fontfile=ffile)
    # TODO: test whether the actual font name or the coder-specified name makes a difference

    # insert the font name
    page.insertText(fitz.Point(60, y0), fname)

    # insert the Unicode text
    page.insertText(point, nepText, fontname=fname, fontsize=fsize)
    
# insert actual image in pdf for comparison
rect = fitz.Rect(150, y0+30, 50+363, y0+30+49) # I know the dimension of image
pix = fitz.Pixmap("./includes/Expected.png")
page.insertImage(rect, pixmap=pix, overlay=True)

# Rendering in Page Failed, try with page Title
m = {"title": nepText,}
doc.setMetadata(m) # renders good

# Try to set in Page TOC:
TOC = [[1, nepText, 1]]
doc.setToC(TOC)

"""
Problem:
    page.insertText did not work as expected. Rendering is not proper.

Contrary:
    setMetadata and setToC work fine as expected.

FunFact:
    Text copied from the improperly rendered output is rendered properly in:
        IDLE display,
        Text Editors,
        Chrome Browser

TODO:
    Check on Raspbian
"""



doc.save(__file__ + ".pdf", garbage=3, deflate=True)

0 replies

JorjMcKie · 2019-11-10T21:18:47Z

JorjMcKie
Nov 10, 2019
Maintainer

The rendering of metadata and TOC entries works, because of the PDF-internal mechanism.
So no surprise there.
I am hesitant to download the big PDF: did you find any font at all which works satisfactorily?

0 replies

arjunpaudyal · 2019-11-13T00:12:08Z

arjunpaudyal
Nov 13, 2019
Author

@JorjMcKie
I have run a couple of tests to figure out where the problem is. I did not succeed, but the results might help to narrow down the scope of the problem. Mostly it is Windows based.

Test Text = नि:शुल्क ज्ञानको लागी JoriMcKie लाई धन्यबाद |

Print to PDF from Notepad using Microsoft Print to PDF.
Use LibreOffice Writer to generate a PDF (Export to PDF).

In both cases it worked fine, with the font embedded. (Microsoft Print to PDF embedded a CID+F1 font, while LibreOffice embedded the actual font: Aparajita, Mangal and so on.)

I can confirm that - it is not the Font issue.

Encoding:
PDF generated from PyMuPDF is encoded as Identity-H. I

I need to test something related to Encoding.

0 replies

JorjMcKie · 2019-11-13T11:03:22Z

JorjMcKie
Nov 13, 2019
Maintainer

Ah okay, I see. Microsoft also uses Identity-H, while Libre doesn't, as you write.
It is an issue with (Py-) MuPDF's font handling / support then.
As I wrote: it cannot be a unicode support problem, because you can extract the original text in its correct, original form.

I am afraid I am out of advice here now. Especially because not all glyphs are wrong - just some, right? What I am doing in PyMuPDF is converting a unicode character to a 2 or 4 byte hexadecimal string. Which integer to take for this conversion depends on the font characteristics which I get from MuPDF. So I suspect the issue is inside MuPDF ☹.

0 replies

JorjMcKie · 2019-11-13T14:45:03Z

JorjMcKie
Nov 13, 2019
Maintainer

When comparing the PDF output of LibreOffice, Word and PyMuPDF, I can see that the office software somehow knows when to combine two or more consecutive characters into one joint glyph. If you look at the /ToUnicode object for the fonts there, you will find things like (Libre example):

16 beginbfchar
<01> <0928093F>  % 2 unicodes form one glyph, addressed as 0x01
<03> <0936>
<04> <0941>
<05> <0932094D0915>  % 3 unicodes form one glyph, addressed as 0x05
<06> <091C094D091E>  % 3 unicodes form one glyph, addressed as 0x06
<07> <093E>
<08> <0915>
<09> <094B>
<0A> <0932>
<0B> <0917>
<0C> <0940>
<0D> <0908>
<0E> <0927>
<0F> <0928094D092F>  % 3 unicodes form one glyph, addressed as 0x0F
<10> <092C>
<11> <0926>
endbfchar

Doing this kind of thing requires knowledge of the underlying language of course.
In contrast, PyMuPDF translates each single character of a text string separately to a glyph ... and will consequently fail to produce the right appearance in some cases.
As I said: I have currently no idea how to address this issue, and it may well be that there will never be a solution I am afraid.
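
To make the /ToUnicode listing above easier to read, here is a small sketch that decodes such bfchar lines into Python strings. It relies only on the fact that the right-hand hex strings are UTF-16BE encoded. The function name `parse_bfchar` is an assumption for illustration; it is neither a PyMuPDF nor a MuPDF API.

```python
import re

# one bfchar entry: <glyph code> <UTF-16BE hex string>
BFCHAR_LINE = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(lines):
    """Map glyph codes to the Unicode text they stand for."""
    mapping = {}
    for line in lines:
        match = BFCHAR_LINE.search(line)
        if match:
            code = int(match.group(1), 16)
            mapping[code] = bytes.fromhex(match.group(2)).decode("utf-16-be")
    return mapping


sample = [
    "<01> <0928093F>",
    "<05> <0932094D0915>",
    "<08> <0915>",
]

for code, text in parse_bfchar(sample).items():
    print(f"glyph 0x{code:02X} <- {len(text)} code point(s): {text}")
```

Entries with more than one code point are the ones where the producing software merged several characters into a single glyph.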

0 replies

JorjMcKie · 2019-11-14T07:14:10Z

JorjMcKie
Nov 14, 2019
Maintainer

A take-away from our discussion is that I should more closely look at PyMuPDF's handling of text insertion. You have put your finger on a weakness there.

In the meantime, I have found that the base library MuPDF does support this type of thing. To what extent remains to be found out.
I also found confirmation that improvements are only possible by offering a way to specify the language of a text. The font alone is not sufficient, as we have seen with your examples.

Thank you for bringing this up!

0 replies

arjunpaudyal · 2019-11-14T15:12:49Z

arjunpaudyal
Nov 14, 2019
Author

@JorjMcKie Is it somehow related to the default encoding set (encoding=0) in the text insertion ?
from FAQ: (https://pymupdf.readthedocs.io/en/latest/faq/)

The valid encoding values are TEXT_ENCODING_LATIN (0), TEXT_ENCODING_GREEK (1), and TEXT_ENCODING_CYRILLIC (2, Russian) with Latin being the default. Encoding can be specified by all relevant font and text insertion methods.

I wanted to test this by setting the encoding to Devanagari or by removing the default encoding, but I do not know how to do either.

(PS: I found that encoding values 4, 5, or even 6, which are not defined, are accepted without an error but make no difference in the PDF file.)

0 replies

JorjMcKie · 2019-11-15T19:47:37Z

JorjMcKie
Nov 15, 2019
Maintainer

The valid encoding values are ...

These flags only play a role for the 3 Base14 PDF fonts Times-Roman, Courier and Helvetica, where they create variants for encodings other than Latin.
In every other case they are ignored.

0 replies

JorjMcKie · 2019-11-19T13:50:51Z

JorjMcKie
Nov 19, 2019
Maintainer

Sorry - didn't mean to close this!

0 replies

arjunpaudyal · 2019-11-19T14:00:36Z

arjunpaudyal
Nov 19, 2019
Author

@JorjMcKie

How does (Py-)MuPDF handle ligatures in a font? I looked for this in the MuPDF documentation, but it only covers handling them during extraction.

The core problem here, as far as I can tell, is that the ligatures defined in the font are not applied, so the letters are shown as typed.

0 replies

PushpaYa · 2020-07-08T14:35:10Z

PushpaYa
Jul 8, 2020

Hi ,

Unicode to Krutidev translation is needed for the Hindi language
http://wrd.bih.nic.in/font_KtoU.htm
https://docs.microsoft.com/en-us/typography/script-development/devanagari

Thanks,
Pushpa

0 replies

JorjMcKie · 2020-07-20T15:41:58Z

JorjMcKie
Jul 20, 2020
Maintainer

Thanks for the information. I have read over it a few times, and what I finally understood is this (and maybe I am still wrong ...):

Writing Devanagari text to a PDF character by character is not sufficient. There are complex rules which determine when and how single glyphs have to be reordered in the presence of so-called "half-forms".
Someone or something must take over the task to (1) analyze the incoming text, and (2) put out a new sequence of glyphs built according to these rules.

This seems to be what the Indic shaping engine described on the Microsoft web site does.
I know of no such engine which I could use. Apart from that, I am not at all certain at which point in time such an engine would have to be called.

Do you have access to such an engine?
Is it possible for you to pre-process your text with it and then output it using PyMuPDF?

0 replies

JorjMcKie · 2020-07-22T13:46:31Z

JorjMcKie
Jul 22, 2020
Maintainer

I am experimenting with the following:
Try to develop an algorithm that reorders the characters in a Devanagari string before putting it on a PDF page. I took your string and looked at the Unicode code point of each character:

>>> text = "नि:शुल्क ज्ञानको लागी लाई धन्यबाद"
>>> for c in text:
	print(c, "=", ord(c))

	
न = 2344
ि = 2367
: = 58
श = 2358
ु = 2369
ल = 2354
् = 2381
क = 2325
  = 32
ज = 2332
् = 2381
ञ = 2334
ा = 2366
न = 2344
क = 2325
ो = 2379
  = 32
ल = 2354
ा = 2366
ग = 2327
ी = 2368
  = 32
ल = 2354
ा = 2366
ई = 2312
  = 32
ध = 2343
न = 2344
् = 2381
य = 2351
ब = 2348
ा = 2366
द = 2342
>>>

Then I tried this:

def switch_chars(text):
    halves = (2367, 2366, 2379, 2368)  # code points that seem to modify the preceding char
    newtext = list(text)

    for i in range(1, len(text)):
        ordi = ord(newtext[i])
        if ordi in halves:  # swap with the previous char
            newtext[i - 1], newtext[i] = newtext[i], newtext[i - 1]
    return "".join(newtext)

When I then write the string switch_chars(text) instead of text, the output starts to look better, although it is still far from the final goal:
very wrong:

somewhat better (achieved with the above):

final goal:

For me this looks like a promising path to go. I think you are in a better position to help complete the rules that must be implemented in some function like switch_chars above ...!
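
One possible first step toward such rules is to group the code points into the units a reader perceives as one letter before any reordering is attempted. The sketch below is a rough approximation under simple assumptions (marks attach to the preceding unit, a consonant after a virama joins it). It is not a shaping engine and it ignores many Devanagari rules; the name `rough_clusters` is invented for illustration.

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA, joins consonants into conjuncts


def rough_clusters(text):
    """Group code points into approximate Devanagari display units."""
    clusters = []
    for ch in text:
        category = unicodedata.category(ch)
        is_mark = category in ("Mn", "Mc")  # vowel signs, virama, other marks
        joins_previous = (
            bool(clusters) and clusters[-1].endswith(VIRAMA) and category == "Lo"
        )
        if clusters and (is_mark or joins_previous):
            clusters[-1] += ch  # extend the current unit
        else:
            clusters.append(ch)  # start a new unit
    return clusters


print(rough_clusters("ज्ञानको"))
```

Any reordering or glyph substitution would then operate on these units rather than on single code points.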

0 replies

arjunpaudyal · 2020-08-07T21:41:21Z

arjunpaudyal
Aug 7, 2020
Author

I have not tried it, but I think it will be challenging, though not impossible.

Some of the issues are at the font level.

Take an example: ज्ञ appears to be a single character (and is one in the language). (taken from: ज्ञानको)
(one letter in the language, but three code points in Unicode)

According to the Unicode table, it is formed by a combination of 3 characters: ज = 2332 + ् = 2381 + ञ = 2334
The same complex character is then further modified to make ज्ञा, which is the combination ज = 2332 + ् = 2381 + ञ = 2334 + ा = 2366

Each font is free to define its ligatures in its own code page. I think this happens at the font level, meaning a ligature may NOT have a global integer value.
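
The decomposition described above can be checked with Python's standard `unicodedata` module. This is a small sketch; the helper name `describe` is made up for illustration.

```python
import unicodedata


def describe(text):
    """Return (character, code point, Unicode name, category) per code point."""
    return [
        (ch, ord(ch), unicodedata.name(ch), unicodedata.category(ch))
        for ch in text
    ]


conjunct = "\u091c\u094d\u091e\u093e"  # the four code points of ज्ञा

for ch, code, name, category in describe(conjunct):
    print(f"U+{code:04X} ({code}) {category} {name}")
```

The listing shows that what looks like one letter is stored as separate code points, so the single joint glyph has to come from the font together with a shaping step.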

You are already one step ahead of where I think I could help you.

0 replies

JorjMcKie · 2020-08-08T10:32:54Z

JorjMcKie
Aug 8, 2020
Maintainer

i think it will be a challenging idea, though not impossible.

I agree. The Microsoft website you pointed me to explains exactly such an algorithm (on an abstract level, unfortunately). It is obviously a complex algorithm, and your recent post confirms this.
However, such an algorithm is doable, that's for sure: it is implemented in LibreOffice and all the other examples.
The problem is finding the source code of such an algorithm. The rest would be more or less straightforward.
Maybe I can take a look at the LibreOffice source. That should be possible, because it is free open source software.

0 replies

sravanthi-upadrasta · 2023-07-09T11:47:16Z

sravanthi-upadrasta
Jul 9, 2023

@JorjMcKie
We are facing the same issue when using fonts for Devanagari. I just want to check whether you have found a solution for the issue described above. We would appreciate your response.

1 reply

JorjMcKie Jul 14, 2023
Maintainer

Unfortunately, there has been no progress in this area. We are still at the same point 😒.

ousia · 2023-07-15T19:43:11Z

ousia
Jul 15, 2023

Sorry, but I wonder whether the issue here is a missing text shaping engine.

If this is the case, Harfbuzz is a well-known open source shaping engine and it also has Python bindings.

Just in case it might help.

6 replies

ousia Nov 3, 2023

@arjunpaudyal,

I guess this would be a pending integration (to say the least).

Python bindings for Harfbuzz are available https://github.com/harfbuzz/uharfbuzz.

I’m afraid that the main limitation for many developers is that they are only experienced with Latin script (or similar ones, such as the Greek and Cyrillic scripts).

This would be my case (if I were able to code 😅).

Just in case it might help.

subalalithafl Nov 13, 2023

Solution provided in the below discussion may help?
https://groups.google.com/g/pdfium/c/pwtg4PBZekU

I had hard time getting around the solution proposed, may be you can understand @JorjMcKie ?

ousia Nov 14, 2023

I had hard time getting around the solution proposed, may be you can understand @JorjMcKie?

Sorry, @arjunpaudyal, but the last part of the sentence is ambiguous to me.

If you ask whether I understand what @JorjMcKie may be proposing, I’m afraid absolutely not (I wish I could).

Sorry again, but I cannot even code (not to mention my basic lack of programming skill).

subalalithafl Nov 15, 2023

Sorry I think I was not clear in my statement. I was asking if @JorjMcKie can understand the solution proposed in the Google Groups post(link i posted earlier) to use Harfbuzz with PDF

JorjMcKie Nov 15, 2023
Maintainer

@subalalithafl - sorry for getting into this discussion so late. Please give me some more time, I am very busy currently 🤷‍♂️.

JorjMcKie · 2023-11-25T11:28:07Z

JorjMcKie
Nov 25, 2023
Maintainer

I just confirmed, that PyMuPDF's Story feature does support Harfbuzz!

This feature uses PyMuPDF in analogy to an internet browser:
As with a browser, input is any combination of HTML and CSS (for styling, like bold, italic, text color, etc.), but output is a PDF (potentially with multiple pages) - not a web page.

Here is a small, but complete example: Text copied from the Indian Wikipedia home page put into a simple html, and a script that outputs it.
hindi.zip

7 replies

subalalithafl Nov 25, 2023

@JorjMcKie Thanks a lot for this. Yes it is indeed working to create a PDF in Indic languages. I tried Tamil as well. But one issue though, I am not able to select the text properly.
I need to check if this will work with the font replacement example given. I am using the example(replfont files) to replace English text with translated text in Tamil. I need to go through the Story to check if that is feasible.

JorjMcKie Nov 25, 2023
Maintainer

what do you mean "select the text properly"?

JorjMcKie Nov 25, 2023
Maintainer

If you replace a font with that script, then the story feature is not active - so the Harfbuzz mechanism is not invoked.

subalalithafl Nov 25, 2023

I mean text selection in the PDF. Highlighting a text and copying it.

JorjMcKie Nov 25, 2023
Maintainer

what is going wrong when doing this?

subalalithafl · 2023-11-25T12:28:01Z

subalalithafl
Nov 25, 2023

Hindi seems to be good. I tried Tamil. This is what I am getting.

Below is the PDF generated with Tamil text. It looks good.

Now if I try to select the text using the pointer, or simply press CTRL+A, I get the result below. Not all text is selected, as you can see.

If I copy the same selection and paste it into a text editor, I do not see the proper Tamil text. It is broken, as shown below.

1 reply

JorjMcKie Nov 25, 2023
Maintainer

ok - I understand
I am writing a post further down that may help. So that you do not need manual copy.

JorjMcKie · 2023-11-25T12:47:14Z

JorjMcKie
Nov 25, 2023
Maintainer

Because we need the Story feature to take control, the output cannot be handled as easily as with the traditional ways of writing text. But there are ways to deal with some situations.
If you want to place text output created by Story (Tamil, Hindi, Chinese, whatever) into a certain position (say: clip) of an existing PDF page:

Have the story use a temporary PDF + page as output - with an appropriately chosen page size
Take your existing page / clip to insert that Story output via page.show_pdf_page().

import fitz
import io

def story_maker(clip, text):  # make a PDF of page size "clip" containing text
    clip = fitz.Rect(clip)  # ensure we have a Rect
    clip += (-clip.x0, -clip.y0, -clip.x0, -clip.y0)
    more = True
    while more:
        fp = io.BytesIO()  # use for file output
        writer = fitz.DocumentWriter(fp)
        story = fitz.Story(html=text)
        mediabox = clip
        dev = writer.begin_page(mediabox)
        more, _ = story.place(mediabox)
        if more:  # text did not fit in this clip, so enlarge and try again
            clip *= 1.05  # enlarge by 5%
            continue
        story.draw(dev)  # good, we stayed inside the clip
        writer.end_page()  # end the page
        writer.close()  # close the writer
        break  # leave the loop
    doc = fitz.open("pdf", fp)  # make a PDF from memory
    return doc

tamil = ""  # text in Tamil language, with or without styling

temp_pdf = story_maker(clip, tamil)
page.show_pdf_page(clip, temp_pdf, 0)

2 replies

subalalithafl Nov 25, 2023

Thanks for the pointer. I will try this out with my scenario. It will take some time. Will reply back if facing issues.

JorjMcKie Nov 25, 2023
Maintainer

Note: I have just tested / corrected the above function story_maker().

JorjMcKie · 2023-11-27T10:50:38Z

JorjMcKie
Nov 27, 2023
Maintainer

Please see this announcement.
Once that thing is released, we can finally close this discussion topic here!

6 replies

sanchayjain28 Nov 27, 2023

That's great, @JorjMcKie.
How can I help you, and could you please tell me when you will release that update?

JorjMcKie Nov 27, 2023
Maintainer

You are free to use it immediately - without it already being released. All you have to do is import the script enabling this from an extra file like this:

import fitz
import pathlib
from htmlbox import insert_htmlbox

fitz.Page.insert_htmlbox = insert_htmlbox  # mix it into the Page object


text = "some mixture of plain text or html ..."
doc = fitz.open()
page = doc.new_page()
clip = fitz.Rect(200, 200, 500, 400)
css = "body {font-family: sans-serif;}"  # example extra styling
rc = page.insert_htmlbox(clip, text,
    css=None,
    rotate=0,  # one of 0, 90, 180, 270
    adjust=True,  # whether to reduce font size until text fits in clip
    morph=None,
    overlay=True,
)
print(rc)  # float as in insert_textbox

doc.subset_fonts()  # recommended
doc.ez_save("output.pdf")

This is the code imported above:
htmlbox.zip

Answer selected by JorjMcKie

sanchayjain28 Nov 27, 2023

Thank you @JorjMcKie, this helped me a lot. Can I change the font color here?

arjunpaudyal Jan 9, 2024
Author

@JorjMcKie This seems to insert complex scripts with the correct complex glyphs.

I am struggling to insert a font of my choice and get the same result as with the default.

Unlike the textbox method, insert_htmlbox does not take a font argument.

Apart from inserting a font, the problem is solved, and the solution was tested on:
'''
os : Win11
Python : 3.12
pymupdf : ('1.23.8', '1.23.7', '20231219000001')
known requirement :
fonttools : pip(3) install fonttools
'''

JorjMcKie Jan 9, 2024
Maintainer

At @arjunpaudyal re this post:
What did you try to do? Use page.insert_htmlbox()?

arjunpaudyal Jan 9, 2024
Author

@JorjMcKie I copied your custom function. The code is a bit messy because I tested multiple things at once. Devanagari Unicode character rendering is working fine though. Thank you.

The full working code is here:

import fitz
import io


def insert_htmlbox( page, rect, text, rotate=0, oc=0, adjust=True, overlay=True, morph=None, css=None ):
    def story_maker(clip, text, adjust=True, rotate=0, css=css):
        rect = fitz.Rect(clip)  # copy to rect
        rect += (-clip.x0, -clip.y0, -clip.x0, -clip.y0)
        if rotate in (90, 270):
            rect.x1, rect.y1 = rect.y1, rect.x1
        orig_height = rect.y1
        mycss = "body {margin:1px;}" + css
        more = True
        while more:
            fp = io.BytesIO()  # use for file output
            writer = fitz.DocumentWriter(fp)
            story = fitz.Story(html=text, user_css=mycss)
            dev = writer.begin_page(rect)
            more, filled = story.place(rect)
            if more:  # text did not fit in this clip, so enlarge and try again
                rect *= 1.01  # enlarge by 1%
                continue
            story.draw(dev)  # good, we stayed inside the clip
            writer.end_page()  # end the page
            writer.close()  # close the writer
            break  # leave the loop

        try:
            doc = fitz.open("pdf", fp)  # make a PDF from memory
        except Exception:  # no usable PDF was produced
            return None, fitz.FZ_MAX_INF_RECT
        if adjust is False and rect.y1 > orig_height:
            doc.close()
            return None, rect.y1
        return doc, filled[3] - filled[1]

    while rotate < 0:
        rotate += 360
    while rotate >= 360:
        rotate -= 360
    rect = fitz.Rect(rect)
    if css is None:
        css = ""
    doc, height = story_maker(rect, text, adjust=adjust, rotate=rotate, css=css)

    rc = rect.height - height if rotate in (0, 180) else rect.width - height
    if doc is not None:
        page.show_pdf_page(
            rect, doc, 0, rotate=rotate, oc=oc, overlay=overlay, morph=morph
        )
    return rc

### End of custom function



doc = fitz.open()
page = doc.new_page()

# todo:
# fname="Devnagari"
# ffile = r"D:\WindowsData\Desktop\Font Development\Fotns\_EXTRACTED\001AG___.TTF"
# page.insert_font(fontname=fname, fontfile=ffile)
# try to attempt to insert custom font

text1 = "फ्याक्ट : बिक्री क्ष ज्ञ  च्छ द्य श्र श्च  द्ध द्भ त्त त्त्र u\"बाह्यसम्पर्कतन्तु, आर्थिककेन्द्रत्वेन वर्तते\" "
text2 = f"नि:शुल्क ज्ञानको लागी लाई धन्यबाद Testing  \r\n\t CR/LF before me is lost  <br>I am on new line.<br>I support Html Css Inline style.<p style=\"color:blue;\">I am blue</p>"


fitz.Page.insert_htmlbox = insert_htmlbox  # mix it into the Page object


clip1 = fitz.Rect(50, 100, 400, 300)
css = "body {font-family: sans-serif;}"  # example extra styling
rc = page.insert_htmlbox(clip1, text1,
    css=None,
    rotate=0,  # one of 0, 90, 180, 270
    adjust=True,  # whether to reduce font size until text fits in clip
    morph=None,
    overlay=True,
    #fontname=fname,
)
print(rc)  # float as in insert_textbox

clip2 = fitz.Rect(50, 200, 500, 750)
rc2 = page.insert_htmlbox(clip2, text2,
    css=None,
    rotate=0,  # one of 0, 90, 180, 270
    adjust=True,  # whether to reduce font size until text fits in clip
    morph=None,
    overlay=True,
    #fontname=fname,
)
print(rc2)  # float as in insert_textbox

doc.subset_fonts()  # recommended
doc.ez_save("DevUni.pdf")

JorjMcKie · 2023-11-27T18:09:12Z

JorjMcKie
Nov 27, 2023
Maintainer

all you need is some knowledge about HTML and styling with CSS

5 replies

sanchayjain28 Nov 27, 2023

Thank you @JorjMcKie

sanchayjain28 Nov 29, 2023

Hello @JorjMcKie, could you please help me?
I want to erase text from a PDF and replace it with other text using insert_htmlbox().
There is a method for erasing the background:
page.add_redact_annot(bbox,fill=(1,1,1))
page.apply_redactions()
but this only works for a solid background.
If the background is non-uniform, what should I use?

JorjMcKie Nov 29, 2023
Maintainer

@sanchayjain28 - I am not sure what you mean by "solid" background?

Redactions can erase text, links, and image portions - not vector graphics.
On the other hand, you are filling the redaction area with white - so you should be fine anyway ...

JorjMcKie Nov 29, 2023
Maintainer

BTW - if you are not referring to the language problem itself, it would be better to open a different discussion.

sanchayjain28 Nov 29, 2023

Yes,my issue is different and not referring to language problem
I want to replace the text blocks of pdf text with another text and for doing that first I have to delete old text then inserting new text ( that is done by the code you have sent) and I need help to delete ,erase the text so that background of the pdf remain same (as all pdf have different background i.e of same colour, non-uniform colour )

JorjMcKie · 2023-11-29T14:13:53Z

JorjMcKie
Nov 29, 2023
Maintainer

Yes,my issue is different and not referring to language problem I want to replace the text blocks of pdf text with another text and for doing that first I have to delete old text then inserting new text ( that is done by the code you have sent) and I need help to delete ,erase the text so that background of the pdf remain same (as all pdf have different background i.e of same colour, non-uniform colour )

Ok, I see.
As I wrote, that should work if there are no vector graphics in the background. Of course you must not use the fill parameter in the redactions.

0 replies

sanchayjain28 · 2023-11-29T14:17:45Z

sanchayjain28
Nov 29, 2023

5 replies

JorjMcKie Nov 29, 2023
Maintainer

It will work - it is documented, I just looked it up exclusively for you! Use fill=False.

JorjMcKie Nov 29, 2023
Maintainer

cannot be - works for me:
original page:

Then make a green background and erase all "pixmap" occurrences:

>>> import fitz
>>> doc = fitz.open("v110-changes.pdf")
>>> page = doc[0]
>>> page.draw_rect(page.rect, fill=(0, 1, 0), overlay=False)
Point(0.0, 0.0)
>>> for r in page.search_for("pixmap"):
...     page.add_redact_annot(r, fill=False)
...
'Redact' annotation on page 0 of v110-changes.pdf
'Redact' annotation on page 0 of v110-changes.pdf
'Redact' annotation on page 0 of v110-changes.pdf
'Redact' annotation on page 0 of v110-changes.pdf
'Redact' annotation on page 0 of v110-changes.pdf
'Redact' annotation on page 0 of v110-changes.pdf
'Redact' annotation on page 0 of v110-changes.pdf
'Redact' annotation on page 0 of v110-changes.pdf
'Redact' annotation on page 0 of v110-changes.pdf
'Redact' annotation on page 0 of v110-changes.pdf
'Redact' annotation on page 0 of v110-changes.pdf
'Redact' annotation on page 0 of v110-changes.pdf
'Redact' annotation on page 0 of v110-changes.pdf
'Redact' annotation on page 0 of v110-changes.pdf
'Redact' annotation on page 0 of v110-changes.pdf
>>> page.apply_redactions()
True
>>> doc.ez_save("z.pdf")

Result:

JorjMcKie Nov 29, 2023
Maintainer

can you share an example page?

JorjMcKie Dec 1, 2023
Maintainer

Try this:
new-pdf.zip

The key thing is that you have an image background. The redaction application default is punching holes in images that overlap a redaction.
I used the "images" option to prevent this from happening ...
Reading the documentation is a good idea in many cases ... 😉.

sanchayjain28 Dec 1, 2023

Thank You @JorjMcKie

Replies: 41 comments · 33 replies
