Map every font character #940

victor-ab · 2021-03-10T17:16:56Z

victor-ab
Mar 10, 2021

Hi!

I want to fix bad character mappings using some kind of OCR tool, as mentioned in issue #365.

Is there a more efficient way to get the set of all character glyphs?

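import fitz  # PyMuPDF
from io import BytesIO
from PIL import Image

# `page` is a fitz.Page of an already opened document;
# display() is the Jupyter/IPython built-in.
chars = []
bboxes = []
blocks = page.getText("rawdict")["blocks"]
text_blocks = [i for i in blocks if i["type"] == 0]  # type 0 = text block
for b in text_blocks:
    for l in b["lines"]:
        for s in l["spans"]:
            for char in s["chars"]:
                # remember the bbox of the first occurrence of each character
                if char["c"] not in chars:
                    chars.append(char["c"])
                    bboxes.append(char["bbox"])

for bbox in bboxes:
    # render only the glyph's bbox, magnified 3x
    pixmap = page.get_pixmap(clip=fitz.Rect(bbox), matrix=fitz.Matrix(3, 3))
    display(Image.open(BytesIO(pixmap.getImageData())))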

Answered by JorjMcKie


JorjMcKie · 2021-03-10T17:31:05Z

JorjMcKie
Mar 10, 2021
Maintainer

text_blocks = [i for i in blocks if i['type']==0]

It would be more efficient to use page.getText("rawdict", flags=0)["blocks"]. This not only saves you from checking the block type yourself, it also prevents images from being extracted at all, which significantly reduces the size of the returned result.
Otherwise perfect.
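
A further small saving, sketched here under assumptions: the test `char["c"] not in chars` scans a list on every character, while a dict keyed by the character gives constant-time lookups and keeps each character paired with its bbox. The helper name `unique_char_bboxes` and the sample data are made up for illustration; the input only needs the blocks / lines / spans / chars shape used in the question.

```python
def unique_char_bboxes(blocks):
    """Return {char: bbox} holding the first occurrence of each character.

    `blocks` is assumed to be shaped like page.getText("rawdict", flags=0)["blocks"]:
    blocks -> "lines" -> "spans" -> "chars", each char having "c" and "bbox".
    """
    first_seen = {}
    for block in blocks:
        # .get() tolerates blocks without "lines" (e.g. image blocks)
        for line in block.get("lines", []):
            for span in line["spans"]:
                for char in span["chars"]:
                    # setdefault keeps the first bbox and ignores later repeats
                    first_seen.setdefault(char["c"], char["bbox"])
    return first_seen


# Hand-made stand-in for a rawdict result (coordinates are invented)
sample_blocks = [
    {"type": 0, "lines": [{"spans": [{"chars": [
        {"c": "a", "bbox": (0, 0, 5, 10)},
        {"c": "b", "bbox": (5, 0, 10, 10)},
        {"c": "a", "bbox": (10, 0, 15, 10)},
    ]}]}]},
    {"type": 1},  # image-like block without "lines"
]
glyphs = unique_char_bboxes(sample_blocks)
```

Iterating over `glyphs.items()` then yields each distinct character together with the rectangle to render.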

0 replies

victor-ab · 2021-03-10T18:22:16Z

victor-ab
Mar 10, 2021
Author

It worked:

import unicodedata

import easyocr
import fitz  # PyMuPDF


def check_bad_char(char):
    # "Co" is the Unicode private-use category; 65533 is U+FFFD (replacement character)
    category = unicodedata.category(char)
    return category == "Co" or ord(char) == 65533


# `page` is a fitz.Page of an already opened document
text_blocks = page.getText("rawdict", flags=0)["blocks"]
chars = []
bboxes = []
for b in text_blocks:
    for l in b["lines"]:
        for s in l["spans"]:
            for char in s["chars"]:
                if char["c"] not in chars and check_bad_char(char["c"]):
                    chars.append(char["c"])
                    bboxes.append(char["bbox"])

easyocr_reader = easyocr.Reader(["en"])

map_char = {}
for char, bbox in zip(chars, bboxes):
    # render only the glyph's bbox, magnified 4x, and OCR the image bytes
    pixmap = page.get_pixmap(clip=fitz.Rect(bbox), matrix=fitz.Matrix(4, 4))
    image_data = pixmap.getImageData()
    char_ocr = easyocr_reader.recognize(image_data, detail=0)[0]
    map_char[char] = char_ocr
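
To make the filter in check_bad_char concrete, here is a minimal standalone sketch using only the standard library. It restates the same predicate and applies it to a few sample characters chosen for illustration; the `samples` list and the `flags` dict are not part of the script above.

```python
import unicodedata


def check_bad_char(char):
    # "Co" = private-use code point; U+FFFD = replacement character
    return unicodedata.category(char) == "Co" or char == "\ufffd"


# Illustrative samples: a normal letter, a private-use code point,
# the replacement character, and a space
samples = ["A", "\uf041", "\ufffd", " "]
flags = {f"U+{ord(c):04X}": check_bad_char(c) for c in samples}
```

Only the private-use code point and the replacement character are flagged, so ordinary text never reaches the OCR step.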

But I think it would be better to do the OCR word by word, to avoid simple mix-ups such as "1" vs. "I" or "O" vs. "0".

Check this output:

{k: v for k, v in sorted(map_char.items(), key=lambda item: item[0])}

{ '\uf020': '',
 '\uf028': '(',
 '\uf029': '"',
 '\uf02c': ',',
 '\uf02d': '-',
 '\uf02e': '.',
 '\uf02f': '/',
 '\uf030': '0',
 '\uf031': '1',
 '\uf032': '2',
 '\uf033': '3',
 '\uf034': '4',
 '\uf035': '5',
 '\uf036': '6',
 '\uf037': '7',
 '\uf038': '8',
 '\uf039': '9',
 '\uf03a': ':',
 '\uf040': '@',
 '\uf041': 'A',
 '\uf042': 'B',
 '\uf043': 'C',
 '\uf044': 'D',
 '\uf045': 'E',
 '\uf046': 'F',
 '\uf047': 'G',
 '\uf048': 'H',
 '\uf049': '1',
 '\uf04a': 'J',
 '\uf04b': 'K',
 '\uf04c': 'L',
 '\uf04d': 'M',
 '\uf04e': 'N',
 '\uf04f': '0',
 '\uf050': 'P',
 '\uf051': '0',
 '\uf052': 'R',
 '\uf053': 'S',
 '\uf054': 'T',
 '\uf055': 'U',
 '\uf056': 'V',
 '\uf057': 'W',
 '\uf058': 'X',
 '\uf059': 'Y',
 '\uf05a': 'z',
 '\uf061': 'a',
 '\uf062': 'b',
 '\uf063': 'c',
 '\uf064': 'd',
 '\uf065': 'e',
 '\uf066': 'f',
 '\uf067': '9',
 '\uf068': 'h',
 '\uf069': 'i',
 '\uf06b': 'k',
 '\uf06c': 'I',
 '\uf06d': 'm',
 '\uf06e': 'n',
 '\uf06f': 'o',
 '\uf070': 'p',
 '\uf072': 'r',
 '\uf073': 's',
 '\uf074': 't',
 '\uf075': 'u',
 '\uf076': 'v',
 '\uf077': 'w',
 '\uf078': 'x',
 '\uf079': 'y'}
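
In this particular output the keys look like U+F000 plus the ASCII code of the intended character ('\uf041' for 'A', '\uf030' for '0'). Assuming that pattern holds for the font at hand, which is a guess based on the listing and not something PyMuPDF or easyocr guarantees, the OCR result can be cross-checked with plain arithmetic. The names `PUA_OFFSET`, `offset_guess` and `suspicious_entries` are hypothetical helpers.

```python
PUA_OFFSET = 0xF000  # assumption: glyph code = U+F000 + original character code


def offset_guess(char):
    """Guess the intended character by subtracting the assumed offset."""
    code = ord(char) - PUA_OFFSET
    # only trust the guess inside the printable ASCII range
    return chr(code) if 0x20 <= code <= 0x7E else None


def suspicious_entries(map_char):
    """Return {char: (ocr_result, offset_guess)} where the two disagree."""
    result = {}
    for char, ocr in map_char.items():
        guess = offset_guess(char)
        if guess is not None and guess != ocr:
            result[char] = (ocr, guess)
    return result


# A few entries copied from the output above
ocr_map = {"\uf041": "A", "\uf049": "1", "\uf04f": "0", "\uf067": "9", "\uf06c": "I"}
flagged = suspicious_entries(ocr_map)
```

Entries that end up in `flagged` are the ones worth checking by eye; the sketch only reports disagreements and does not decide which side is right.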

3 replies

JorjMcKie Mar 10, 2021
Maintainer

That's cool!

But I think it would be better to do the OCR word by word, to avoid simple mix-ups such as "1" vs. "I" or "O" vs. "0".

True. Or use "dict" instead of "rawdict" and check whether the span contains bad characters.
In either case, the page position of the translated, originally invalid character can only be determined with significant effort.
I did not know easyocr. I will have to check it out and see which languages it covers.
In any case, your idea is well worth turning into a recipe!

victor-ab Mar 10, 2021
Author

Yep, we can do something like this as well:

import unicodedata

import easyocr
import fitz  # PyMuPDF


def check_bad_char(char):
    # "Co" is the Unicode private-use category; 65533 is U+FFFD (replacement character)
    category = unicodedata.category(char)
    return category == "Co" or ord(char) == 65533


doc = fitz.open(pdf_path)  # pdf_path: path to the PDF file
page = doc[0]

easyocr_reader = easyocr.Reader(["en"])

text_blocks = page.getText("dict", flags=0)["blocks"]
for b in text_blocks:
    for l in b["lines"]:
        for s in l["spans"]:
            # OCR the whole span if it contains at least one bad character
            if any(check_bad_char(char) for char in s["text"]):
                bbox = fitz.Rect(s["bbox"])
                pixmap = page.get_pixmap(clip=bbox, matrix=fitz.Matrix(4, 4))
                image_data = pixmap.getImageData()
                detected_text = easyocr_reader.readtext(image_data, detail=0)
                if len(detected_text) > 0:
                    s["text"] = detected_text[0]

Worked flawlessly here. Thanks for your input!
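
If a per-character map like map_char from the earlier reply is already available, it can also be applied to span text directly, leaving correctly mapped characters untouched. A minimal sketch with the standard library; `repair_text` and the two-entry map are illustrative only.

```python
def repair_text(text, map_char):
    """Replace every character that has an entry in map_char; keep the rest."""
    # str.maketrans accepts a dict of single characters -> replacement strings
    table = str.maketrans(map_char)
    return text.translate(table)


# Illustrative map with two entries taken from the output shown earlier
map_char = {"\uf048": "H", "\uf069": "i"}
fixed = repair_text("\uf048\uf069!", map_char)
```

An entry that maps to the empty string simply drops that character, so empty OCR results are worth reviewing before the map is applied.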

shiu886 Nov 15, 2022

Wouldn't iterating over all spans/chars returned by page.get_texttrace() be equivalent, if not easier?
BTW, is there a way to get a pixmap or image data from the glyph ID of a char, instead of using page.get_pixmap() with that char's bbox?

JorjMcKie · 2021-03-10T20:03:16Z

JorjMcKie
Mar 10, 2021
Maintainer

This is really cool stuff!
For at least as long as there is no direct OCR support in PyMuPDF, and possibly even beyond that, I would like to see your solution included in the recipes of the documentation or in the PyMuPDF-Utilities repo.
Because of the heavyweight packages underneath easyocr (pytorch and friends), there is no way to make it an integral part of PyMuPDF.

3 replies

victor-ab Mar 10, 2021
Author

Go on and add that!

I agree with you. Most of the good OCR tools are going to be very heavy. easyocr performs better than tesseract, but it has its cost.

JorjMcKie Mar 10, 2021
Maintainer

Ok, thanks, will do.

JorjMcKie Mar 20, 2021
Maintainer

@victor-ab - I just published a derivative of your script examples here, together with a Tesseract-based version that has the same logic.
