Map every font character #940
-
Hi! I want to fix the bad character mapping with some kind of OCR tool, as mentioned in issue #365. Is there a more efficient way to get the set of all character glyphs on a page?

from io import BytesIO

import fitz  # PyMuPDF
from PIL import Image

# `page` is an already loaded fitz.Page; display() is the Jupyter/IPython built-in
chars = []
bboxes = []
blocks = page.getText("rawdict")["blocks"]
text_blocks = [i for i in blocks if i["type"] == 0]  # type 0 = text block
for b in text_blocks:
    for l in b["lines"]:
        for s in l["spans"]:
            for char in s["chars"]:
                # keep only the first occurrence of each character
                if char["c"] not in chars:
                    chars.append(char["c"])
                    bboxes.append(char["bbox"])

# render each unique glyph at 3x zoom
for bbox in bboxes:
    pixmap = page.get_pixmap(clip=fitz.Rect(bbox), matrix=fitz.Matrix(3, 3))
    display(Image.open(BytesIO(pixmap.getImageData())))
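On the efficiency part of the question: `char["c"] not in chars` scans a list for every character, so a dict keyed by character is a cheap improvement. Below is a minimal sketch, assuming the nested layout that `rawdict` returns (blocks, lines, spans, chars, each char carrying `c` and `bbox`). `collect_unique_chars` and the `sample` dict are hypothetical names made up for illustration; the function works on a plain dict and calls nothing from PyMuPDF.

```python
def collect_unique_chars(page_dict):
    """Return {char: bbox of its first occurrence} over all text blocks."""
    first_bbox = {}
    for block in page_dict["blocks"]:
        if block.get("type") != 0:  # skip non-text (image) blocks
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                for ch in span["chars"]:
                    # setdefault keeps the bbox seen first for each character
                    first_bbox.setdefault(ch["c"], ch["bbox"])
    return first_bbox


# Hypothetical hand-made input that mimics the rawdict structure
sample = {
    "blocks": [
        {"type": 1},  # image block, ignored
        {
            "type": 0,
            "lines": [
                {
                    "spans": [
                        {
                            "chars": [
                                {"c": "a", "bbox": (0, 0, 5, 10)},
                                {"c": "b", "bbox": (5, 0, 10, 10)},
                                {"c": "a", "bbox": (10, 0, 15, 10)},
                            ]
                        }
                    ]
                }
            ],
        },
    ]
}

unique = collect_unique_chars(sample)
```

Iterating over `unique.items()` then gives one character and one clip rectangle per glyph, which can be passed to the rendering loop in place of the two parallel lists.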
-
More efficient would be using
-
It worked:

import unicodedata
from io import BytesIO, StringIO

import easyocr
import fitz  # PyMuPDF
from PIL import Image

def check_bad_char(char):
    # private-use code points ("Co") and U+FFFD (65533, the replacement
    # character) indicate a glyph without a usable Unicode mapping
    category = unicodedata.category(char)
    is_bad_char = (category == "Co") or (ord(char) == 65533)
    return is_bad_char

# flags=0 leaves images out, so every block is a text block
text_blocks = page.getText("rawdict", flags=0)["blocks"]
chars = []
bboxes = []
for b in text_blocks:
    for l in b["lines"]:
        for s in l["spans"]:
            for char in s["chars"]:
                if char["c"] not in chars and check_bad_char(char["c"]):
                    chars.append(char["c"])
                    bboxes.append(char["bbox"])

easyocr_reader = easyocr.Reader(["en"])
map_char = {}
for char, bbox in zip(chars, bboxes):
    # render the single glyph at 4x zoom and let easyocr recognize it
    pixmap = page.get_pixmap(clip=fitz.Rect(bbox), matrix=fitz.Matrix(4, 4))
    image_data = pixmap.getImageData()
    char_ocr = easyocr_reader.recognize(image_data, detail=0)[0]
    map_char[char] = char_ocr

But I think it would be better to do the OCR by word, to avoid simple confusions such as "1" vs. "I" or "O" vs. "0". Check this output:

{k: v for k, v in sorted(map_char.items(), key=lambda item: item[0])}
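The word-level idea could start from something like the sketch below: it splits a span's characters at whitespace and returns only the words that contain at least one bad character, together with the union of their character boxes. This is an illustration under assumptions, not a tested part of the solution above: `words_with_bad_chars` and the `span` sample are hypothetical, and the input is a plain dict shaped like one `rawdict` span.

```python
import unicodedata


def is_bad_char(char):
    """True for private-use code points and U+FFFD (replacement character)."""
    return unicodedata.category(char) == "Co" or ord(char) == 0xFFFD


def words_with_bad_chars(span):
    """Return [(word, (x0, y0, x1, y1))] for words containing a bad char."""
    results = []
    current = []  # characters of the word currently being built

    def flush():
        # emit the finished word only if it needs OCR
        if current and any(is_bad_char(ch["c"]) for ch in current):
            text = "".join(ch["c"] for ch in current)
            bbox = (
                min(ch["bbox"][0] for ch in current),
                min(ch["bbox"][1] for ch in current),
                max(ch["bbox"][2] for ch in current),
                max(ch["bbox"][3] for ch in current),
            )
            results.append((text, bbox))
        current.clear()

    for ch in span["chars"]:
        if ch["c"].isspace():
            flush()  # whitespace ends the current word
        else:
            current.append(ch)
    flush()  # last word of the span
    return results


# Hypothetical span: "a" + private-use char, a space, then "bc"
span = {
    "chars": [
        {"c": "a", "bbox": (0, 0, 5, 10)},
        {"c": "\ue000", "bbox": (5, 0, 10, 10)},
        {"c": " ", "bbox": (10, 0, 12, 10)},
        {"c": "b", "bbox": (12, 0, 17, 10)},
        {"c": "c", "bbox": (17, 0, 22, 10)},
    ]
}

bad_words = words_with_bad_chars(span)
```

Each returned box could then be used as the `clip` rectangle for `page.get_pixmap`, so that the OCR engine sees the whole word and has context to tell "1" from "I". Mapping the recognized word back to individual characters is left open here.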
-
This is really cool stuff!
For at least as long as there is no direct OCR support in PyMuPDF, if not beyond that, I would like to see your solution included in the recipes of the documentation or in the PyMuPDF-Utilities repo.
Because of the heavyweight packages underneath easyocr (pytorch and friends), there is no way to make it an integral part of PyMuPDF.