Works mostly well for the pure Hindi text, but does NOT parse English words at all and misses out on a few Numbers #84

@ChintanDonda

I'm using the Hindi dataset. It works well for most pure Hindi text, but it does NOT recognize English words at all and drops some digits in numbers.

English words with Hindi text

Example 1:
आवेदन के नाम लेने से पहले (Registration process के पहले) समझने की बातें
===> Parsed from the PDF using the below code snippet as:
आवेदन के नाम लेने से पहले (२८्टा57807) 0700€55 के पहले) समझने की बातें

Example 2:
तेजस्विता जब किसी मालिकाना वस्तु पर (Possession) अथवा पद पर (Post/Position) निर्भर होते है
===> Parsed from the PDF using the below code snippet as:
तेजस्विता जब किसी मालिकाना वस्तु पर (?०८५७५५०) अथवा पद पर (?०5६/?०5ाधं0ा) निर्भर होते है

Example 3:
वस्तुनिष्ठ आनंद (objective happiness) यह हमेशा अपूर्ण होता है
===> Parsed from the PDF using the below code snippet as:
वस्तुनिष्ठ आनंद (०णुं०ता५ह 09000655) यह हमेशा अपूर्ण होता है

English words & Numbers with Hindi text

Example 1:
आवेदन लेने की प्रक्रिया (Registration process) हमें 01/06/2024 से शुरू करनी है।
===> Parsed from the PDF using the below code snippet as:
आवेदन लेने की प्रक्रिया (९८8्डा[507820770655) हमें 0/06/2024 से शुरू करनी है।
===> The 1 in 01 was also dropped (01/06/2024 became 0/06/2024).
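
To find the affected spots in a longer document, a rough heuristic is to flag tokens that mix Devanagari characters with ASCII letters or digits, as the garbled English words above do. This is only a sketch: `find_mixed_script_tokens` is a hypothetical helper, not part of pytesseract, and it will also flag legitimate tokens that combine the two scripts.

```python
def is_devanagari(ch):
    # Devanagari Unicode block: U+0900 to U+097F
    return "\u0900" <= ch <= "\u097F"


def find_mixed_script_tokens(text):
    """Return whitespace-separated tokens that mix Devanagari with ASCII letters or digits."""
    suspicious = []
    for token in text.split():
        has_devanagari = any(is_devanagari(ch) for ch in token)
        has_ascii_alnum = any(ch.isascii() and ch.isalnum() for ch in token)
        if has_devanagari and has_ascii_alnum:
            suspicious.append(token)
    return suspicious
```

Running it on the OCR output gives a quick list of tokens to compare against the source PDF by eye. It will not catch a silently dropped digit such as the missing 1 in the date.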

How to reproduce:

from pdf2image import convert_from_path
import pytesseract

# Specify Tesseract executable location
pytesseract.pytesseract.tesseract_cmd = '/opt/homebrew/bin/tesseract'

# Load and convert PDF to images
documents = convert_from_path("path_to_pdf.pdf")    # Try a PDF that has Hindi text mixed with English words/phrases and/or numbers

# Extract text from each image in Hindi
page_content = ""
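for page_number, doc in enumerate(documents, start=1):
    try:
        page_content += pytesseract.image_to_string(doc, lang='hin')
        page_content += "\n"
    except Exception as e:
        # Report which page failed and why, then continue with the next page
        print(f"Error in extracting page content for page {page_number}: {e}")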

print(page_content[0:5])

Any idea how I can also parse Hindi text that is mixed with English words/phrases and/or numbers?
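
One thing that may be worth trying: Tesseract accepts several languages joined with `+`, so passing `lang='hin+eng'` instead of `lang='hin'` loads the English model alongside the Hindi one. The sketch below assumes the `eng` traineddata file is installed next to `hin`. `ocr_pages` and its `image_to_string` parameter are hypothetical names used here for illustration. Whether this also recovers the missing digit is not confirmed.

```python
def ocr_pages(pages, languages=("hin", "eng"), image_to_string=None):
    """OCR each page image with all given Tesseract languages and join the text."""
    if image_to_string is None:
        # Imported lazily so a stand-in OCR function can be passed instead.
        import pytesseract
        image_to_string = pytesseract.image_to_string

    # Tesseract expects multiple languages as a single '+'-separated string.
    lang = "+".join(languages)
    return "\n".join(image_to_string(page, lang=lang) for page in pages)


# Usage with the snippet above (assumes pdf2image and Tesseract are set up):
# from pdf2image import convert_from_path
# documents = convert_from_path("path_to_pdf.pdf")
# print(ocr_pages(documents))
```

The order of the language codes can influence the result, so it may be worth comparing `("hin", "eng")` against `("eng", "hin")` on the same page.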
