Works mostly well for pure Hindi text, but does NOT parse English words at all and misses a few numbers #84

I've used the Hindi trained data (`lang='hin'`).

It works mostly well for pure Hindi text, but it does NOT parse English words at all and drops some digits in numbers.


> English words with Hindi text

Example 1:
`आवेदन के नाम लेने से पहले (Registration process के पहले) समझने की बातें`
Parsed from the PDF by the code snippet below as:
`आवेदन के नाम लेने से पहले (२८्टा57807) 0700€55 के पहले) समझने की बातें`

Example 2:
`तेजस्विता जब किसी मालिकाना वस्तु पर (Possession) अथवा पद पर (Post/Position) निर्भर होते है`
Parsed from the PDF by the code snippet below as:
`तेजस्विता जब किसी मालिकाना वस्तु पर (?०८५७५५०) अथवा पद पर (?०5६/?०5ाधं0ा) निर्भर होते है`

Example 3:
`वस्तुनिष्ठ आनंद (objective happiness) यह हमेशा अपूर्ण होता है`
Parsed from the PDF by the code snippet below as:
`वस्तुनिष्ठ आनंद (०णुं०ता५ह 09000655) यह हमेशा अपूर्ण होता है`


> English words & Numbers with Hindi text

Example 1:
`आवेदन लेने की प्रक्रिया (Registration process) हमें 01/06/2024 से शुरू करनी है।`
Parsed from the PDF by the code snippet below as:
`आवेदन लेने की प्रक्रिया (९८8्डा[507820770655) हमें 0/06/2024 से शुरू करनी है।`
The `1` in `01` is also dropped here (`01/06/2024` became `0/06/2024`).


How to reproduce:
```
from pdf2image import convert_from_path
import pytesseract

# Specify Tesseract executable location
pytesseract.pytesseract.tesseract_cmd = '/opt/homebrew/bin/tesseract'

# Load and convert PDF to images
documents = convert_from_path("path_to_pdf.pdf")    # Try PDF that has Hindi text mixed with some English words/phrases and/or Numbers

# Extract text from each page image using the Hindi model
page_content = ""
for page_number, doc in enumerate(documents, start=1):
    try:
        page_content += pytesseract.image_to_string(doc, lang='hin')
        page_content += "\n"
    except Exception as e:
        # Report which page failed and why, then continue with the next page
        print(f"Error extracting content for page {page_number}: {e}")

print(page_content)
```


Any idea how I can also parse Hindi text mixed with English words/phrases and/or numbers?
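
One thing that may be worth trying, as an untested sketch: Tesseract accepts several language codes joined with `+`, so the same call can be made with `lang='hin+eng'` instead of `lang='hin'`. This assumes the English trained data (`eng`) is installed next to `hin`. The helper names `build_lang` and `ocr_pdf` are hypothetical, and `dpi=300` is only a guess at a setting that might help with the dropped digits. I have not confirmed that either change fixes the examples above.

```python
def build_lang(codes):
    """Join Tesseract language codes into the 'a+b' form used by lang=."""
    cleaned = [c.strip() for c in codes if c and c.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def ocr_pdf(pdf_path, codes=("hin", "eng"), dpi=300):
    """OCR every page of a PDF with all of the given languages enabled.

    Hypothetical helper: assumes the trained data for every code in
    `codes` is installed; dpi=300 is an assumption, not a known fix.
    """
    # Imported lazily so build_lang() works without the OCR stack installed.
    from pdf2image import convert_from_path
    import pytesseract

    lang = build_lang(codes)  # e.g. "hin+eng"
    pages = convert_from_path(pdf_path, dpi=dpi)
    return "\n".join(
        pytesseract.image_to_string(page, lang=lang) for page in pages
    )
```

Usage would be `text = ocr_pdf("path_to_pdf.pdf")`, with `pytesseract.pytesseract.tesseract_cmd` set beforehand as in the snippet above if the executable is not on the `PATH`.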
