opsabarsec/Receipts-OCR-on-colabs

Receipts optical character reading: Pytesseract vs. EasyOCR.

Details available as blog post.

process

1. Background

Receipts carry the information needed for trade between companies, and much of it is on paper or in semi-structured formats. Traditionally, the relevant information has been extracted manually and entered into a database, which is a labor-intensive and expensive process. Extracting key information from receipts and converting it into structured documents can serve many applications and services, such as efficient archiving, fast indexing and document analytics.

A simpler starting point is to take a picture of the receipt; the first step of information extraction is then to convert the printed characters and digits into a text file. Open-source packages are available for this task. The most common is Tesseract, often used together with the image-processing package OpenCV. Here it is tested and compared with a newer Python library, EasyOCR.

2. The data

200 images of restaurant and bar receipts were downloaded from the following link

3. OCR libraries test results

Both packages were tested in a Jupyter notebook running on Google Colab. Code is available here.

Pytesseract runs well on CPU alone but often does not produce accurate output because of focus and shadow problems in the original image. I tried preprocessing the images with OpenCV, but that helped only moderately.
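The Pytesseract route described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the grayscale plus adaptive-threshold preprocessing, the `--psm 6` page segmentation mode, and the function names `preprocess`, `ocr_with_tesseract` and `clean_lines` are all assumptions chosen for the example. The OCR libraries are imported inside the functions so that the small text-cleaning helper can be used on its own.

```python
def preprocess(image_path):
    """Load a receipt photo and binarize it (assumed preprocessing, not the author's)."""
    import cv2  # OpenCV, installed as opencv-python

    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Adaptive thresholding picks a local threshold per neighbourhood,
    # which tends to cope better with shadows than one global threshold.
    return cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 15
    )


def ocr_with_tesseract(image_path):
    """Run Tesseract on the preprocessed image and return the raw text."""
    import pytesseract  # requires the Tesseract binary to be installed

    # "--psm 6" asks Tesseract to treat the image as one uniform block of text.
    return pytesseract.image_to_string(preprocess(image_path), config="--psm 6")


def clean_lines(text):
    """Split OCR output into non-empty, stripped lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]
```

With a receipt photo at a hypothetical path, `clean_lines(ocr_with_tesseract("receipt.jpg"))` would return the recognized lines as a list of strings. The threshold block size and offset (31 and 15 here) are starting values that would likely need tuning per dataset.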

Better results were obtained with the EasyOCR library. It needs a GPU-accelerated environment, though, and runs smoothly once a GPU is enabled in the Colab notebook.
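A comparable sketch for EasyOCR is shown below. It assumes English-language receipts and a confidence cut-off of 0.5; both are illustrative choices, and the function names `read_receipt` and `keep_confident` are hypothetical rather than taken from the notebook. The `gpu` flag of `easyocr.Reader` is exposed as a parameter so the same code can also be tried on CPU, where it is expected to be slower.

```python
def read_receipt(image_path, use_gpu=True):
    """Run EasyOCR on an image; returns a list of (bounding_box, text, confidence)."""
    import easyocr

    # The language list is an assumption; the models are downloaded on first use.
    reader = easyocr.Reader(["en"], gpu=use_gpu)
    return reader.readtext(image_path)


def keep_confident(results, min_conf=0.5):
    """Keep only the text of detections whose confidence is at least min_conf."""
    return [text for _bbox, text, conf in results if conf >= min_conf]
```

For example, `keep_confident(read_receipt("receipt.jpg"))` (with a hypothetical file name) would give the recognized text fragments while dropping low-confidence detections, which is one simple way to inspect where the two libraries disagree.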

OCR
