Hello! Following up on my Reddit post about my own OCR work. I was mostly working with documents from the Ivan Allen Jr. Mayoral Records (https://ivanallen.iac.gatech.edu/mayoral-records/traditional/) as a research project in grad school. They're all 1960s-era documents, all in English, about 80% typed and 20% handwritten, so my results obviously won't generalize to everything.
Here's a summary (from about 2022/4)
- Google Vision/Textract are the best, particularly if you need handwriting, but you're paying for them and uploading your documents to Google/Amazon.
- surya is the new OCR that's surprisingly good; I had not heard of it before this year. It works great, it just depends on whether the commercial license works for you. On my 3080, it took 188 seconds to run 88 pages.
- Tesseract is still really good for what it is, and its real advantage is speed. It's 100% CPU bound, and for processing 88 pages it took:
  - tessdata-fast: 121 seconds
  - tessdata-best: 201 seconds
  - tessdata-default: 150 seconds
  - (Note: in my results, fast was sometimes better than best.)
  - That was just leaving it single threaded. I could easily parallelize that, and from memory the parallel run took roughly a 20th of the time.
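For anyone wanting to try the parallel run, here is a minimal sketch of one way to do it, not the code from my repo. It assumes Tesseract 4 or later is on your PATH, that each page is already a separate image file, and that `OMP_THREAD_LIMIT=1` is set so each Tesseract process stays on one core. The function names and the `run_page` parameter are made up for illustration.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(image, tessdata_dir=None, lang="eng"):
    """Build a tesseract CLI call that prints the recognized text to stdout."""
    cmd = ["tesseract", str(image), "stdout", "-l", lang]
    if tessdata_dir is not None:
        # Point at a specific model set, e.g. a local tessdata-fast checkout.
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


def ocr_page(image, tessdata_dir=None, lang="eng"):
    """Run tesseract on one page image and return its text."""
    # Keep each tesseract process on a single OpenMP thread so that
    # running many of them side by side does not oversubscribe the CPU.
    env = dict(os.environ, OMP_THREAD_LIMIT="1")
    result = subprocess.run(
        build_command(image, tessdata_dir, lang),
        capture_output=True,
        text=True,
        check=True,
        env=env,
    )
    return result.stdout


def ocr_pages(images, workers=None, run_page=ocr_page):
    """OCR many pages concurrently; results come back in input order."""
    workers = workers or os.cpu_count() or 1
    # Threads are enough here: the real work happens in child processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_page, images))


# Example usage (hypothetical directory of page images):
# from pathlib import Path
# texts = ocr_pages(sorted(Path("pages").glob("*.png")))
```

The speedup you get depends on your core count and disk, so treat the "roughly a 20th" figure above as specific to my machine.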
However, the other issue with Tesseract and some of the others is whether you need to keep formatting such as tables, headers, etc. And if you need non-Latin characters, I expect the results could be quite different (PaddleOCR, for example, might be much better; I think it's more optimized for Chinese).
I've pushed all my code to GitHub. Forgive the very dirty nature of it; I had long meant to clean it up but never got around to it. Still, I hope it gives you an idea of what I've been up to:
https://github.com/driscoll42/VIP_OCR
It probably won't all work perfectly out of the box, but the fixes should be straightforward, and I could work on them if you wanted the help here.
Happy to try to answer any more questions you have or help in ways you'd find useful. It would be really useful to have an OCR benchmark database with sample code and a test set of files for everyone to compare.
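As a starting point for the "sample code" part of such a benchmark, here is a rough sketch of one common way to score an engine against a hand-transcribed reference: character error rate, computed from Levenshtein edit distance. This is an assumption about what a shared benchmark might use, not the metric from my own project, and the function names are illustrative.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # match or substitute
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by reference length; lower is better."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Example: one wrong character out of four.
# character_error_rate("abcd", "abxd")  # 0.25
```

Running every engine over the same test files and reporting this number alongside the timing would make results like "fast was sometimes better than best" easy for others to check.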