You should certainly be able to train an NER model on your data. I guess I'd need to see some example invoices to say if spaCy would have trouble with them or not, but it should be worth a shot at least.

Having a lot of empty lines isn't a problem. You can just remove them before passing the text to spaCy.
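As a minimal sketch of removing the empty lines, assuming the OCR output is already available as a plain Python string (the `drop_empty_lines` helper and the sample text are hypothetical, made up for illustration):

```python
def drop_empty_lines(text: str) -> str:
    """Return text with blank or whitespace-only lines removed."""
    # Keep only lines that still contain something after stripping whitespace.
    kept = [line.strip() for line in text.splitlines() if line.strip()]
    return "\n".join(kept)


# Hypothetical OCR output with runs of empty lines between fields.
ocr_text = "INVOICE\n\n\n   \nInvoice No: 12345\n\n\nTotal: 99.50\n"
cleaned = drop_empty_lines(ocr_text)
print(cleaned)
```

This also strips leading and trailing whitespace from each kept line, which may or may not be what you want if indentation carries meaning in your documents.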

One issue would be whether Tesseract has done a good job. If your text is mangled, spaCy may not be able to recover anything. This depends on how good the Tesseract model is, how clean your images are, and other details.
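One rough way to spot mangled OCR output before training is to measure how many tokens are mostly punctuation or symbols. The sketch below is only a heuristic under stated assumptions: the `suspicious_token_ratio` function, the 0.5 cutoff, and the sample strings are all hypothetical, and none of this is a spaCy or Tesseract feature.

```python
def suspicious_token_ratio(text: str) -> float:
    """Fraction of whitespace-separated tokens that are mostly non-alphanumeric."""
    tokens = text.split()
    if not tokens:
        return 0.0

    def is_suspicious(token: str) -> bool:
        # Assumed cutoff: a token is suspicious if fewer than half
        # of its characters are letters or digits.
        alnum = sum(ch.isalnum() for ch in token)
        return alnum / len(token) < 0.5

    return sum(is_suspicious(t) for t in tokens) / len(tokens)


# Hypothetical examples of clean and badly recognized text.
clean = "Invoice No: 12345 Total: 99.50"
mangled = "|nv~ ^#: ]2_ ~~ T@%;: 9"

print(suspicious_token_ratio(clean))
print(suspicious_token_ratio(mangled))
```

A high ratio on many pages would suggest fixing the image quality or OCR settings first, since no amount of NER training is likely to help with text in that state.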

Another issue is whether layout or word sequences are important. For example, if your invoice number is always in the top right of the page, spaCy has no way of knowing that. But if your document looks l…

Answer selected by polm
Labels
training Training and updating models
3 participants