Spacy for invoice data extraction #8187
Hi,
Replies: 1 comment 18 replies
You should certainly be able to train an NER model on your data. I'd need to see some example invoices to say whether spaCy would have trouble with them, but it should be worth a shot at least.

Having a lot of empty lines is not a problem. You can just remove them.

One issue is whether tesseract has done a good job. If your text is mangled, spaCy may not be able to recover anything. This depends on how good the tesseract model is, how clean your images are, and other details.

Another issue is whether layout or word sequences are important. For example, if your invoice number is always in the top right of the page, spaCy has no way of knowing that. But if your document looks like this:

Then spaCy can probably learn that, or you can use the rule-based matchers to find it. spaCy is good at learning from sequences of words, so if you just have a more or less random list of numbers and a few words, it may be hard to learn.

If you can provide some examples we can probably give more tailored advice.
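To make the rule-based option a bit more concrete, here is a minimal sketch using spaCy's `Matcher`. The sample text, the helper names `drop_empty_lines` and `find_invoice_numbers`, and the pattern itself are all assumptions for illustration: it presumes the invoice number is a purely numeric token that follows the word "Invoice" (optionally "Number" or "No" and some punctuation). Real invoices will likely need a different pattern.

```python
from typing import List

import spacy
from spacy.matcher import Matcher


def drop_empty_lines(text: str) -> str:
    """Remove blank lines (common in OCR output) and trim each remaining line."""
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())


def find_invoice_numbers(text: str) -> List[str]:
    """Return numeric tokens that follow 'Invoice [Number|No] [punctuation]'."""
    # A blank pipeline only tokenizes, so no trained model has to be downloaded.
    nlp = spacy.blank("en")
    matcher = Matcher(nlp.vocab)

    # Hypothetical pattern, assuming a layout such as "Invoice Number: 12345".
    pattern = [
        {"LOWER": "invoice"},
        {"LOWER": {"IN": ["number", "no"]}, "OP": "?"},
        {"IS_PUNCT": True, "OP": "*"},
        {"IS_DIGIT": True},
    ]
    matcher.add("INVOICE_NUMBER", [pattern])

    doc = nlp(drop_empty_lines(text))
    # Each match is (match_id, start, end); the last token of the span is the number.
    return [doc[end - 1].text for _match_id, _start, end in matcher(doc)]


if __name__ == "__main__":
    # Made-up OCR output, not taken from a real invoice.
    sample = "ACME Corp\n\n\nInvoice Number: 12345\n\nTotal: 99.00\n"
    print(find_invoice_numbers(sample))
```

A matcher like this only works when the label and the value sit next to each other in the extracted text. If the OCR output separates them, or the number can only be identified by its position on the page, a pattern over token sequences will not find it.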