Spacy for invoice data extraction #8187

ssherlins · 2021-05-24T05:51:27Z

Hi,
I recently came across spaCy while searching for an ML model to extract data from invoices. We have many different invoice formats (300+), so I was looking for a way to train a model that can extract details such as invoice number, purchase order number, and invoice date. That's when I read about NER and spaCy.

The invoices are PDFs, most of them scanned, so I used Tesseract to extract the text into text files and used that text to annotate and train. I used the following annotator: https://github.com/tecoholic/ner-annotator.

The extracted text has a lot of empty lines. Will that affect training? I would also like to know whether NER is the right approach for extracting data from invoices in multiple formats.

Answered by polm

polm · 2021-05-24T10:59:15Z

You should certainly be able to train an NER model on your data. I guess I'd need to see some example invoices to say if spaCy would have trouble with them or not, but it should be worth a shot at least.

Having a lot of empty lines is not a problem. You can just remove them.
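As a minimal sketch of that cleanup step, assuming the OCR output is already available as a Python string (the function name `drop_empty_lines` and the sample text are made up for illustration):

```python
def drop_empty_lines(ocr_text: str) -> str:
    """Return ocr_text with blank and whitespace-only lines removed."""
    kept = [line.rstrip() for line in ocr_text.splitlines() if line.strip()]
    return "\n".join(kept)


# Hypothetical OCR output with the kind of gaps Tesseract tends to leave.
sample = "ACME Ltd\n\n\nInvoice number: 1235456\n   \nTotal: 10.00\n"
cleaned = drop_empty_lines(sample)
print(cleaned)
```

It is probably best to do this before annotating, since removing lines afterwards shifts the character offsets of any entities that were already marked.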

One issue is whether Tesseract has done a good job. If the text is mangled, spaCy may not be able to recover anything. That depends on how good the Tesseract model is, how clean your images are, and other details.

Another issue is whether the information is carried by layout or by word sequences. For example, if your invoice number is always in the top right of the page, spaCy has no way of knowing that, because it only sees the text. But if your document looks like this:

Invoice number: 1235456

Then spaCy can probably learn that, or you can use the rule-based matchers to find it. spaCy is good at learning from sequences of words, so if the text is little more than a loose list of numbers with a few words, it may be hard to learn.
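For the rule-based route, a sketch using spaCy's `Matcher` on the example line above might look like the following. The pattern and the name `INVOICE_NUMBER` are assumptions chosen for illustration, not something taken from real invoices.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only; no trained components are needed
matcher = Matcher(nlp.vocab)

# "invoice number", an optional punctuation token such as ":", then a
# number-like token. "INVOICE_NUMBER" is an arbitrary name for this sketch.
pattern = [
    {"LOWER": "invoice"},
    {"LOWER": "number"},
    {"IS_PUNCT": True, "OP": "?"},
    {"LIKE_NUM": True},
]
matcher.add("INVOICE_NUMBER", [pattern])

doc = nlp("Invoice number: 1235456")
# Each match is (match_id, start, end); the value is the last token matched.
invoice_numbers = [doc[end - 1].text for _match_id, _start, end in matcher(doc)]
print(invoice_numbers)
```

Invoice numbers that mix letters and digits would not satisfy `LIKE_NUM`, so a real pattern would likely need to be looser than this one.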

If you can provide some examples we can probably give more tailored advice.

18 replies

ssherlins May 27, 2021
Author

Thank you @walidamamou.

polm May 27, 2021

Here's a very simple training data creation script.

import spacy
from spacy.tokens import DocBin

# For this toy example there's just one doc, but you should have several
# hundred at a minimum.
DATA = [
    {"text": "I like cheese.", "entities": [[7, 13, "FOOD"]]},
]

nlp = spacy.blank("en")
# A DocBin is basically a special list that will save docs on disk
doc_bin = DocBin(attrs=["ENT_IOB", "ENT_TYPE"])
for eg in DATA:
    # for each example, make a Doc that looks like your desired output
    doc = nlp(eg["text"])
    spans = [
        doc.char_span(start, end, label=label)
        for start, end, label in eg.get("entities", [])
    ]
    # char_span returns None when the offsets don't line up with token
    # boundaries; skip those rather than failing when setting doc.ents
    doc.ents = [span for span in spans if span is not None]
    # add the Doc to the DocBin
    doc_bin.add(doc)
# "train.spacy" is the name of a file which will hold your training data
doc_bin.to_disk("train.spacy")
print(f"Processed {len(doc_bin)} documents")
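One thing to watch for when building spans from character offsets, especially with noisy OCR text: `Doc.char_span` returns `None` if the offsets do not fall on token boundaries. The sketch below reuses the toy sentence from the script to show the behaviour; the misaligned offsets are invented for the demonstration.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I like cheese.")

# Offsets 7-13 cover the whole token "cheese", so a Span comes back.
good = doc.char_span(7, 13, label="FOOD")

# Offsets 7-12 stop inside the token ("chees"), so the default strict
# alignment returns None instead of a Span.
bad = doc.char_span(7, 12, label="FOOD")

# alignment_mode="expand" widens the span to the enclosing token(s).
expanded = doc.char_span(7, 12, label="FOOD", alignment_mode="expand")

print(good, bad, expanded)
```

Whether to expand, contract, or simply drop misaligned annotations depends on your data, so it is worth counting how many spans come back as `None` before training.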

Note that spaCy has no way to make use of image data.

polm May 27, 2021

Oh, one other thing - in your annotations, it looks like you were treating each line in your document as a separate spaCy Doc, but it would probably make more sense to treat the whole document as one big Doc.
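If the annotations already exist per line, one possible way to merge them is to join the lines and shift each entity's offsets by the length of the text that precedes it. This is only a sketch: the input shape (a list of `(text, entities)` pairs), the helper name `merge_line_annotations`, and the `INVOICE_NUMBER` label are assumptions, and your annotation tool's export may look different.

```python
def merge_line_annotations(lines):
    """Join per-line examples into one text with document-level offsets.

    `lines` is a list of (text, entities) pairs, where each entity is
    [start, end, label] relative to the start of its own line.
    """
    parts = []
    merged_entities = []
    offset = 0
    for text, entities in lines:
        for start, end, label in entities:
            merged_entities.append([start + offset, end + offset, label])
        parts.append(text)
        offset += len(text) + 1  # +1 for the "\n" that join() inserts
    return {"text": "\n".join(parts), "entities": merged_entities}


# Hypothetical per-line annotations for a two-line invoice.
per_line = [
    ("ACME Ltd", []),
    ("Invoice number: 1235456", [[16, 23, "INVOICE_NUMBER"]]),
]
example = merge_line_annotations(per_line)
start, end, label = example["entities"][0]
print(example["text"][start:end], label)
```

The result has the same `text` / `entities` shape as the `DATA` entries in the training data script, so it could be passed through the same loop.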

ssherlins May 27, 2021
Author

Yes, that's because the OCR output has a lot of empty lines in between. I believe joining the lines into a single string should do the job. Also, does UTF-8 encoding of the data affect training?

polm May 27, 2021

UTF-8 is the only supported encoding.
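In practice that mostly means being explicit about the encoding when reading the OCR output, because `open()` and `Path.read_text()` otherwise fall back to a platform-dependent default. A small sketch, with a made-up file name and helper name:

```python
import tempfile
from pathlib import Path


def read_ocr_text(path: Path) -> str:
    """Read an OCR output file as UTF-8; raises UnicodeDecodeError if it isn't."""
    return path.read_text(encoding="utf-8")


# Write a throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "invoice_0001.txt"
    sample_path.write_text("Total: 10,00 €\n", encoding="utf-8")
    text = read_ocr_text(sample_path)

print(text)
```

Letting the read fail on invalid bytes, rather than silently replacing them, makes it easier to spot files that were saved in a different encoding.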
