Entity Recognition performs very bad on invoice data #10767
Replies: 1 comment
-
Hi @rsoika ,
Can you link the previous conversation where we said this? So that other users can refer to it as well.
The thing about Entity Recognition is that it performs better when there is "context" in the text. If you're showing it individual bits of information, then it might be better to include handcrafted business rules and the Matcher to extract them.
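To make the "handcrafted business rules" idea concrete, here is a minimal sketch for two of the fields mentioned in the question, the IBAN and a date like '09 May 2022'. It uses plain Python regular expressions rather than spaCy's Matcher, so it runs without any model; the same patterns could later be ported to Matcher or EntityRuler patterns. The names `extract_fields` and `iban_is_valid`, the regexes, and the sample text are all assumptions made for illustration, not something taken from the original thread.

```python
import re
from datetime import date

# Hypothetical patterns for illustration; real invoices will need more variants.
# IBAN: country code, two check digits, then groups of four (spaces optional)
# and an optional shorter final group.
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b")
# Date written as "09 May 2022" or "9 September 2022".
DATE_RE = re.compile(r"\b(\d{1,2}) ([A-Za-z]{3})[a-z]* (\d{4})\b")

MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}


def iban_is_valid(candidate: str) -> bool:
    """Check the IBAN mod-97 checksum so random character runs are rejected."""
    compact = candidate.replace(" ", "")
    if not 15 <= len(compact) <= 34:
        return False
    # Move the first four characters to the end, map letters to 10..35.
    rearranged = compact[4:] + compact[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def extract_fields(text: str) -> list[tuple[str, str]]:
    """Return (label, normalized value) pairs found by the rules above."""
    found = []
    for match in IBAN_RE.finditer(text):
        if iban_is_valid(match.group()):
            found.append(("IBAN", match.group().replace(" ", "")))
    for match in DATE_RE.finditer(text):
        month = MONTHS.get(match.group(2).lower())
        if month is None:
            continue
        try:
            parsed = date(int(match.group(3)), month, int(match.group(1)))
        except ValueError:
            continue  # e.g. "31 Feb 2022"
        found.append(("DATE", parsed.isoformat()))
    return found


# Made-up sample line; the IBAN is the widely published German example number.
sample = "Invoice date 09 May 2022 IBAN DE89 3704 0044 0532 0130 00 Total 120.50"
fields = extract_fields(sample)
```

The checksum step is the part a statistical model cannot easily learn: it lets the rule discard OCR noise that merely looks like an IBAN, while fields with real surrounding context can stay with the trained NER component.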
-
Hi,
we are trying to provide a solution to detect invoice data from scanned (OCR) invoices using spaCy entity recognition. The entities we are interested in are
The spaCy team warned me some time ago that this is not typical data for NLP, but we tried it anyway. We have about 100,000 invoices from a small number of companies (fewer than 150). We expected the detection rate to increase over time. We retrain the model with each new invoice, which gives us a permanent training situation. The software has now been running for more than a year. We are using spacy>=3.1.1,<3.2.0
But the model does not perform very well, and in many cases the performance decreases over time.
Some of the invoice entities perform better - e.g.
but other parts are detected very poorly
Could one reason for the poor performance be that the entities are often groups of characters that spaCy treats as individual words rather than as one coherent term (e.g. the IBAN or the date '09 May 2022' as in the example above)?
I wonder if there is some way to give spaCy a kind of hint about the fact that the text is not a 'natural' literary text?
We also remove all newlines, tabs and space sequences from the original text, even though for a human these are natural cues for finding information. E.g. a date or the amount is often at the end of a line. A company name is often at the top and at the beginning of a line.
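For comparison, a minimal sketch of a preprocessing step that collapses spaces and tabs but keeps the line breaks, so that "first on the line" and "last on the line" remain available as cues. The function names `normalize_keep_lines` and `line_position_features` and the sample text are hypothetical and only illustrate the idea; this is not the pipeline described above.

```python
import re


def normalize_keep_lines(raw: str) -> str:
    """Collapse runs of spaces/tabs inside each line and drop empty lines,
    but keep one output line per non-empty input line."""
    lines = []
    for line in raw.splitlines():
        cleaned = re.sub(r"[ \t]+", " ", line).strip()
        if cleaned:
            lines.append(cleaned)
    return "\n".join(lines)


def line_position_features(text: str) -> list[tuple[str, bool, bool]]:
    """Return (token, is_first_on_line, is_last_on_line) for every
    whitespace-separated token in the normalized text."""
    features = []
    for line in text.splitlines():
        tokens = line.split()
        for index, token in enumerate(tokens):
            features.append((token, index == 0, index == len(tokens) - 1))
    return features


# Made-up OCR output for illustration.
raw_ocr = "ACME GmbH\t\tInvoice\n\n  Date:   09 May 2022\nTotal\t 120.50 EUR\n"
normalized = normalize_keep_lines(raw_ocr)
features = line_position_features(normalized)
```

With the line structure preserved, position flags like these could feed handcrafted rules, or the text could be passed to the model one line at a time instead of as a single flattened string.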
Does anybody have an idea how we can improve the results using spaCy? Or is NLP the totally wrong approach to this kind of problem?
Thanks for any help
====
Ralph