Newb Question: spaCy vs AutoML #8173
-
Newbie Question: I'm trying to decide between Google NLP AutoML and spaCy for my problem. I like the idea of spaCy because it will force me to learn more, but I will need to invest more up front in the paid-for data annotation tool (Google's annotation for AutoML is built in). Can spaCy solve the following problem? I have documents that look like the items below.
Basically, every document will have two names: the name of the borrower and the name of the person or company being borrowed from. Each name MAY have an address; the borrower definitely should have one. And there will be an amount. Using simple regex I can get about 70% accuracy at finding those key pieces of data. According to the spaCy training video best practices, I got the impression that spaCy may not be good at distinguishing between the different TYPES of names and TYPES of addresses: that it would be good at saying 123 E Elm Street is an address, but may not work for "borrower address" vs "borrowee address". Thanks in advance; any help would be greatly appreciated.
Replies: 2 comments 3 replies
-
I haven't worked with AutoML, but I'd say spaCy isn't exactly beginner friendly. Bear in mind I'm not an expert, but I've been working on a project that uses it for Named Entity Recognition on legal documents in AZ. There aren't many tutorials or much information on building with it, and it's also a negative that you would have to buy Prodigy to annotate quickly. By any chance, are you scraping the Maricopa County Recorder's Office?
If you can get 90% accuracy without machine learning, it may not be worth your time. The downside of working with spaCy, and probably ML in general, is that by the time you've annotated 2000 examples you may understand the documents well enough that you could have spent that time writing rules instead.
It looks like in your use case you are parsing a TRUSTOR and a TRUSTEE every time you run the parser, as well as a principal AMOUNT and maybe some addresses. I think this is possible with spaCy as long as the documents have some regularity. I know it has some features for finding patterns in grammar, but it would be better to preprocess the text so the input is cleaner before spaCy's ML model extracts information from it.
As for your last impression, that spaCy "may not be good at distinguishing between the different TYPES of names, and TYPES of addresses", and that it would recognize 123 E elm street as an address but may not work for "borrower address" vs "borrowee address": I think spaCy is actually surprisingly good at this if there's enough context. It could identify different types of addresses, but it boils down to which approach you're using. I'm not sure what the best approach is, but I've done it the generic way with Named Entity Recognition, and it's accurate as long as word order is what distinguishes the roles.
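To make the "preprocess, then write rules" idea above concrete, here is a minimal sketch using only Python's standard library. The sample layout, the `normalize` and `find_amounts` helpers, and the assumption that amounts are written as US-style dollar figures are all hypothetical; real recorder documents will likely need more patterns than this.

```python
import re
from typing import List

# Matches US-style dollar amounts such as "$250,000.00", "$1,500" or "$250000".
# Assumption: amounts always carry a leading "$" and at most two decimals.
AMOUNT_RE = re.compile(r"\$\s?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d{2})?")


def normalize(text: str) -> str:
    """Collapse runs of whitespace (including line breaks) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def find_amounts(text: str) -> List[str]:
    """Return every dollar amount found in the normalized text, in order."""
    return AMOUNT_RE.findall(normalize(text))
```

Normalizing first means an amount that was pushed onto its own line by OCR or PDF extraction still sits next to its surrounding words, which also helps any later spaCy step.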
-
AutoML and spaCy are apples and oranges and I'm not sure how much sense a comparison makes, but I'll try to answer your question as best I can. Also, as a note for anyone else reading this: while Prodigy is a great tool and recommended, you really don't have to use it, and there are free alternatives. They tend to be a little more limited or DIY, but if budget is a concern they're an option.
This is not really a limitation of spaCy so much as a limitation of NER technology in general. It's much easier to say "this is a person" than "this is the person who is doing this thing in this circumstance". In fact, the second problem is generally called Semantic Role Labelling and is its own problem in the research literature. I would be surprised if Google AutoML solved that for you out of the box. Based on your example sentences, though, I think you can go a long way with spaCy's default models. If you saw the video that talked about using generic models, you'll remember they mentioned using the dependency parse. You can do that with your example sentences to get useful results in a straightforward way with zero training data.
Some patterns you can use here:
You can slowly build on these patterns to improve your coverage. One nice thing about spaCy is that it'll be much better than regex at picking out names for you, and it will also spot monetary amounts as entities you can use. One thing it doesn't do out of the box is label street addresses, but I think there are some tools in the spaCy Universe for that; even if not, it's not that complicated to wire up an existing address detector as a custom component if you need to.