Newb Question: spaCy vs AutoML #8173

jeff-plummer-radix · 2021-05-21T18:45:35Z

jeff-plummer-radix
May 21, 2021

Newbie question: I'm trying to decide between Google NLP AutoML and spaCy for my problem. I like the idea of spaCy because it will force me to learn more, but I will need to invest more up front in the paid data annotation tool (Google's annotation for AutoML is built in).

Can spaCy solve the following problem? I have documents that look like the items below.

The Trustor Bob Johnson who lives at 123 W. Elm Street, phoenix, AZ agrees to pay $500 to the Trustee Larry Byrd.
MyCorp at 123 W 1st Street, agrees to lend $500 to the trustor. Freddy Mercury at 125 E 7th street, NY NY agrees to pay at 5% interest over the next 30 years.
Fred Gates and Melinda Gates, a married couple who reside at 156 w mystreet, gilbert AZ, are borrowing the principal amount of $500. The trustee Harry Styles will retain deed until all payments are complete.

Basically every document will have two names: the name of the borrower and the name of the person/company being borrowed from. Each name MAY have an address; the borrower definitely should have one. And there will be an amount. Using simple regex I can get about 70% accuracy at finding those key pieces of data.
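For reference, a regex baseline of the kind mentioned above might look roughly like the sketch below. The patterns and the `extract` helper are hypothetical, written against the three sample documents rather than taken from a real project, and they only cover names that directly follow the words "Trustor" or "Trustee".

```python
import re

# Hypothetical patterns, tuned only to the sample documents in this post.
# Dollar amounts such as $500, $1,000 or $500.00.
AMOUNT = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?")
# A role word (any casing) followed by two or more capitalized words.
ROLE_NAME = re.compile(
    r"\b((?i:trustor|trustee))\s+([A-Z][a-z]+(?:\s[A-Z][a-z]+)+)"
)


def extract(text):
    """Return the role -> name pairs and dollar amounts found in text."""
    roles = {m.group(1).lower(): m.group(2) for m in ROLE_NAME.finditer(text)}
    return {"roles": roles, "amounts": AMOUNT.findall(text)}


sample = (
    "The Trustor Bob Johnson who lives at 123 W. Elm Street, phoenix, AZ "
    "agrees to pay $500 to the Trustee Larry Byrd."
)
result = extract(sample)
print(result)
```

On the second sample document these patterns find the amount but no names, because "trustor" is followed by a period rather than a name. That is the kind of gap that limits a rules-only approach.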

According to the spaCy training video best practices, I got the impression that spaCy may not be good at distinguishing between the different TYPES of names, and TYPES of addresses. That it would be good at saying 123 E elm street is an address. But may not work for "borrower address" vs "borrowee address".

Thanks in advance, any help would be greatly appreciated.

Answered by polm


bowwowden · 2021-05-21T21:47:09Z

bowwowden
May 21, 2021

I haven't worked with AutoML, but I'd say spaCy isn't exactly beginner-friendly. Bear in mind I'm no expert, but I've been working on a project that uses it for Named Entity Recognition to identify text in legal documents in AZ. There aren't enough tutorials or other information on building with it. It's also a negative that you would have to buy Prodigy to work quickly with it.

By any chance, are you scraping the Maricopa County Recorder's Office? If you can get 90% accuracy without using machine learning, ML may not be worth your time. The downside of working with spaCy, and probably ML in general, is that by the time you've annotated 2000 examples you may understand the documents well enough that you could have spent that time writing rules instead.

It looks like in your use case you are parsing a TRUSTOR and a TRUSTEE every time you run the parser, as well as a principal AMOUNT and maybe some addresses. I think this is possible with spaCy as long as the documents have some regularity. I know it has some features for finding patterns in grammar, but you'd likely get better accuracy by preprocessing the text before spaCy's ML model extracts info from it.

So, to your last impression: "According to the spaCy training video best practices, I got the impression that spaCy may not be good at distinguishing between the different TYPES of names, and TYPES of addresses. That it would be good at saying 123 E elm street is an address. But may not work for "borrower address" vs "borrowee address"."

I think spaCy is actually surprisingly good at this if there's enough context. It could identify different types of addresses, but it comes down to which approach you use. I'm not sure what the ideal approach is, but I've done it the generic way with Named Entity Recognition, and it's accurate as long as word order is what distinguishes the types.

2 replies

jeff-plummer-radix Jul 7, 2021
Author

That is crazy. Yes, I am getting data from the Maricopa County Recorder's Office.

bowwowden Jul 11, 2021

Jeff, my name's Will, nice to meet you. I'm no data scientist, but I think spaCy is the best tool for this. I agree with the other guy that NER isn't perfect for parsing two types of addresses, but it's the only way I know. I'm only about 80% accurate with it. I do want to try building a text classifier or dependency parser for higher accuracy, because I think 95% should be possible. I think, like he suggests, semantic role labeling is difficult, but since these are very regular legal documents it's possible. I have a mobile app that parses trustee data from the recorder's website, and I built a Selenium scraper that takes a long time but can grab over 700 documents from the recorder's office. I think I found you on LinkedIn; please add me back if I found the right guy.

polm · 2021-05-22T06:01:25Z

polm
May 22, 2021

AutoML and spaCy are apples and oranges and I'm not sure how much sense a comparison makes, but I'll try to answer your question as best I can.

Also, as a note for anyone else reading this: while Prodigy is a great tool and recommended, you really don't have to use it, and there are free alternatives. They tend to be a little more limited or DIY, but if budget is a concern they're an option.

According to the spaCy training video best practices, I got the impression that spaCy may not be good at distinguishing between the different TYPES of names, and TYPES of addresses.

This is not really a limitation of spaCy so much as a limitation of NER technology in general. It's much easier to say "this is a person" than "this is the person who is doing this thing in this circumstance". In fact the second problem is generally called Semantic Role Labelling and is its own problem in the research literature. I would be surprised if Google AutoML would solve that for you out of the box.

Based on your example sentences though, I think you can go a long way with spaCy's default models. If you saw the video that talked about using generic models, you'll remember they mentioned using the dependency parse. You can do that with your example sentences to get useful results in a straightforward way with zero training data.

The Trustor Bob Johnson who lives at 123 W. Elm Street, phoenix, AZ agrees to pay $500 to the Trustee Larry Byrd.

Some patterns you can use here:

Trustor [PERSON] / Trustee [PERSON] → merge entities (so any entity is a single token) and use the Matcher
"Bob Johnson" ... agrees to pay ... Larry Byrd → Use the Dependency Matcher to match "PERSON" "agrees to pay" "PERSON"

You can slowly build on these patterns to improve your coverage. One nice thing about spaCy is that it'll be much better than regex at picking out names for you, and it will also spot monetary amounts as entities you can use.
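As a rough sketch of the first pattern, assuming spaCy v3: the snippet below merges each entity into a single token and then runs the Matcher for a role word followed by a PERSON entity. The PERSON spans are set by hand here so the example runs without downloading a model. In practice they would come from the NER of a pretrained pipeline such as en_core_web_sm, whose output on these documents is not guaranteed. The names `mark_person` and `find_roles` are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only, no trained components


def mark_person(doc, name):
    # Hypothetical stand-in for NER: label the first occurrence of `name`.
    start = doc.text.index(name)
    return doc.char_span(start, start + len(name), label="PERSON")


def find_roles(doc):
    # Merge every entity into one token so a single pattern slot
    # covers a full multi-word name.
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent, attrs={"ent_type": ent.label})
    matcher = Matcher(doc.vocab)
    # "Trustor"/"Trustee" (any casing) directly followed by a PERSON token.
    matcher.add("ROLE_PERSON", [[
        {"LOWER": {"IN": ["trustor", "trustee"]}},
        {"ENT_TYPE": "PERSON"},
    ]])
    return {doc[start].lower_: doc[end - 1].text
            for _, start, end in matcher(doc)}


text = ("The Trustor Bob Johnson who lives at 123 W. Elm Street, phoenix, AZ "
        "agrees to pay $500 to the Trustee Larry Byrd.")
doc = nlp(text)
doc.ents = [mark_person(doc, "Bob Johnson"), mark_person(doc, "Larry Byrd")]
roles = find_roles(doc)
print(roles)
```

The second pattern needs a dependency parse, so it would require a trained pipeline rather than the blank one used here.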

One thing it doesn't do out of the box is label street addresses, but I think there are some tools in the spaCy Universe for that; even if not, it's not that complicated to wire up any existing address detector as a custom component if you need to.

1 reply

jeff-plummer-radix Jul 7, 2021
Author

I think I'm making progress. I've annotated 50 examples for learning, and spaCy is able to find the data in several test documents. So anecdotally it is looking like spaCy will do what I need. I will annotate 150 documents and run some massive test sets to see how well it really works.
