Variability of words / phrases that are matching with specific label #7570

aleksandar-devedzic · 2021-03-26T00:35:03Z

I want to build an NER model that recognises street addresses.
This kind of data has a lot of variability, for example:

[8237 Monroe Drive
Bountiful, UT 84010
16 Dogwood Ave.
Grand Island, NE 68801
634 Poplar Ave.
Elyria, OH 44035
79 Sheffield Dr.
Cranford, NJ 07016
97 Colonial Dr.
Salem, MA 01970
211 Oakland St.
Yakima, WA 98908]
...
plus many more names (which vary by country, etc.)
...

And if I want to train my model for street recognition:

# Entity offsets are (start_char, end_char, label); end_char is exclusive.
training_data = [('send to: Aargauerstrasse 8005', {'entities': [(9, 29, 'ADDRESS')]}),
                 ('send to: Abeggweg 8057', {'entities': [(9, 22, 'ADDRESS')]}),
                 ('send to: Abendweg 8038', {'entities': [(9, 22, 'ADDRESS')]}),
                 ('send to: Ackermannstrasse 8044', {'entities': [(9, 30, 'ADDRESS')]}),
                 ('send to: Aehrenweg 8050', {'entities': [(9, 23, 'ADDRESS')]}),
                 ('send to: Aemmerliweg 8050', {'entities': [(9, 25, 'ADDRESS')]}),
                 ('send to: Albisgütliweg 8045', {'entities': [(9, 27, 'ADDRESS')]}),
                 ('send to: Albisstrasse 8038', {'entities': [(9, 26, 'ADDRESS')]}),
                 ('send to: Albulastrasse 8048', {'entities': [(9, 27, 'ADDRESS')]}),
                 ('send to: Alderstrasse 8008', {'entities': [(9, 26, 'ADDRESS')]})]

Is it possible to get good results?
Street names can have one to four words in them, can contain numbers, and so on.
Can the model still perform well, given that I'll always be getting different street addresses?

polm · 2021-03-26T04:14:25Z

Whether you can get good results or not depends on what kind of situation you want to recognize street addresses in.

Are you recognizing street addresses written in isolation? Or do you want to parse a whole address, with street, city, country, and so on, and isolate the street part? In either of those cases the pretrained spaCy models won't help you much, because they're trained on text like newspaper articles that uses complete sentences. But you can train a custom model that learns how commas and words like "City" or "St." are significant.

On the other hand, if you're recognizing addresses in sentences ("Mr. Smith lives at 2 3/8 Strawberry Lane..."), the existing models can be a good starting point.

Either way, your current training data is unhelpful - the point of training data is to show the kind of things you want to find in a wide variety of situations, so repeating the same template over and over again won't work.
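As a rough illustration of the point about variety, the sketch below drops the same addresses into several different contexts and computes the entity offsets programmatically instead of by hand. The template sentences, the sample addresses, and the `make_example` / `make_dataset` helpers are all made up for illustration; the output uses the same tuple format as the training data in the question.

```python
import random

# Hypothetical carrier sentences; "{}" marks where the address goes.
TEMPLATES = [
    "send to: {}",
    "Please deliver the parcel to {} before Friday.",
    "Our office moved from {} last year.",
    "{} is where the meeting takes place.",
]

# Sample addresses taken from the question, mixed across formats.
ADDRESSES = ["Aargauerstrasse 8005", "97 Colonial Dr.", "211 Oakland St."]


def make_example(template, address, label="ADDRESS"):
    """Fill the template and return (text, {"entities": [(start, end, label)]})."""
    start = template.index("{}")
    text = template.replace("{}", address, 1)
    end = start + len(address)  # end offset is exclusive
    return text, {"entities": [(start, end, label)]}


def make_dataset(n, seed=0):
    """Build n examples by pairing random templates with random addresses."""
    rng = random.Random(seed)
    return [
        make_example(rng.choice(TEMPLATES), rng.choice(ADDRESSES))
        for _ in range(n)
    ]


training_data = make_dataset(5)
for text, annotations in training_data:
    start, end, label = annotations["entities"][0]
    # Slicing with the offsets should give back exactly the address.
    print(repr(text[start:end]), label)
```

Templates like these are still synthetic, so they can supplement annotated real text but are unlikely to replace it.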

What kind of text do you want to use your address matcher on? Do you want to parse the address into parts or just recognize it?

8 replies

polm Mar 26, 2021

Ideally you should manually create training data, yes. That means going over your documents and marking where the addresses are.

If you have a good tool like Prodigy, doing 1000 examples or so really does not take a long time.

If you do not want to manually label data, you can do what's called weak supervision - you can use rules to create annotations and train a model and improve it gradually over time. For example, with addresses, you could use rules to mark "[anything] St." as an address. This will create false positives, for example you could have "St. John's School" or something, but it's a starting point.
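The rule idea can be sketched with a plain regular expression that proposes candidate annotations for later review or training. The pattern, the suffix list, and the `propose_addresses` name are assumptions for illustration and are not part of spaCy or Prodigy. This particular rule also requires a leading house number, which is stricter than the "[anything] St." rule described above.

```python
import re

# Assumed rule: a house number, one to three capitalized words, then a street suffix.
STREET_RULE = re.compile(
    r"\b\d+\s+(?:[A-Z][a-z]+\s+){1,3}(?:St|Ave|Dr|Drive|Lane|Rd)\b\.?"
)


def propose_addresses(text, label="ADDRESS"):
    """Return (text, {"entities": [...]}) with one span per rule match."""
    spans = [(m.start(), m.end(), label) for m in STREET_RULE.finditer(text)]
    return text, {"entities": spans}


samples = [
    "Ship it to 211 Oakland St. by Monday.",
    "She studied at St. John's School.",  # no house number, so this rule skips it
    "They live at 8237 Monroe Drive now.",
]
for sample in samples:
    text, annotations = propose_addresses(sample)
    print([text[start:end] for start, end, _ in annotations["entities"]])
```

A stricter rule trades false positives for misses: addresses without a house number or with an unlisted suffix go unlabeled, so the proposed spans are a starting point to correct rather than a finished dataset.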

aleksandar-devedzic Mar 26, 2021
Author

Is Prodigy a free tool?
If not, is there a trial version?

polm Mar 26, 2021

Prodigy is not free. You can check out the online demo to get an idea how the UI works, or see the ordering FAQ for information about a trial.

aleksandar-devedzic Mar 26, 2021
Author

Do you think I can solve this kind of problem with rule-based matching?

Please check out the question:
https://stackoverflow.com/questions/66821133/creating-rule-based-matching-with-spacy-and-python-for-detecting-addresses

polm Mar 27, 2021

Yes, I think rule-based matching is a good place to start for your problem. I answered the question on Stack Overflow, but please also read the docs, they explain how to solve problems like this.

Also, please do not ping me on Stack Overflow as well as here; the extra notifications are not helpful.
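For readers who want a concrete starting point for rule-based matching, here is a minimal sketch using spaCy's `Matcher` (spaCy v3 API). It is not the Stack Overflow answer mentioned above; the token pattern, the suffix list, and the `find_addresses` helper are assumptions chosen for US-style addresses like the ones in the question.

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

# A blank English pipeline is enough here: the Matcher only needs the tokenizer.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Assumed suffix list; both forms are listed because the tokenizer may or may
# not keep the trailing period attached to the abbreviation.
SUFFIXES = ["st", "st.", "ave", "ave.", "dr", "dr.", "drive", "lane", "rd", "rd."]

pattern = [
    {"IS_DIGIT": True},             # house number
    {"IS_TITLE": True, "OP": "+"},  # one or more capitalized words
    {"LOWER": {"IN": SUFFIXES}},    # street suffix
]
matcher.add("ADDRESS", [pattern])


def find_addresses(text):
    """Return the text of the longest non-overlapping pattern matches."""
    doc = nlp(text)
    spans = matcher(doc, as_spans=True)
    return [span.text for span in filter_spans(spans)]


print(find_addresses("Please send it to 8237 Monroe Drive in Bountiful."))
```

A pattern like this will not cover addresses such as "Aargauerstrasse 8005", where the number follows the street name, so each address style would need its own pattern.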
