Best way to gather relationships between entities (NER and REL models) #10808

fernandonjardim · 2022-05-16T15:02:24Z

I am trying to create a model using spaCy to output this JSON, given this text.

To approach this problem I've tried combining a NER + REL model, but the results are not very good and labelling the entities takes a LOT of time.

Could you suggest a better approach to tackle this challenge with as little labelling as possible?

ljvmiranda921 · 2022-05-17T05:21:43Z

Hi @fernandonjardim ,

For posterity, I advise you to copy the sample JSON in your post. Be sure to enclose it with triple-backticks so that it's more readable. The GDocs file may get deleted, or its access may change, and future users may not be able to see what the post is about.

For the problem itself, you'd definitely want to label at least a hundred or so samples to get a decent model. However, it seems that some of the entities can be obtained by simple business rules. A combination of that + a model might actually help. You can use spaCy's Matcher for creating some rule sets. You might also want to explore techniques like weak supervision to reduce your labelling costs.
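
As a rough illustration of what such a rule set could look like, here is a minimal sketch with spaCy's Matcher. The rule name `ACCOMMODATION`, the pattern, and the sample sentence are assumptions made up for this example, not something taken from the actual corpus.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline only tokenizes, which is all these token-level rules need.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Hypothetical rule: a number, then "night(s)", then "accommodation".
pattern = [
    {"LIKE_NUM": True},
    {"LOWER": {"IN": ["night", "nights"]}},
    {"LOWER": "accommodation"},
]
matcher.add("ACCOMMODATION", [pattern])

doc = nlp("The package includes 6 nights accommodation at a hostel.")

# Each match is a (match_id, start_token, end_token) tuple.
matches = [
    (nlp.vocab.strings[match_id], doc[start:end].text)
    for match_id, start, end in matcher(doc)
]
print(matches)
```

Rules like this tend to cover the regular cases cheaply, which leaves the model (and the labelling budget) for the irregular ones.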

fernandonjardim · 2022-05-17T13:49:10Z

Hi @ljvmiranda921!

Thanks very much for your fast answer and tips!

So instead of having two models, one for NER and another for relationships between entities (as I am currently doing, following this article), you would recommend going with NER only, doing some auto-labelling and using spaCy's Matcher. Is that right?

In most of my corpus, the entities/tokens can be really spread out, as follows:

[Image: image_2022-05-17_143613693]

In the example above, the label "included" (third in blue) has several children ("optionals", in black), and some are really far away, in some cases in other paragraphs. Would the matcher still work in this case?

What I guess I am missing is how to use the matcher on these long-distance cases (picture). With the "Hello, World" example it seems easier haha

Also, some spans can get more than one label. For instance, when I am describing the itinerary on "Day 2", the span "schooner tour" belongs to the hash "day_2" but also to the "not_included" one. Can the matcher capture those nuances too?

Super thanks for your availability!

1 reply

ljvmiranda921 May 19, 2022

> So instead of having two models, one for NER and another for relationships between entities (as I am currently doing, following this article), you would recommend going with NER only, doing some auto-labelling and using spaCy's Matcher. Is that right?

I'm thinking about using a combination of business rules and NER for extracting the entities. Once you have those entities, you can apply whatever downstream tasks you need (getting relationships, etc.).

> Also, some spans can get more than one label. For instance, when I am describing the itinerary on "Day 2", the span "schooner tour" belongs to the hash "day_2" but also to the "not_included" one. Can the matcher capture those nuances too?

It might be possible to configure a very custom Matcher for that (i.e., create multiple rules that obtain the same thing). If you're performing entity extraction on overlapping entities, perhaps you can check out the spancat component.
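
To make the overlap point concrete: `doc.ents` cannot hold two entities over the same tokens, but `doc.spans` can, and that is the container spancat predicts into. The sketch below fills it by hand instead of training a model; the sentence is invented for illustration, and the labels are borrowed from the question.

```python
import spacy

nlp = spacy.blank("en")
text = "Day 2: free morning, then an optional schooner tour."
doc = nlp(text)

start = text.index("schooner tour")
end = start + len("schooner tour")

# doc.char_span returns None if the offsets do not align with token boundaries.
day_span = doc.char_span(start, end, label="day_2")
extra_span = doc.char_span(start, end, label="not_included")

# "sc" is the default spans key used by spancat; here it is filled manually.
doc.spans["sc"] = [day_span, extra_span]

labels = [(span.text, span.label_) for span in doc.spans["sc"]]
print(labels)
```

The same two spans could also come from two separate Matcher rules, one per label, with the results written into `doc.spans` in the same way.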

fernandonjardim · 2022-05-19T10:40:44Z

Hi @ljvmiranda921 !

Thanks very much for your answer!

I am still a bit blocked on getting the relations, and on when it is better to use business rules versus NER.

I've been through all the spaCy docs, and all of them use short sentences as examples. When you get a sentence such as the following, it is easy to create business rules using a matcher or to do dependency analysis:

"Apple is opening its first big office in San Francisco"

What I am not getting is how to use these tools on long texts, in which the children / dependencies are spread across many sentences, as in the following example (the image copied as text; this is a bit of our corpus):

REVEILLON 2022 ARRAIAL DO CABO DATE 27/12/2021 to 02/01/2022. PACKAGES INCLUDE Transport by executive bus All our partners have the certifications required by ARTESP and ANTT as well as day maintenance and passenger insurance. 6 nights accommodation @ a Hostel, during the whole period of the trip Accommodation at a Hostel with excellent location 3 minutes from Praia dos Anjos 5 minutes from Praia Grande 5 minutes from the trail to Praia do Forno. Near by restaurants bars markets pharmacies and other establishments. Our check-in will be immediate when we arrive at the place on the morning of the 28th. We do not need to follow the hotel system where the check-in is from 14.00. - breakfast included on 28 29 30 31 01 and 02. Welcoming to Arraial do Cabo Barbecue  On the first night in Arraial do Cabo we will have the welcome barbecue with a variety of salted meats and garnishes. NYE Party with Open bar Vodka gin catuaba beer energy drink refrigerant and water. Guided walks to Praia do Forno.

to extract this info:

{"included": ["Transport by executive bus", "passenger insurance", "6 nights accommodation @ a Hostel", "breakfast included on 28 29 30 31 01 and 02", "Welcoming to Arraial do Cabo Barbecue", "On the first night in Arraial do Cabo we will have the welcome barbecue with a variety of salted meats and garnishes.", "NYE Party with Open bar Vodka gin catuaba beer energy drink refrigerant and water", "Guided walks to Praia do Forno"]}

For this case, would you suggest NER labelling? Or somehow still using business rules? If business rules, would you mind providing just a very simple example, so I can replicate it for all the other entities / relations I need to extract?

Super thanks.

1 reply

ljvmiranda921 May 25, 2022

Hi @fernandonjardim ,

My advice would be to try both: (1) see if the business rules cover at least 70-80% of your use cases, then (2) see if NER can do the same. Looking at your text, it's still a bit unclear what counts as an entity (there's no consistent definition), so you might get more mileage by trying both and seeing what sticks.

For example, the "included" tag doesn't seem to have a consistent pattern to train a model upon. As a naive example of a pure business rule approach, what might work is to look for texts after a specific keyword (or between two keywords) and then prune the output to get the thing that you want.
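
A minimal sketch of that keyword idea in plain Python, with no spaCy involved. The helper name `text_between`, the shortened sample text, and the closing keyword "NOT INCLUDED" are assumptions made for illustration; the real corpus may use different section markers.

```python
import re


def text_between(text, start_kw, end_kw=None):
    """Return the text after start_kw, up to end_kw if it is given and found."""
    start = text.find(start_kw)
    if start == -1:
        return ""
    start += len(start_kw)
    end = text.find(end_kw, start) if end_kw else -1
    if end == -1:
        end = len(text)
    return text[start:end].strip()


# Shortened, hypothetical version of the sample text from the question.
sample = (
    "DATE 27/12/2021 to 02/01/2022. PACKAGES INCLUDE Transport by executive bus. "
    "6 nights accommodation @ a Hostel. Guided walks to Praia do Forno. "
    "NOT INCLUDED Lunch and dinner."
)

included_raw = text_between(sample, "PACKAGES INCLUDE", "NOT INCLUDED")

# Prune: split on sentence-final periods and drop empty leftovers.
included = [s.strip(" .") for s in re.split(r"\.\s+", included_raw) if s.strip(" .")]
print({"included": included})
```

The pruning step is where most of the real work would go, since a plain split on periods will not separate items that the source text runs together without punctuation.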

fernandonjardim · 2022-06-03T22:46:47Z

On a scale of 0 to 10, could you please roughly tell me how hard this challenge is?

I've been on it for months and I just cannot ship it in a simple way that is not NER + Relationship.

Having things spread out across the text is what really makes this hard. For instance, sometimes I have things in the same chunk: "Boarding @ 11 @ Metro Vergueiro", and sometimes they are spread over the following text: "bla bla bla bla bla bla boarding: we will meet in front of the shopping and we will meet @ 11 @ metro Vergueiro", and this makes it hard to come up with a scheme for it.

27 replies

fernandonjardim Sep 1, 2022
Author

Super thanks @polm

I am having some issues with PhraseMatcher, could you give me a hand?

I am trying to approach the same problem, but using the PhraseMatcher instead of the Entity Ruler.

My goal is to return a doc in which 22h22 is tagged, but it seems that the matcher has not been added to the pipeline.

Here is my full code


#installs
!python -m spacy download pt

#imports
import spacy 
from spacy import displacy
from spacy.matcher import Matcher
from spacy.matcher import PhraseMatcher

#creating doc
nlp_test_time = spacy.blank("pt")
doc = nlp_test_time("the time now is 22h22")
doc.ents
# gets () here

#adding PhraseMatcher to pipe
matcher_time = PhraseMatcher(nlp_test_time.vocab, attr="SHAPE")
matcher_time.add("TIME", [nlp_test_time("22h12")])

#applying pipeline to doc
nlp_test_time(doc)
nlp_test_time.ents
# gets () here too, when I thought I was going to get `22h12`

Here I was expecting to get 22h12 as TIME

Do you know what I could be doing wrong?

Super thanks in advance (:

polm Sep 2, 2022

#adding PhraseMatcher to pipe
matcher_time = PhraseMatcher(nlp_test_time.vocab, attr="SHAPE")

This doesn't add a PhraseMatcher to your pipeline, it just creates a PhraseMatcher. To add to your pipeline you need to call nlp.add_pipe, but the PhraseMatcher isn't a component. You can use the EntityRuler, which wraps the PhraseMatcher.
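
For completeness, a standalone PhraseMatcher can still be used outside the pipeline: call it on a doc and write the matches to `doc.ents` yourself. This is only a sketch of that manual route, using a blank English pipeline as an assumption; the EntityRuler does the equivalent work for you.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

# Match on token shape, so "22h12" stands in for any "ddxdd" token.
matcher = PhraseMatcher(nlp.vocab, attr="SHAPE")
matcher.add("TIME", [nlp.make_doc("22h12")])

doc = nlp("the time now is 22h22")

# as_spans=True returns Span objects labelled with the match key.
doc.ents = matcher(doc, as_spans=True)

ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

Note that assigning to `doc.ents` raises an error if the matched spans overlap, so a real rule set would need a filtering step such as `spacy.util.filter_spans` first.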

fernandonjardim Sep 2, 2022
Author

Hi @polm

Thanks very much. I still couldn't work out how to add the PhraseMatcher to the ruler. What I am trying to accomplish is to have a doc with the entities, not just the entities.

ruler = nlp_test_time.add_pipe("entity_ruler")
matcher_time = PhraseMatcher(nlp_test_time.vocab, attr="SHAPE")
matcher_time.add("TIME", [nlp_test_time("22h12")])
ruler.add_patterns("TIME")

Do I need a decorator or something like that ?

polm Sep 2, 2022

You don't create the PhraseMatcher manually - the EntityRuler has a PhraseMatcher internally. Please read the EntityRuler documentation - you can set a phrase_matcher_attr, and then any strings you pass will be matched on that attribute by the internal PhraseMatcher.

In this case, since you're only matching one token, you can just use a normal Matcher rule in the EntityRuler though.

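import spacy

# A blank pipeline is enough: SHAPE is a lexical attribute set by the tokenizer.
nlp = spacy.blank("en")

# The EntityRuler accepts token patterns (lists of dicts) as well as phrase strings.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "TIME", "pattern": [{"SHAPE": "ddxdd"}]}])

# Running the pipeline returns a doc with the entities already set.
doc = nlp("The time is 12h08")
for ent in doc.ents:
    print(ent.label_, ent, sep="\t")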
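
For comparison, the `phrase_matcher_attr` route mentioned above might look like the sketch below. It assumes the spaCy v3 `entity_ruler` factory and a blank English pipeline; the pattern string "22h12" only serves as a shape template.

```python
import spacy

nlp = spacy.blank("en")

# String patterns are matched by the ruler's internal PhraseMatcher,
# here configured to compare token shapes instead of exact text.
ruler = nlp.add_pipe("entity_ruler", config={"phrase_matcher_attr": "SHAPE"})
ruler.add_patterns([{"label": "TIME", "pattern": "22h12"}])

doc = nlp("the time now is 22h22")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

Both variants give a doc with the entities attached; the token-pattern version is more explicit about what is being matched, while the phrase version lets you write patterns as example strings.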

fernandonjardim Sep 2, 2022
Author

Super thanks for all your support @polm, you rock!
If you're ever in Lisbon, let me know. I need to buy you some beers haha

fernandonjardim · 2022-10-11T07:17:54Z

Super thanks for the feedback polm ((:

…

On Fri, Aug 26, 2022, 6:53 AM polm wrote:

> Note: I figured out what was up with the web demo - it was highlighting all entities, not just entities created by matching the rules. So the example you had was being matched as a QUANTITY or something. This issue with the web demo has been resolved.
