Packaging EntityRuler and adding it to another pipeline #9776
I suppose this is largely because I'm fairly new to spaCy and NLP in general, but the documentation on "Rule-based Matching" is incredibly confusing to me.

For example, I'm reading the documentation on how to use the entity ruler, and the first example shows how to add different patterns by using [...]. It's not clear why it first uses [...].

A second thing I've found confusing is that there's really no information on how to actually build and package your own models. The examples I've found when searching around are all fairly similar, but none show how to build a real project.

For example, I want to build a custom entity ruler based on a dictionary of words I have. How should I go about that? Do I create a blank spaCy object and add "entity_ruler" as a pipe? How do I then add this model to another pipeline that loads "en_core_web_lg"?

Again, this is all probably due to my lack of understanding of NLP. But I've been looking at this for a week or two now, and I just can't get past the basic examples and get a sense of how I actually go about building something with it.
Sorry you're having trouble with this. I'll start with the question from the title here and come back to your other questions.
The entity ruler works using patterns, so you should loop over your dictionary to create patterns. Exactly which patterns depends on the kind of match you want. If you want to match every term regardless of case, for instance, you could add the terms as phrase patterns and make the entity ruler match on the LOWER attribute. That would look a bit like this.
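As a minimal sketch of that idea (not the code from the original reply, which is missing here): the `terms` dictionary and its labels are made up for illustration, and `phrase_matcher_attr` is the entity ruler setting that controls which token attribute string patterns are matched on.

```python
import spacy

# Hypothetical dictionary of terms, keyed by the entity label to assign.
terms = {
    "FRUIT": ["Granny Smith", "banana"],
    "COLOR": ["dark red"],
}

# A blank pipeline only contains a tokenizer, which is enough for
# matching literal strings.
nlp = spacy.blank("en")

# Match string patterns on the lowercase form of each token.
ruler = nlp.add_pipe("entity_ruler", config={"phrase_matcher_attr": "LOWER"})

# Loop over the dictionary to build one phrase pattern per term.
patterns = [
    {"label": label, "pattern": term}
    for label, words in terms.items()
    for term in words
]
ruler.add_patterns(patterns)

doc = nlp("I ate a GRANNY SMITH and a Banana.")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

Because the patterns are plain strings, the ruler treats them as phrase patterns, so differently cased spellings in the text are still matched.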
You don't have to use a blank pipeline here, but it will be faster; see here for notes. On the other hand, if you want to match on values set by other components, such as part-of-speech tags, you need to include the components that set those values. If you're just matching literal strings, a blank pipeline should be the best choice.
To add it to another pipeline, you source the component. Suppose you create your component and save it to disk like this:
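A minimal sketch of the saving step, assuming a single illustrative pattern and a temporary directory as the output location (both are placeholders, not taken from the original reply):

```python
import tempfile
from pathlib import Path

import spacy

# Build a blank pipeline that contains only the entity ruler.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler", config={"phrase_matcher_attr": "LOWER"})
ruler.add_patterns([{"label": "FRUIT", "pattern": "granny smith"}])

# Hypothetical output location; any writable directory works.
output_dir = Path(tempfile.mkdtemp()) / "my_ruler_pipeline"

# Saving the whole pipeline writes its config and the ruler's patterns.
nlp.to_disk(output_dir)
print(sorted(p.name for p in output_dir.iterdir()))
```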
You could then add it to another pipeline like this:
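A minimal sketch of the sourcing step follows. To keep it runnable without downloading a model, it first recreates a small saved pipeline and uses a blank pipeline as the target; both are stand-ins, and in the setup described in this thread the target would be loaded with `spacy.load("en_core_web_lg")` instead.

```python
import tempfile
from pathlib import Path

import spacy

# Setup for this sketch only: a saved pipeline containing an entity ruler.
ruler_dir = Path(tempfile.mkdtemp()) / "my_ruler_pipeline"
ruler_nlp = spacy.blank("en")
saved_ruler = ruler_nlp.add_pipe(
    "entity_ruler", config={"phrase_matcher_attr": "LOWER"}
)
saved_ruler.add_patterns([{"label": "FRUIT", "pattern": "granny smith"}])
ruler_nlp.to_disk(ruler_dir)

# Target pipeline. Assumption: a blank pipeline stands in for the
# trained pipeline (e.g. "en_core_web_lg") used in the thread.
nlp = spacy.blank("en")

# Load the saved pipeline and copy its entity ruler into the target.
source_nlp = spacy.load(ruler_dir)
nlp.add_pipe("entity_ruler", source=source_nlp)

doc = nlp("A Granny Smith, please.")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

With a trained pipeline that already has an "ner" component, passing `before="ner"` to `add_pipe` is one option if the ruler's entities should take priority over the statistical predictions.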
The docs on sourcing pipelines also cover how to do it in a config.
There is no easy way to do this. While using named entities as features for document classification is done sometimes, it's not very common. In particular, if you're just matching literal strings it probably doesn't provide much over what the text classifier would learn itself, since it will already learn values for all the words it sees. Some of the ways you could use the entities would be:
Sorry that was confusing. The difference between these two ways of creating the pipeline isn't really important here -
Have you seen the example projects? Not sure if it's what you had in mind, but they cover data preprocessing, model training, and packaging the final model.
I will try to create a new discussion. I'm not sure how to digest that second-to-last sentence though, the one saying that if the ruler is not created that way there is no way for it to be connected to the nlp object. Is that nlp object the one that ends up being loaded with spacy.load() after building and installing the package? Also, I don't actually want the EntityRuler in the pipeline; the goal was for the pii component to be added to the pipeline.
Hi @darrkj, it definitely takes some time to get used to the v3 config and projects systems. Have you had a chance to look at the videos we published to explain these concepts in more detail? See https://www.youtube.com/watch?v=9k_EfV7Cns0 and https://www.youtube.com/watch?v=BWhh3r6W-qE&t=2s. Understanding the overall design can really help when implementing a specific pipeline for a specific use case.
You should definitely be able to use the [...]

I have to admit that I find it very hard to follow all your other questions and comments. An example code snippet or config file typically paints a thousand words. If you have further questions, please create a new discussion thread and share actual code (not screenshots) to help us better understand the specific issues you run into. Thanks!
Could someone please provide a short example of how this EntityRuler init method can be used (from spaCy docs):
Specifically, how can an EntityRuler object created in this way be added to a pipeline? Or is it meant to be used in an entirely different way, or for a different purpose?