Identification and Access of individual patterns? #8723
Replies: 2 comments 4 replies
-
With the EntityRuler you can attach distinct ids to patterns that share the same label. It's not automatic, but you can give ids to your patterns and use those. Does that solve your use case?
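As a minimal sketch of that suggestion (spaCy v3 assumed; the label `GREETING` and the id strings are made up for illustration), each pattern dict can carry an `"id"` that is later readable from the entity:

```python
import spacy

nlp = spacy.blank("en")  # blank pipeline, no trained model needed
ruler = nlp.add_pipe("entity_ruler")

# Two patterns share one label but carry different ids.
# The label and id values are hypothetical, chosen for this example.
ruler.add_patterns([
    {"label": "GREETING",
     "pattern": [{"LOWER": "hello"}, {"LOWER": "world"}],
     "id": "hello-world-0"},
    {"label": "GREETING",
     "pattern": [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}],
     "id": "hello-world-1"},
])

doc = nlp("Hello, world! Hello world!")
# ent_id_ holds the id of the pattern that produced the entity.
found = [(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents]
print(found)
```

The id is per pattern here only because every pattern was given a different one; patterns that share an id would not be distinguishable this way.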
-
@polm
The idea is to append the index of the matched pattern to the end of the match tuple. This would have almost no impact on current users.

I have a legacy rule base of 3 million entries in a plain text file, where each line can be seen as one rule, and I want to convert them into spaCy rules. Each rule essentially checks that certain words occur in a given linear order, and each rule has a 'trigger' column. An incoming text is not matched against any rule unless it contains at least one trigger word. As an optimization, I want to build a dict that maps each trigger word to a subset of the rules (spaCy matchers).

This task is similar to entity recognition, but not the same: it assigns keywords to a doc based on rule matching. To achieve this, I define the class as follows:
Here, instead of defining a single matcher in the component as is typical, I keep a dict of matchers, and the matchers are created as follows:
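To make the trigger idea concrete, here is a rough plain-Python sketch. It is not the poster's class: the rule format, the sample rules, and all function names are hypothetical, and "linear order" is assumed to mean "in this order, gaps allowed". In the real setup each trigger would map to a spaCy matcher rather than to a list of rule indices.

```python
from collections import defaultdict

# Hypothetical rule format: (trigger word, words that must appear in order).
RULES = [
    ("refund", ["refund", "order"]),
    ("refund", ["want", "refund"]),
    ("late", ["delivery", "late"]),
]

def build_trigger_index(rules):
    """Map each trigger word to the indices of the rules it guards."""
    index = defaultdict(list)
    for rule_idx, (trigger, _words) in enumerate(rules):
        index[trigger].append(rule_idx)
    return index

def in_order(words, tokens):
    """True if `words` occur in `tokens` in the given order (gaps allowed)."""
    remaining = iter(tokens)
    # `word in remaining` consumes the iterator up to the first hit,
    # so each later word must be found after the previous one.
    return all(word in remaining for word in words)

def match_rules(text, rules, index):
    """Return indices of rules that fire, checking only triggered rules."""
    tokens = text.lower().split()
    candidates = sorted({i for tok in tokens for i in index.get(tok, [])})
    return [i for i in candidates if in_order(rules[i][1], tokens)]

index = build_trigger_index(RULES)
print(match_rules("I want a refund for my order", RULES, index))
```

Because only rules reachable from a trigger word in the text are checked, a text without any trigger word costs one dict lookup per token and no rule evaluation.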
Creating 'new_id' is how I establish the correspondence between a pattern's index and its match results. There are also two other reasons I didn't use the EntityRuler:
Would this extension be straightforward to make?
Current users would keep accessing the existing elements (tuple[0] through tuple[2]) as before, and could ignore tuple[3] if they don't want to use it.
-
The match_id 'HelloWorld' corresponds to both of the HelloWorld patterns. The current 'matches' object looks like this:
[(11514107010766861231, 2, 43)]
How hard would it be to add one more element to each match tuple, namely the index of the matched pattern? For these two example inputs:
The match outputs would be:
[(11514107010766861231, 0, 2, 43)]
[(11514107010766861231, 1, 2, 43)]
The added '0' and '1' indicate which specific pattern captured the incoming string, so the 'match_id' may need to differ accordingly. This could be useful for two reasons:
Both would require access to the specific original pattern behind each match. For example, my grammar may look like the following:
What is missing is the correspondence between a match and the pattern that produced it. Everything else is already there.
How hard would it be to add this correspondence to the current system? Any suggestions for a workaround?
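One possible workaround, sketched here under assumptions rather than as an official recipe: register every pattern under its own key, with the pattern index encoded in the key, and decode it after matching. The `#` separator, the sample patterns, and the helper `matches_with_index` are hypothetical; `Matcher.add` and the `nlp.vocab.strings` lookup are used as in spaCy v3.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Two illustrative patterns that would normally share the key "HelloWorld".
patterns = [
    [{"LOWER": "hello"}, {"LOWER": "world"}],
    [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}],
]

SEP = "#"  # hypothetical separator; must not occur in rule names
for idx, pattern in enumerate(patterns):
    # One key per pattern, e.g. "HelloWorld#0" and "HelloWorld#1".
    matcher.add(f"HelloWorld{SEP}{idx}", [pattern])

def matches_with_index(doc):
    """Return (rule_name, pattern_index, start, end) for each match."""
    results = []
    for match_id, start, end in matcher(doc):
        key = nlp.vocab.strings[match_id]  # hash back to "HelloWorld#<idx>"
        name, _, idx = key.rpartition(SEP)
        results.append((name, int(idx), start, end))
    return results

print(matches_with_index(nlp("Hello world")))
print(matches_with_index(nlp("Hello, world")))
```

The trade-off is that every pattern adds its own key string to the vocab's string store, which may matter for a rule base with millions of entries.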