Identification and Access of individual patterns? #8723

lingvisa · 2021-07-14T21:01:36Z

patterns = [
    [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}],
    [{"LOWER": "hello"}, {"LOWER": "world"}]
]
matcher.add("HelloWorld", patterns)

The match_id 'HelloWorld' corresponds to both patterns registered under HelloWorld. The current 'matches' object looks like this:

[(11514107010766861231, 2, 43)]

How hard would it be to add one more element to each match tuple, namely the index of the pattern that matched? For these two example inputs:

'Hello, World'
'Hello World'

The match outputs would be:
[(11514107010766861231, 0, 2, 43)]
[(11514107010766861231, 1, 2, 43)]

The added '0' and '1' indicate which specific pattern captured the incoming string (and therefore the 'match_id' may need to be different). This could be useful for two reasons:

You may want to define a confidence level for each specific pattern, e.g. 0.9 or 0.5
You may want to define a token distance requirement when using RE operators.

Both would require access to the original specific pattern for each match. For example, my grammar may look like the following:

[
  {"book_d62b0ac856b7bc6d07326e9412ede0a3":
    [
      [{XXX}],
      [{XXX}],
      [{XXX}]
    ],
    "trigger": "XXX",
    "distance": "20,20,6",
    "note": "This is an illustrative rule",
    "test": "unit_test_1; unit_test_2; unit_test_3",
    "confidence": "1.0, 0.3, 0.6"
  }
]

To achieve this, what is missing is the correspondence between a match and the pattern that produced it. Everything else is already there.

How hard would it be to add this correspondence to the current system? Any suggestions for a workaround?

polm · 2021-07-15T03:56:51Z

Using the EntityRuler you can add distinct ids to match patterns with the same label. It's not automatic, but you can just give ids to your patterns and use those, for example.

Does that solve your use case?
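
As a rough sketch of what that could look like, assuming spaCy v3 and a blank English pipeline (the id strings "with_punct" and "no_punct" and the helper `matched_ids` are made up for illustration):

```python
import spacy

# Blank English pipeline: tokenizer only, no trained model required.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Two patterns share one label but carry different ids.
# The id values are illustrative, not anything spaCy prescribes.
ruler.add_patterns([
    {"label": "HelloWorld", "id": "with_punct",
     "pattern": [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]},
    {"label": "HelloWorld", "id": "no_punct",
     "pattern": [{"LOWER": "hello"}, {"LOWER": "world"}]},
])


def matched_ids(text):
    """Return (label, pattern id) for every entity the ruler found."""
    doc = nlp(text)
    return [(ent.label_, ent.ent_id_) for ent in doc.ents]


print(matched_ids("Hello, World"))
print(matched_ids("Hello World"))
```

The id comes back on each entity span, so per-pattern metadata such as a confidence value can be kept in an ordinary dict keyed by that id.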

3 replies

lingvisa Jul 15, 2021
Author

@polm The EntityRuler with ids should work. However, my rules are already written in this Matcher-style form:
[{'labelName': []}]
I found another workaround: when parsing the rules, I append the pattern index to the label name:

    for index, pattern in enumerate(patterns):
        new_label = label + '_' + str(index)
        self.matcher.add(new_label, [pattern])

Instead of adding the whole list of patterns under one label, I add each pattern to the matcher as a single-element list under its own label. I hope this won't cause a performance penalty when matching compared with the EntityRuler, since the EntityRuler should be using the Matcher behind the scenes.

lingvisa Jul 15, 2021
Author

Since pattern IDs are already supported in the EntityRuler, could they also be supported in the Matcher's rules, for example:

[{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}, id='x'],
[{"LOWER": "hello"}, {"LOWER": "world"}, id='y']

Or, in the match object, include the pattern index in matched results:

[(0, 11514107010766861231, 2, 43)]
[(1, 11514107010766861231, 2, 43)]

polm Jul 17, 2021

I don't think we're interested in adding that feature to the Matcher - we don't want to change the API if possible, and the feature is already supported, with extras and no particular overhead, in the EntityRuler. If you don't want to add the EntityRuler to your pipeline, note that it's just a component, so you can call it as a function on docs if you want.

Your workaround about appending the pattern index to the label name is actually what the EntityRuler is doing internally to track ids, and shouldn't cause performance issues.
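
For reference, here is a minimal sketch of that workaround with a single Matcher, assuming spaCy v3. The `confidence` list and the helper `match_with_index` are hypothetical names for illustration; the key string is recovered from the match_id through `nlp.vocab.strings`.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

patterns = [
    [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}],
    [{"LOWER": "hello"}, {"LOWER": "world"}],
]
# Illustrative per-pattern metadata, parallel to `patterns`.
confidence = [0.9, 0.5]

# Register each pattern under its own key: "<label>_<index>".
for index, pattern in enumerate(patterns):
    matcher.add(f"HelloWorld_{index}", [pattern])


def match_with_index(text):
    """Return (label, pattern index, start, end) for each match."""
    doc = nlp(text)
    results = []
    for match_id, start, end in matcher(doc):
        key = nlp.vocab.strings[match_id]       # hash -> "HelloWorld_0"
        label, _, index = key.rpartition("_")   # split on the last "_"
        results.append((label, int(index), start, end))
    return results


for text in ("Hello, World", "Hello World"):
    for label, index, start, end in match_with_index(text):
        print(text, label, index, confidence[index])
```

Splitting on the last underscore keeps labels that themselves contain underscores intact, as long as the suffix is always the numeric index.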

lingvisa · 2021-07-19T16:25:20Z

@polm
If the Matcher's match result is in this format, the API change would be minimal:

[(11514107010766861231, 2, 43, 0)]
[(11514107010766861231, 2, 43, 1)]

That is, append the matched pattern index to the end of the tuple. This would have almost zero impact on current users.

I have a 3-million-entry legacy rule base in a plain text file, where each line can be seen as one rule, and I want to convert them into spaCy rules. Each rule basically checks for the presence of certain words in a linear order, and each rule has a 'trigger' column. An incoming text won't be matched against any rule unless it contains at least one trigger word. As an optimization, I want to create a dict mapping trigger words to a subset of rules (spaCy matchers). The task is similar to entity recognition but is not the same: it assigns keywords to a doc based on rule matching. To achieve this, I define the class as follows:

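from collections import defaultdict

class DocLabeller(object):

    def __init__(self, nlp):
        self.nlp = nlp
        # Maps each trigger word to the list of matchers for its rules.
        self.matchers = defaultdict(list)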

Here, instead of the usual single matcher per component, I have a dict of matchers, which are created as follows:

```
for index, pattern in enumerate(patterns):
    # One key per pattern: "<id>_<index>"
    new_id = id + '_' + str(index)
    matcher = Matcher(nlp.vocab, validate=True)
    matcher.add(new_id, [pattern])
    self.matchers[trigger_info['trigger']].append(matcher)
```

Creating 'new_id' is how I establish the correspondence between a pattern index and its match results. In particular, because of this line:
matcher = Matcher(nlp.vocab, validate=True)
I need to create one Matcher object per pattern instead of one per label. I am not sure whether this will use a large amount of memory because of the many Matcher objects (each constructed with nlp.vocab). If the returned match contained the pattern index, I would need only one Matcher for each such loop, which would greatly reduce the number of Matcher objects created.

There are two other reasons I didn't use the EntityRuler:

First, this is a document tagging task, not an entity recognition task, although it uses similar rules.
Second, for a large rule set, the EntityRuler format is more verbose, because the entity label has to be repeated in every pattern.

Would it be straightforward to make this extension?

[(11514107010766861231, 2, 43, 0)]
[(11514107010766861231, 2, 43, 1)]

Current users would keep accessing tuple[0] through tuple[2] as before, and could ignore tuple[3] if they don't want to use it.

1 reply

polm Jul 20, 2021

Thanks for the extra info about your usage pattern.

> If the Matcher's match result is in this format, the API change would be minimal:

Unfortunately even just adding an element to the tuple will break code like this:

for match_id, start, end in matcher(doc):
    ...

We use this style in the docs, so it's probably widely used. So this would not be a minor API change.
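
To make the breakage concrete, here is a small sketch in plain Python. The lists are stand-ins for Matcher output (no spaCy involved), and `spans` is a hypothetical consumer written in the three-name unpacking style:

```python
# Stand-ins for Matcher output in the current and the proposed format.
current_matches = [(11514107010766861231, 2, 43)]
proposed_matches = [(11514107010766861231, 2, 43, 0)]


def spans(matches):
    """Typical consumer code: unpack each match into exactly three names."""
    return [(start, end) for match_id, start, end in matches]


print(spans(current_matches))  # works with 3-tuples

try:
    spans(proposed_matches)
    broke = False
except ValueError as err:
    # Python refuses to unpack a 4-tuple into three names.
    broke = True
    print("unpacking failed:", err)
```

Only callers that already index into the tuple, or unpack with a starred name such as `match_id, start, end, *rest`, would survive the extra element.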

> I have a 3-million entry legacy rule base in a plain text file, where each line can be seen as one rule, and I want to convert them into spaCy's rules.

The Matcher is not designed with supporting that many patterns in mind. It is supposed to be efficient, but if you have that many patterns I would usually recommend using a trie-based lookup to filter input instead.
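
As a very rough sketch of that kind of pre-filter, in plain Python: a token-level trie over trigger phrases that returns the ids of rules worth running on a text. The names `build_trie` and `candidate_rules`, the rule ids, and the whitespace tokenization are all assumptions for illustration; a real setup would presumably use the same tokenization as the rules and a dedicated trie or Aho-Corasick library.

```python
# Sentinel key marking "a trigger ends here"; an object() can never
# collide with a string token.
_END = object()


def build_trie(trigger_to_rules):
    """Build a token-level trie; each complete trigger stores its rule ids."""
    root = {}
    for trigger, rule_ids in trigger_to_rules.items():
        node = root
        for token in trigger.lower().split():
            node = node.setdefault(token, {})
        node.setdefault(_END, set()).update(rule_ids)
    return root


def candidate_rules(trie, text):
    """Return the ids of all rules whose trigger occurs in the text."""
    tokens = text.lower().split()  # naive whitespace tokenization
    found = set()
    for i in range(len(tokens)):
        node = trie
        # Walk the trie from position i for as long as tokens keep matching.
        for token in tokens[i:]:
            node = node.get(token)
            if node is None:
                break
            found |= node.get(_END, set())
    return found


trie = build_trie({
    "book": ["rule_1", "rule_2"],
    "new york": ["rule_3"],
})
print(candidate_rules(trie, "I want to book a flight to New York"))
```

Only the rules returned by the filter would then need to be compiled into, or run through, a Matcher for that text.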
