How can I use the spaCy Matcher (or PhraseMatcher) class to extract a sequence of 2 items? #10120

BriskyGates · 2022-01-24T00:36:12Z

I'm trying to move from NLTK to spaCy, and one of the features I need is matching "subtrees" with regex. In simple cases, Matcher does just fine:

matcher = Matcher(nlp.vocab)
matcher.add('GRAMMAR', None, [{'TAG': 'JJ', 'OP': '+'}, {'POS': 'NOUN', 'OP': '+'}])

The problem starts when I need to match only one of the groups. For instance, I may need a noun that follows an adjective, but want the match to contain only the noun and not the entire pattern. In a simple regex, I would put the desired group in parentheses, like so (with an imaginary function):

r'JJ+(NOUN+)'

My temporary solution is to grab only some of the tokens in a callback function, like this:

hits = []
matcher = Matcher(nlp.vocab)
matcher.add('GRAMMAR', lambda matcher, doc, i, matches: hits.append(('GRAMMAR', doc[matches[i][1]+1:matches[i][2]].text)), [{'TAG': 'JJ', 'OP': '+'}, {'POS': 'NOUN', 'OP': '+'}])

However, this solution suffers from several problems:

I want to extract the list of patterns to an external source, so the callback function must be the same for all, though for each pattern I need to select a different group (sometimes the first, sometimes the second, sometimes the entire pattern).
My solution counts tokens. If the pattern involves operators (e.g. *, +) I can't necessarily know at which token my desired match starts/ends.
I'm not sure about this, but I might want to avoid appending the matches to an external list. I prefer a solution that still keeps the matches inside the Matcher object.

polm · 2022-01-24T04:58:52Z

I think you just need to use the with_alignments feature (pass with_alignments=True when you call the matcher). For each match, it gives you a list that tells you which item in the input pattern matched each token. It's a relatively new feature, but it lets you map your matched tokens back to where in the rule they matched, so you can keep only the group you need and treat the non-required parts as optional.
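To make that concrete, here is a minimal sketch of how the alignments could be used, not a definitive implementation. It assumes spaCy v3 (where `matcher.add` takes a list of patterns and calling the matcher with `with_alignments=True` yields `(match_id, start, end, alignments)` tuples). The `PATTERNS` table, its `keep` field, and the `select_group` and `demo` helpers are hypothetical names made up for illustration; the `Doc` is annotated by hand so no trained pipeline is needed.

```python
# Hypothetical external pattern table: each entry stores the token pattern
# plus the indices of the pattern items (the "group") we want to keep.
PATTERNS = {
    "GRAMMAR": {
        "pattern": [{"TAG": "JJ", "OP": "+"}, {"POS": "NOUN", "OP": "+"}],
        "keep": {1},  # keep only tokens matched by the NOUN item
    },
}


def select_group(match, keep):
    """Return (start, end) doc offsets of the tokens aligned to `keep`.

    `match` is one (match_id, start, end, alignments) tuple, as returned by
    matcher(doc, with_alignments=True). Assumes the kept pattern items are
    adjacent in the pattern, so the kept tokens form one contiguous run.
    """
    _, start, _, alignments = match
    kept = [start + i for i, item in enumerate(alignments) if item in keep]
    return (kept[0], kept[-1] + 1) if kept else None


def demo():
    # Imports live here so select_group stays usable without spaCy installed.
    import spacy
    from spacy.matcher import Matcher
    from spacy.tokens import Doc

    nlp = spacy.blank("en")
    # Hand-annotated Doc (illustrative only); a real pipeline would tag it.
    doc = Doc(
        nlp.vocab,
        words=["a", "big", "red", "apple", "pie"],
        tags=["DT", "JJ", "JJ", "NN", "NN"],
        pos=["DET", "ADJ", "ADJ", "NOUN", "NOUN"],
    )

    matcher = Matcher(nlp.vocab)
    for name, spec in PATTERNS.items():
        matcher.add(name, [spec["pattern"]])  # spaCy v3 signature

    results = []
    for match in matcher(doc, with_alignments=True):
        name = nlp.vocab.strings[match[0]]  # match_id hash back to its key
        offsets = select_group(match, PATTERNS[name]["keep"])
        if offsets:
            results.append((name, doc[offsets[0]:offsets[1]].text))
    return results
```

Calling `demo()` returns one `(name, text)` pair per match. Since the `+` operators produce overlapping matches, the same noun group can appear more than once, so you may want to deduplicate. The same loop works for every pattern because the group to keep is stored next to the pattern rather than hard-coded in a callback, and it does not rely on counting tokens.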
