Partial match of spans in Matcher with ENT_TYPE #4930

fizban99 · 2020-01-21T13:36:15Z

If I create a Matcher trying to match two GPEs with a + sign in-between (GPE+GPE), the matcher only matches the last token of the first GPE and the first token of the second GPE. I would expect to match the full content of both GPEs.

So, for "San Francisco + New York" I would expect the matcher to match the whole sentence, while in reality it only matches "Francisco + New".

The Rule Matcher Explorer behaves the same way, so I am not sure this is in fact the expected behaviour.

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span  # needed for Span(doc, start, end) below

nlp = spacy.load('en_core_web_sm')

doc = nlp("San Francisco + New York")
for ent in doc.ents:
    print(ent.label_ + ": " + ent.text)

matcher = Matcher(nlp.vocab)
# One dict per token: a GPE token, a literal "+", another GPE token
patterns = [{"ENT_TYPE": "GPE"}, {"ORTH": "+"}, {"ENT_TYPE": "GPE"}]
matcher.add("DOUBLE_GPE", None, patterns)
matches = matcher(doc)
for match_id, start, end in matches:
    print(nlp.vocab.strings[match_id] + ": " + Span(doc, start, end).text)

This returns:

GPE: San Francisco
GPE: New York
DOUBLE_GPE: Francisco + New

I would have expected:

GPE: San Francisco
GPE: New York
DOUBLE_GPE: San Francisco + New York

Your Environment

spaCy version: 2.2.3
Platform: Linux-4.15.0-52-generic-x86_64-with-debian-buster-sid
Python version: 3.7.3

svlandeg · 2020-01-21T20:21:14Z

What you're trying to do makes sense, but keep in mind that the Matcher always matches at the token level. The expression {"ENT_TYPE": "GPE"} matches exactly one token that is part of a GPE entity. Since each entity here consists of two tokens, you get just "Francisco" and just "New" instead of the full entities.

To match more than one token, you can use the + operator like so:
patterns = [{"ENT_TYPE": "GPE", "OP": "+"}, {"ORTH": "+"}, {"ENT_TYPE": "GPE", "OP": "+"}]
Before 2.1.0, this operator behaved greedily and would have returned pretty much exactly what you want. Unfortunately, that greedy behaviour was not consistent when operators were mixed, so it was replaced by returning all possible matches. The above pattern therefore gives you all 4 combinations:

DOUBLE_GPE: Francisco + New
DOUBLE_GPE: San Francisco + New
DOUBLE_GPE: Francisco + New York
DOUBLE_GPE: San Francisco + New York

You could loop through these and select the longest match from the overlapping ones.
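A minimal sketch of that loop, in case it helps. It works on plain (match_id, start, end) tuples, which is the shape the Matcher returns, so it runs without a model. The helper name `longest_matches` and the placeholder match_id `1` are made up for illustration and are not part of spaCy.

```python
def longest_matches(matches):
    """Keep the longest match from each group of overlapping matches.

    `matches` is a list of (match_id, start, end) tuples, the shape
    returned by calling a spaCy Matcher on a Doc (end is exclusive).
    """
    # Longest first; ties go to the match that starts earlier.
    ordered = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    kept = []
    seen = set()  # token indices already covered by a kept match
    for match_id, start, end in ordered:
        if any(i in seen for i in range(start, end)):
            continue  # overlaps a longer match that was already kept
        kept.append((match_id, start, end))
        seen.update(range(start, end))
    # Return the survivors in document order.
    return sorted(kept, key=lambda m: m[1])


# The four overlapping matches for "San Francisco + New York"
# (tokens: San=0, Francisco=1, +=2, New=3, York=4). The match_id 1 is
# a placeholder for the hash the Matcher would actually return.
matches = [(1, 1, 4), (1, 0, 4), (1, 1, 5), (1, 0, 5)]
result = longest_matches(matches)  # only the span covering tokens 0..4 survives
```

With real Matcher output you would pass `matcher(doc)` straight in and then print `doc[start:end].text` for each surviving tuple. If you are working with Span objects rather than raw tuples, `spacy.util.filter_spans` should do a similar longest-first filtering, assuming your installed version includes it.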
