Partial match of spans in Matcher with ENT_TYPE #4930
If I create a Matcher to match two GPEs with a + sign in between (GPE+GPE), it only matches the last token of the first GPE and the first token of the second GPE. I would expect it to match the full content of both GPEs. So, for "San Francisco + New York" I would expect the matcher to match the whole sentence, while in reality it only matches "Francisco + New". The Rule Matcher Explorer behaves the same way, so I am not sure whether this is in fact the expected behaviour.
Replies: 1 comment
What you're trying to do makes sense, but you have to take into account that the `Matcher` always matches on `Token` level. So the expression `{"ENT_TYPE": "GPE"}` matches exactly one `Token` which is part of a `GPE` entity. Each entity here consists of two tokens, which is why you're getting just "Francisco" and just "New" instead of the full entity.

To match more than one token, you can use the `+` operator like so: `patterns = [{"ENT_TYPE": "GPE", "OP": "+"}, {"ORTH": "+"}, {"ENT_TYPE": "GPE", "OP": "+"}]`

Before 2.1.0, this operator would behave greedily and would pretty much return exactly what you want. Unfortunately because of possible mixing of operators, this greedy behaviour was not consis…

You could loop through these and select the longest match from the overlapping ones.
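The idea of keeping only the longest match from each group of overlapping matches can be sketched in plain Python. This is a minimal illustration, not spaCy's own implementation: it assumes the matches arrive as `(match_id, start, end)` tuples, which is the shape `Matcher` returns, and `longest_matches` is a hypothetical helper name. The token indices in the example are written by hand for "San Francisco + New York" rather than produced by a real pipeline.

```python
from typing import List, Tuple

# (match_id, start, end), with `end` exclusive, as returned by Matcher
Match = Tuple[int, int, int]


def longest_matches(matches: List[Match]) -> List[Match]:
    """Keep the longest match from each group of overlapping matches.

    Hypothetical helper: prefers longer matches, and the earlier one on ties.
    """
    # Sort by length (longest first), then by start position.
    ordered = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    kept: List[Match] = []
    used_tokens = set()
    for match_id, start, end in ordered:
        # Skip any match that shares a token with one we already kept.
        if any(i in used_tokens for i in range(start, end)):
            continue
        kept.append((match_id, start, end))
        used_tokens.update(range(start, end))
    # Return the survivors in document order.
    return sorted(kept, key=lambda m: m[1])


# Hand-written candidates for "San Francisco + New York" (tokens 0..4).
candidates = [(1, 1, 4), (1, 0, 4), (1, 1, 5), (1, 0, 5)]
best = longest_matches(candidates)
```

With a real `Doc`, the tuples would come from `matcher(doc)`, and `doc[start:end]` gives the matched span for each result. For `Span` objects, `spacy.util.filter_spans` applies a similar longest-first rule.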