Rule-based matching should be able to specify beginning/end of sentence #5273
Feature description

I would like to match token sequences if and only if they occur at the end (or, more rarely, the beginning) of a sentence. (This would let me match headings like "Risk factors:\n" but avoid non-heading sentences like "here are the risk factors: financial challenges, history of domestic violence.") Ideally it would look like this:

    pattern = [{"LOWER": "risk"}, {"LOWER": "factors"}, {"TEXT": ":"}, {"END_OF_SENT": True}]
    matcher.add("RiskFactorsHeading", None, pattern)

I tried to get clever with regexes and do something like this:

    pattern = [{"LOWER": "risk"}, {"LOWER": "factors"}, {"TEXT": ":"}, {"TEXT": {"REGEX": "."}, "OP": "!"}]
    matcher.add("RiskFactorsHeading", None, pattern)

but that didn't seem to work.

Could the feature be a custom component or spaCy plugin?

Not as far as I can tell.
Replies: 6 comments
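Currently the beginning of the sentence is supported as `{"IS_SENT_START": True}`. There was a bug in v2.2.3 that has been fixed in v2.2.4 for this feature. There is a PR (#4697) for a similar `SENT_END` feature, which we need to have another look at, since I think it was 99% finished.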
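For readers who have not used the sentence-start attribute, here is a minimal sketch of how `{"IS_SENT_START": True}` can be combined with other token attributes in a pattern. It makes several assumptions that are not from the thread: the sample words are invented, sentence boundaries are set by hand rather than by a parser or sentencizer, and the `matcher.add(name, [pattern])` call form is used (accepted from spaCy v2.2.2 onwards, whereas the posts above use the older `matcher.add(name, None, pattern)` form).

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Invented sample text, pre-tokenized so the example needs no trained model.
words = ["Risk", "factors", ":", "debt", ".",
         "Other", "risk", "factors", ":", "none", "."]
doc = Doc(nlp.vocab, words=words)

# Assumption: sentence starts are set manually at tokens 0 and 5.
# In a real pipeline a parser or sentencizer would set these.
for i, token in enumerate(doc):
    token.is_sent_start = i in (0, 5)

matcher = Matcher(nlp.vocab)
# The first token must be "risk" AND must open a sentence.
pattern = [
    {"LOWER": "risk", "IS_SENT_START": True},
    {"LOWER": "factors"},
    {"TEXT": ":"},
]
matcher.add("RiskFactorsHeading", [pattern])

# Keep only the (start, end) token offsets of each match.
spans = [(start, end) for _, start, end in matcher(doc)]
print(spans)
```

Only the first "Risk factors :" should match here, because the second occurrence begins at token 6, which does not start a sentence.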
Ah, that's great news. Thanks. (I hadn't seen this in the docs or in the demo.)
@adrianeboyd I can get
From #5375, it seems like
No, because of how the information is saved internally for the doc, the Alternatively, if speed is less of an issue, you could potentially copy |
Got it, thanks for the reply!
Currently the beginning of the sentence is supported as `{"IS_SENT_START": True}`. There was a bug in v2.2.3 that has been fixed in v2.2.4 for this feature.

There is a PR (#4697) for a similar `SENT_END` feature, which we need to have another look at, since I think it was 99% finished.
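Until an end-of-sentence attribute is available in patterns, one possible workaround is to run the matcher without any sentence constraint and then discard matches that do not end where a sentence ends. The sketch below is only an illustration of that idea, not the approach taken in the PR mentioned above. The sample words, the hand-set sentence boundaries, and the `sentence_end_offsets` name are all assumptions made for the example.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Invented sample: a heading, then a sentence that mentions the same phrase.
words = ["Risk", "factors", ":",
         "Here", "are", "the", "risk", "factors", ":", "debt", "."]
doc = Doc(nlp.vocab, words=words)

# Assumption: sentence starts are set manually at tokens 0 and 3.
for i, token in enumerate(doc):
    token.is_sent_start = i in (0, 3)

matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "risk"}, {"LOWER": "factors"}, {"TEXT": ":"}]
matcher.add("RiskFactorsHeading", [pattern])

# Exclusive end offsets of every sentence in the doc (hypothetical helper name).
sentence_end_offsets = {sent.end for sent in doc.sents}

# Keep a match only if it stops exactly where some sentence stops.
all_matches = [(start, end) for _, start, end in matcher(doc)]
heading_matches = [(s, e) for s, e in all_matches if e in sentence_end_offsets]
print(all_matches)
print(heading_matches)
```

In real text the heading may be followed by a whitespace token such as "\n" that belongs to the same sentence, so the filter might need to skip trailing whitespace tokens before comparing offsets.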