Rule-based matching should be able to specify beginning/end of sentence #5273

jl-ai · 2020-04-08T01:17:56Z

Feature description

I would like to match token sequences if and only if they occur at the end (or, more rarely, the beginning) of a sentence. (This would let me match headings like "Risk factors:\n" but avoid non-heading sentences like "here are the risk factors: financial challenges, history of domestic violence.")

Ideally it would look like this:

pattern = [{"LOWER": "risk"}, {"LOWER": "factors"}, {"TEXT": ":"}, {"END_OF_SENT": True}]
matcher.add("RiskFactorsHeading", None, pattern)

I tried to get clever with regexes and do something like this:

pattern = [{"LOWER": "risk"}, {"LOWER": "factors"}, {"TEXT": ":"}, {"TEXT": {"REGEX": "."}, "OP": "!"}]
matcher.add("RiskFactorsHeading", None, pattern)

but that didn't seem to work.

Could the feature be a custom component or spaCy plugin?

Not as far as I can tell.

Answered by adrianeboyd

adrianeboyd · 2020-04-08T08:24:17Z

Currently the beginning of the sentence is supported as {"IS_SENT_START": True}. There was a bug in v2.2.3 that has been fixed in v2.2.4 for this feature.

There is a PR (#4697) for a similar SENT_END feature, which we need to have another look at, since I think it was 99% finished.
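
For reference, here is a minimal sketch of the sentence-start attribute used in a pattern. It assumes spaCy v3 (where `Matcher.add` takes a list of patterns and `nlp.add_pipe` takes a component name), a blank English pipeline with the rule-based `sentencizer`, and an example text and rule name made up for illustration; none of these come from the thread.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: spaCy v3 API. A blank pipeline plus the rule-based
# sentencizer is enough to get sentence boundaries without a model.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

matcher = Matcher(nlp.vocab)
# The first token must be "risk" (case-insensitive) AND start a sentence.
pattern = [
    {"LOWER": "risk", "IS_SENT_START": True},
    {"LOWER": "factors"},
    {"TEXT": ":"},
]
# "RiskFactorsAtSentStart" is a hypothetical rule name.
matcher.add("RiskFactorsAtSentStart", [pattern])

doc = nlp("Risk factors: smoking. Here are the risk factors: age.")
matches = [doc[start:end].text for _, start, end in matcher(doc)]
print(matches)
```

Only the sentence-initial occurrence should be reported here; the second "risk factors:" sits in the middle of a sentence, so the `IS_SENT_START` constraint rejects it.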

jl-ai · 2020-04-08T14:03:21Z

Ah, that's great news, thanks. (I hadn't seen this in the docs or in the demo.)

delzac · 2020-09-17T08:01:49Z

@adrianeboyd I can get {"IS_SENT_START": True} to work and that's great. But I get an exception for {"IS_SENT_END": True}. Has IS_SENT_END been implemented, or am I missing something?

delzac · 2020-09-17T08:24:33Z

From #5375, it seems like IS_SENT_END has been removed from the Matcher schema. Does anyone have a suggestion on how one might match the end of a sentence?

adrianeboyd · 2020-09-17T08:31:00Z

No, because of how the information is saved internally for the doc, the Matcher doesn't have access to IS_SENT_END. Just using built-in token features, you'd have to look for whether the following token is IS_SENT_START.

Alternatively, if speed is less of an issue, you could potentially copy IS_SENT_END to a custom token extension and use that in your pattern.

delzac · 2020-09-17T08:33:09Z

Got it, thanks for the reply!
