For abbreviations like n', a rule-based exception is probably the better option. The possessive vs. quote distinction is present in the annotation scheme for the training corpus, but these cases are ambiguous enough, and training examples rare enough, that trained pipelines like en_core_web_lg are likely to make a fair number of mistakes.
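As a rough sketch of what such a rule-based exception could look like, the snippet below keeps n' together as one token with a tokenizer special case and then assigns its tag with an attribute ruler pattern. It uses a blank English pipeline so that it runs without a downloaded model. The choice of CC / CCONJ for n' (reading it as "and") and the example phrase are assumptions made for illustration, not something stated in this thread.

```python
import spacy
from spacy.attrs import ORTH

# Blank English pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")

# Keep "n'" as a single token instead of letting the apostrophe be split off.
nlp.tokenizer.add_special_case("n'", [{ORTH: "n'"}])

# Rule-based override: whenever the token text is exactly "n'",
# set its fine-grained tag and coarse POS.
# Assumption: "n'" stands for "and", hence CC / CCONJ.
ruler = nlp.add_pipe("attribute_ruler")
ruler.add(patterns=[[{"ORTH": "n'"}]], attrs={"TAG": "CC", "POS": "CCONJ"})

doc = nlp("rock n' roll")
tagged = [(token.text, token.tag_, token.pos_) for token in doc]
print(tagged)
```

In a trained pipeline that already contains an attribute ruler, the same pattern would be added to the existing component (retrieved with nlp.get_pipe("attribute_ruler")) rather than to a new one.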

Aside from the general recommendation that you can improve performance by training or fine-tuning a model with more of these kinds of examples (#3052), I'd recommend looking at the dependency parse along with the POS tags to distinguish these cases, and considering en_core_web_trf, which at least for these cases seems to p…
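One way to combine the dependency parse with the POS tags is sketched below. The helper classify_apostrophe is hypothetical, and its rules are assumptions based on the label scheme of the English pipelines: a possessive marker is expected to carry the tag POS with the dependency label case, while quotation marks are tagged `` or '' and attached as punct. The two Doc objects are built by hand to stand in for parser output, so the example runs without a downloaded model.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab


def classify_apostrophe(token):
    """Heuristic (assumed rules): combine the fine-grained tag and dependency label."""
    if token.tag_ == "POS" and token.dep_ == "case":
        return "possessive"
    if token.tag_ in ("``", "''") and token.dep_ == "punct":
        return "quote"
    return "unclear"


vocab = Vocab()

# Hand-annotated stand-in for a parsed possessive: "the dogs ' bowl".
# Heads are absolute token indices.
possessive_doc = Doc(
    vocab,
    words=["the", "dogs", "'", "bowl"],
    tags=["DT", "NNS", "POS", "NN"],
    pos=["DET", "NOUN", "PART", "NOUN"],
    heads=[1, 3, 1, 3],
    deps=["det", "poss", "case", "ROOT"],
)

# Hand-annotated stand-in for a quoted word: "' hello '".
quote_doc = Doc(
    vocab,
    words=["'", "hello", "'"],
    tags=["``", "UH", "''"],
    pos=["PUNCT", "INTJ", "PUNCT"],
    heads=[1, 1, 1],
    deps=["punct", "ROOT", "punct"],
)

for doc in (possessive_doc, quote_doc):
    for token in doc:
        if token.text == "'":
            print(token.text, token.tag_, token.dep_, token.head.text,
                  classify_apostrophe(token))
```

With a trained pipeline loaded through spacy.load, the same function can be applied to each apostrophe token of a processed text. Tokens that come back as "unclear" are the ones where the tag and the parse disagree and are worth inspecting by hand.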

Answer selected by svlandeg
Labels: lang / en (English language data and models), feat / tagger (Feature: Part-of-speech tagger)

This discussion was converted from issue #12468 on April 20, 2023 06:36.