Apostrophes: It's Jess' n' Sam's car. #12552
How to reproduce the behaviour

    import spacy

    nlp = spacy.load('en_core_web_lg')
    for t in nlp("It's Jess' n' Sam's car. No, it's just Jess'."):
        print(f'{t.text:8} {t.lemma_:8} {t.pos_:8} {t.tag_:8}')

Output:
For context, in post-processing I want to merge contractions and possessives into a single word. The incorrect annotations above are indistinguishable from those produced when apostrophes are used as single quotation marks:

    for t in nlp("I like 'apples'."):
        print(f'{t.text:8} {t.lemma_:8} {t.pos_:8} {t.tag_:8}')

Output:
So I can't distinguish the apostrophes that I want to merge from the single quotation marks that I don't want to merge. Given my goal, it probably makes more sense to have a new type of annotation that tells you whether tokens are part of the same word, since some languages may have multi-token words that are not separated by apostrophes.
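To make the goal concrete, here is a minimal sketch of the merge step, assuming the fine-grained tags were reliable. It assumes Penn Treebank-style tags in which "``" and "''" mark opening and closing quotation marks. `Tok`, `QUOTE_TAGS` and `merge_clitics` are hypothetical names used only for illustration, and the tags in the example data are written by hand rather than taken from model output.

```python
from typing import Iterable, List, NamedTuple

# Assumption: Penn Treebank-style fine-grained tags, where "``" and "''"
# mark opening and closing quotation marks.
QUOTE_TAGS = {"``", "''"}


class Tok(NamedTuple):
    """Hypothetical stand-in exposing the two token attributes used below."""
    text: str
    tag_: str


def merge_clitics(tokens: Iterable) -> List[str]:
    """Attach apostrophe-initial tokens ('s, 're, a bare ') and n't to the
    previous word, unless they are tagged as quotation marks."""
    words: List[str] = []
    for tok in tokens:
        looks_like_clitic = tok.text.startswith("'") or tok.text == "n't"
        if looks_like_clitic and tok.tag_ not in QUOTE_TAGS and words:
            words[-1] += tok.text  # glue onto the preceding word
        else:
            words.append(tok.text)
    return words


# Hand-written tags for illustration only, not model output.
sentence = [
    Tok("It", "PRP"), Tok("'s", "VBZ"), Tok("Jess", "NNP"), Tok("'", "POS"),
    Tok("n'", "CC"), Tok("Sam", "NNP"), Tok("'s", "POS"), Tok("car", "NN"),
    Tok(".", "."),
]
quoted = [
    Tok("I", "PRP"), Tok("like", "VBP"), Tok("'", "``"), Tok("apples", "NNS"),
    Tok("'", "''"), Tok(".", "."),
]

print(merge_clitics(sentence))
print(merge_clitics(quoted))
```

Because the function only reads `text` and `tag_`, spaCy tokens could presumably be passed in directly. The result is only as good as the tags, though: if a possessive apostrophe and a closing quote receive the same tag, this step cannot tell them apart.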
Replies: 2 comments 3 replies
Are English contractions still an issue with the latest version of spaCy (3.7.1)?
For abbreviations like "n'" it might be better to have a rule-based exception. For the possessive vs. quote cases, this distinction is present in the annotation scheme for the training corpus, but it's likely that this is ambiguous enough and training examples are rare enough that the trained pipelines like en_core_web_lg are going to make a fair number of mistakes.

Aside from the general recommendation that you can improve the performance by training or fine-tuning a model with more of these kinds of examples (#3052), I'd recommend looking at the dependency parse along with the POS tags to distinguish these cases and consider using en_core_web_trf, which at least for these cases seems to p…
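As a rough illustration of combining the dependency parse with the tags, the sketch below classifies an apostrophe token from both signals. It assumes the English pipelines label possessive markers with the tag "POS" and the dependency "case", and quotation marks with the tags "``" or "''" and the dependency "punct". `Tok`, `classify_apostrophe` and the either-signal rule are hypothetical choices for illustration, not something prescribed in this thread, and the example labels are written by hand.

```python
from typing import NamedTuple

# Assumed label names (Penn Treebank tags, spaCy English dependency labels).
QUOTE_TAGS = {"``", "''"}
POSSESSIVE_TAG = "POS"
POSSESSIVE_DEP = "case"


class Tok(NamedTuple):
    """Hypothetical stand-in exposing the token attributes used below."""
    text: str
    tag_: str
    dep_: str


def classify_apostrophe(tok) -> str:
    """Return 'possessive', 'quote' or 'other' for a token, trusting either
    the tag or the dependency label when one of them signals a possessive."""
    if tok.text not in {"'", "'s"}:
        return "other"
    if tok.tag_ == POSSESSIVE_TAG or tok.dep_ == POSSESSIVE_DEP:
        return "possessive"
    if tok.tag_ in QUOTE_TAGS and tok.dep_ == "punct":
        return "quote"
    return "other"  # e.g. the verbal 's in "It's"


# Hand-written labels for illustration only, not model output.
print(classify_apostrophe(Tok("'", "POS", "case")))    # tag and parse agree
print(classify_apostrophe(Tok("'", "''", "case")))     # parse overrides a quote tag
print(classify_apostrophe(Tok("'", "''", "punct")))    # closing quotation mark
print(classify_apostrophe(Tok("'s", "VBZ", "ROOT")))   # contraction of "is"
```

The function only reads `text`, `tag_` and `dep_`, so tokens from a loaded pipeline could presumably be passed in unchanged. Whether the parser actually disagrees with the tagger on these cases would need to be checked against real output.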