Apostrophes: It's Jess' n' Sam's car. #12552

ryanheise · 2023-03-27T07:18:07Z

How to reproduce the behaviour

import spacy

nlp = spacy.load('en_core_web_lg')
for t in nlp("It's Jess' n' Sam's car. No, it's just Jess'."):
    print(f'{t.text:8} {t.lemma_:8} {t.pos_:8} {t.tag_:8}')

Output:

It       it       PRON     PRP
's       be       AUX      VBZ       <--- correct
Jess     Jess     PROPN    NNP
'        '        PART     POS       <--- correct
n        n        CCONJ    CC
'        '        PUNCT    ''        <--- incorrect
Sam      Sam      PROPN    NNP
's       's       PART     POS       <--- correct
car      car      NOUN     NN
.        .        PUNCT    .
No       no       INTJ     UH
,        ,        PUNCT    ,
it       it       PRON     PRP
's       be       AUX      VBZ       <--- correct
just     just     ADV      RB
Jess     Jess     PROPN    NNP
'        '        PUNCT    ''        <--- incorrect
.        .        PUNCT    .

For context, in post-processing I want to merge contractions and possessives into a single word. The incorrect annotations above are indistinguishable from the case where ' is used merely as a single quotation mark:

for t in nlp("I like 'apples'."):
    print(f'{t.text:8} {t.lemma_:8} {t.pos_:8} {t.tag_:8}')

Output:

I        I        PRON     PRP
like     like     VERB     VBP
'        '        PUNCT    ``
apples   apple    NOUN     NNS
'        '        PUNCT    ''        <--- indistinguishable
.        .        PUNCT    .

So I can't distinguish the apostrophes that I want to merge from the single quotation marks that I don't want to merge.

Given my goal, it probably makes more sense to have a new type of annotation that tells you whether tokens are part of the same word, since some languages may have multi-token words that are not separated by apostrophes.
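
To make the merging goal concrete, here is a minimal sketch of the post-processing step, assuming a simple heuristic: glue an apostrophe-initial token onto the previous word when there is no whitespace between them and the tag is not a quote tag. The helper names `rows_from_doc` and `merge_apostrophe_tokens` are hypothetical, and the example rows reuse the en_core_web_lg tags shown above rather than calling the model.

```python
QUOTE_TAGS = {"``", "''"}  # Penn Treebank tags for opening and closing quotes
APOSTROPHES = ("'", "\u2019")  # straight and curly apostrophe


def rows_from_doc(doc):
    """Extract (text, tag, has_trailing_space) from a spaCy Doc (hypothetical helper)."""
    return [(t.text, t.tag_, bool(t.whitespace_)) for t in doc]


def merge_apostrophe_tokens(rows):
    """Glue apostrophe-initial tokens onto the preceding word unless tagged as quotes."""
    words = []
    glued = False  # True when the previous token had no trailing whitespace
    for text, tag, has_trailing_space in rows:
        if words and glued and text.startswith(APOSTROPHES) and tag not in QUOTE_TAGS:
            words[-1] += text
        else:
            words.append(text)
        glued = not has_trailing_space
    return words


# Tags copied from the en_core_web_lg output above; whitespace flags follow the input string.
rows = [
    ("It", "PRP", False), ("'s", "VBZ", True), ("Jess", "NNP", False), ("'", "POS", True),
    ("n", "CC", False), ("'", "''", True), ("Sam", "NNP", False), ("'s", "POS", True),
    ("car", "NN", False), (".", ".", False),
]
print(merge_apostrophe_tokens(rows))
# ["It's", "Jess'", 'n', "'", "Sam's", 'car', '.']
```

The n' case stays split because its apostrophe is tagged '' just like a closing quotation mark, which is exactly the ambiguity described above.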

Your Environment

spaCy version: 3.5.0
Platform: Linux-6.2.6-arch1-1-x86_64-with-glibc2.37
Python version: 3.10.10
Pipelines: en_core_web_lg (3.5.0)

Answered by adrianeboyd

adrianeboyd · 2023-04-20T06:36:10Z

For abbreviations like n' it might be better to have a rule-based exception. For the possessive vs. quote cases, this distinction is present in the annotation scheme for the training corpus, but it's likely that this is ambiguous enough and training examples are rare enough that the trained pipelines like en_core_web_lg are going to make a fair number of mistakes.
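
One possible form of a rule-based exception for n' is a tokenizer special case. The following is only a sketch under assumptions: it uses a blank English pipeline so that no trained model is needed, the helper name `tokenize_with_n_exception` is hypothetical, and the `NORM` value "and" is a guess at a sensible normalization rather than anything stated in this thread.

```python
import spacy
from spacy.symbols import NORM, ORTH


def tokenize_with_n_exception(text):
    """Tokenize text with a special case that keeps n' as one token (hypothetical helper)."""
    nlp = spacy.blank("en")  # tokenizer only; no tagger or parser
    # Keep the abbreviation n' as a single token, normalized to "and" (assumed norm).
    nlp.tokenizer.add_special_case("n'", [{ORTH: "n'", NORM: "and"}])
    return [t.text for t in nlp(text)]


print(tokenize_with_n_exception("It's Jess' n' Sam's car."))
```

The same `add_special_case` call can be made on the tokenizer of a trained pipeline, but how the tagger then labels the single n' token would still need to be checked on real data.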

Aside from the general recommendation that you can improve the performance by training or fine-tuning a model with more of these kinds of examples (#3052), I'd recommend looking at the dependency parse along with the POS tags to distinguish these cases and consider using en_core_web_trf, which at least for these cases seems to perform a bit better. Obviously you'd need to evaluate this carefully for your data.

For example (columns: text, lemma, POS, tag, dependency label):

en_core_web_lg

It       it       PRON     PRP      nsubj   
's       be       AUX      VBZ      ROOT    
Jess     Jess     PROPN    NNP      attr    
'        '        PART     POS      case    
n        n        CCONJ    CC       cc      
'        '        PUNCT    ''       punct   
Sam      Sam      PROPN    NNP      poss    
's       's       PART     POS      case    
car      car      NOUN     NN       attr    
.        .        PUNCT    .        punct   
No       no       INTJ     UH       intj    
,        ,        PUNCT    ,        punct   
it       it       PRON     PRP      nsubj   
's       be       AUX      VBZ      ROOT    
just     just     ADV      RB       advmod  
Jess     Jess     PROPN    NNP      attr    
'        '        PUNCT    ''       punct   
.        .        PUNCT    .        punct   
I        I        PRON     PRP      nsubj   
like     like     VERB     VBP      ROOT    
'        '        PUNCT    ``       punct   
apples   apple    NOUN     NNS      dobj    
'        '        PUNCT    ''       punct   
.        .        PUNCT    .        punct   

en_core_web_trf

It       it       PRON     PRP      nsubj   
's       be       AUX      VBZ      ROOT    
Jess     Jess     PROPN    NNP      poss    
'        '        PART     POS      case    
n        n        CCONJ    CC       cc      
'        '        CCONJ    CC       cc      
Sam      Sam      PROPN    NNP      conj    
's       's       PART     POS      case    
car      car      NOUN     NN       attr    
.        .        PUNCT    .        punct   
No       no       INTJ     UH       intj    
,        ,        PUNCT    ,        punct   
it       it       PRON     PRP      nsubj   
's       be       AUX      VBZ      ccomp   
just     just     ADV      RB       advmod  
Jess     Jess     PROPN    NNP      attr    
'        '        PART     POS      case    
.        .        PART     POS      case    
I        I        PRON     PRP      nsubj   
like     like     VERB     VBP      ROOT    
'        '        PUNCT    ``       punct   
apples   apple    NOUN     NNS      dobj    
'        '        PUNCT    .        punct   
.        .        PUNCT    .        punct
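
Reading the tag and dependency columns together could be sketched roughly as follows. This is an illustrative heuristic, not a spaCy feature: the function names `classify_apostrophe` and `classify_doc` and the category labels are hypothetical, and the rules are inferred only from the rows in the tables above.

```python
QUOTE_TAGS = {"``", "''"}  # Penn Treebank tags for opening and closing quotes


def classify_apostrophe(pos, tag, dep):
    """Guess the role of an apostrophe-initial token from its annotations."""
    if tag == "POS" and dep == "case":
        return "possessive"  # e.g. the 's in Sam's
    if pos in {"AUX", "VERB"}:
        return "contraction"  # e.g. the 's in It's
    if tag in QUOTE_TAGS and dep == "punct":
        return "quote"  # e.g. the marks around 'apples'
    return "unclear"  # e.g. the ' of n' tagged CC/cc by en_core_web_trf


def classify_doc(doc):
    """Apply the heuristic to every apostrophe-initial token in a spaCy Doc."""
    return [
        (t.i, t.text, classify_apostrophe(t.pos_, t.tag_, t.dep_))
        for t in doc
        if t.text.startswith(("'", "\u2019"))
    ]


print(classify_apostrophe("PART", "POS", "case"))  # possessive
print(classify_apostrophe("PUNCT", "''", "punct"))  # quote
```

A check like this cannot recover from tagger errors: the final Jess' in the en_core_web_lg output is annotated PUNCT / '' / punct, so it would still come out as a quote.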

0 replies

StEvUgnIn · 2023-10-05T13:06:29Z

Are English contractions still an issue with the latest version of spaCy (3.7.1)?

#12920

3 replies

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("I've tried to find it")
assert [t.lemma_ for t in doc] == ['I', 'have', 'try', 'to', 'find', 'it']

Replies to StEvUgnIn's comment: adrianeboyd (Oct 6, 2023), StEvUgnIn (Oct 16, 2023), adrianeboyd (Oct 17, 2023)