Informal contractions are not lemmatized properly #9985
## How to reproduce the behaviour

I'm seeking to parse sentences that have informal contractions like `gonna`, and I'd like them lemmatized properly.

### Out-of-the-box behavior

Out of the box, we have this:

```python
import spacy

nlp = spacy.load('en_core_web_sm')
sentence = "we're gonna have a great day"
print(f'For sentence {sentence}, out of the box, we have lemmas')
print([token.lemma_ for token in nlp(sentence)])
```

outputting lemmas that still contain `gon` and `na`. No good.

### Adding exceptions to the lemmatizer

I've looked into options for customizing the lemmatization process, and found a Stack Overflow post about adding rules to the `lemma_exc` lookup table using a snippet like this:

```python
nlp.get_pipe('lemmatizer').lookups.get_table("lemma_exc")["noun"]["data"] = ["data"]
```
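Before adding anything, it's worth checking which part-of-speech keys the table actually has (a quick sketch; as noted further below, there is no `part` key):

```python
# Quick check: which POS keys does the exceptions table ship with?
lemma_exc = nlp.get_pipe("lemmatizer").lookups.get_table("lemma_exc")
for pos in ("noun", "verb", "adj", "adv", "part"):
    print(pos, pos in lemma_exc)  # "part" is missing out of the box
```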
Adding exceptions this way works fine for ordinary words, so let's try it for our contractions:

```python
### Let's add custom exceptions for these words
exceptions = [("gon", "go"),    # `gonna`
              ("gim", "give"),  # `gimme`
              ("wan", "want"),  # `wanna`
              ]
lemmatizer = nlp.get_pipe('lemmatizer')
for slang, lemma in exceptions:
    lemmatizer.lookups.get_table("lemma_exc")['verb'][slang] = [lemma]
print('adding exceptions for informal contractions yields:')
print([token.lemma_ for token in nlp(sentence)])
```
Still, we see the wrong lemmas in the output for words like `wanna` and `gimme`; the new exceptions never get applied.

### Customize the lemmatizer

The trouble is that the default `is_base_form` check matches these slang pieces, so the lemmatizer treats them as already-lemmatized base forms and skips the exception lookup entirely. We can subclass the English lemmatizer and override only that check:

```python
from spacy.lang.en import English
from spacy.lang.en.lemmatizer import EnglishLemmatizer
from spacy.language import Language
from spacy.tokens import Token


@English.factory(
    "custom_english_lemmatizer",
    assigns=["token.lemma"],
    default_config={},
    default_score_weights={"lemma_acc": 1.0},
)
def make_lemmatizer(
    nlp: Language,
    name: str = 'custom_english_lemmatizer',
):
    # the Lemmatizer constructor takes the vocab and an (optional) model
    return CustomEnglishLemmatizer(nlp.vocab, None, name, mode='rule')


class CustomEnglishLemmatizer(EnglishLemmatizer):
    """
    In `en_core_web_sm`, words like "gonna" are getting lemmatized as "gon", "na".
    This custom lemmatizer allows us more control over the lemmatization process.
    Only overrides is_base_form.
    """

    def is_base_form(self, token: Token) -> bool:
        """
        Check whether we're dealing with an uninflected paradigm, so we can
        avoid lemmatization entirely.

        token (Token): The token to check, with its universal part-of-speech
            tag and morphological features.
        """
        # add additional check for slang words that aren't base forms
        # but do match the default base-form conditions,
        # words like "wanna", "gimme"
        if token.norm_ != token.text:
            return False
        return super().is_base_form(token)
```
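To see why comparing `token.norm_` with `token.text` flags these pieces, here's a quick sketch: the tokenizer splits contractions like `gonna` into pieces (`gon`, `na`) whose norms differ from their raw text.

```python
# Sketch: contraction pieces have a norm that differs from their text
for token in nlp("we're gonna have a great day"):
    print(f"{token.text!r:>8} norm={token.norm_!r}")
```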
Now running `nlp` as below outputs the proper lemma for the verb pieces:

```python
# Let's try customizing the lemmatizer
nlp = spacy.load('en_core_web_sm', exclude='lemmatizer')
custom_lemmatizer = nlp.add_pipe("custom_english_lemmatizer",
                                 name='lemmatizer',
                                 last=True)
custom_lemmatizer.initialize()
exceptions = [("gon", "go"),    # `gonna`
              ("gim", "give"),  # `gimme`
              ("wan", "want"),  # `wanna`
              ]
for slang, lemma in exceptions:
    # this table has entries for verb, noun, adjective, adverb,
    # but not part, which is what we need for -na in gonna, wanna
    custom_lemmatizer.lookups.get_table("lemma_exc")['verb'][slang] = [lemma]
print('with custom lemmatizer, we have: ')
print([token.lemma_ for token in nlp(sentence)])
```
Woohoo! We have the proper lemmatization for the verb halves of these contractions. But what about `na`? The `lemma_exc` table has no `part` key, so trying to add an exception under it raises a `KeyError`.
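For illustration, this is the kind of call that fails (a hypothetical reconstruction):

```python
# Raises KeyError: the lemma_exc table has no "part" entry to index into
custom_lemmatizer.lookups.get_table("lemma_exc")['part']['na'] = ['to']
```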
Replies: 1 comment 8 replies
So one thing is that while your level of detail here is helpful for understanding your objectives, it would be really helpful if you would provide a single piece of code we could copy and paste. To actually run your code I have to stick it together, add imports, and figure out the intended order.

Regarding the KeyError you pasted, it's a little hard to understand because of the transformation of the string key to a hash, but basically you're just trying to do something with a key that doesn't exist. The fix is easy:
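A minimal sketch of that kind of fix (assuming the same `lemma_exc` table as above): create the key before assigning into it.

```python
# The table has no "part" entry out of the box; create it first
lemma_exc = nlp.get_pipe("lemmatizer").lookups.get_table("lemma_exc")
lemma_exc["part"] = {"na": ["to"]}
```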
Here is a complete bit of code where the lemma for "na" is "to".
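For example, something along these lines (a sketch, assuming `na` is tagged as a particle so it's looked up under a `part` key):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# The shipped exceptions table only has noun/verb/adj/adv keys, so add a
# "part" entry with the exception for "na"
lemma_exc = nlp.get_pipe("lemmatizer").lookups.get_table("lemma_exc")
if "part" not in lemma_exc:
    lemma_exc["part"] = {}
lemma_exc["part"]["na"] = ["to"]

doc = nlp("we're gonna have a great day")
print([(token.text, token.lemma_) for token in doc])  # "na" -> "to"
```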