Informal contractions are not lemmatized properly #9985
## How to reproduce the behaviour

I'm seeking to parse sentences that have informal contractions like `gonna`, and I'd like them lemmatized properly.

### Out-of-the-box behavior

Out of the box, we have this:

```python
import spacy

nlp = spacy.load('en_core_web_sm')
sentence = "we're gonna have a great day"
print(f'For sentence {sentence}, out of the box, we have lemmas')
print([token.lemma_ for token in nlp(sentence)])
```

outputting lemmas that still contain `gon` and `na`. No good.

### Adding exceptions to the lemmatizer

I've looked into options for customizing the lemmatization process, and found a Stack Overflow post about adding rules to the `lemma_exc` lookup table using a snippet like this:

```python
nlp.get_pipe('lemmatizer').lookups.get_table("lemma_exc")["noun"]["data"] = ["data"]
```
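Before adding anything, it's worth checking which part-of-speech keys the table actually has (a quick sketch; as noted further below, there is no `part` key):

```python
# Quick check: which POS keys does the exceptions table ship with?
lemma_exc = nlp.get_pipe("lemmatizer").lookups.get_table("lemma_exc")
for pos in ("noun", "verb", "adj", "adv", "part"):
    print(pos, pos in lemma_exc)  # "part" is missing out of the box
```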
Adding exceptions this way works fine for ordinary words, so let's try it for our contractions:

```python
### Let's add custom exceptions for these words
exceptions = [("gon", "go"),    # `gonna`
              ("gim", "give"),  # `gimme`
              ("wan", "want"),  # `wanna`
              ]
lemmatizer = nlp.get_pipe('lemmatizer')
for slang, lemma in exceptions:
    lemmatizer.lookups.get_table("lemma_exc")['verb'][slang] = [lemma]
print('adding exceptions for informal contractions yields:')
print([token.lemma_ for token in nlp(sentence)])
```
Still, we see the wrong lemmas in the output for words like `wanna` and `gimme`; the new exceptions never get applied.

### Customize the lemmatizer

The trouble is that the default `is_base_form` check matches these slang pieces, so the lemmatizer treats them as already-lemmatized base forms and skips the exception lookup entirely. We can subclass the English lemmatizer and override only that check:

```python
from spacy.lang.en import English
from spacy.lang.en.lemmatizer import EnglishLemmatizer
from spacy.language import Language
from spacy.tokens import Token


@English.factory(
    "custom_english_lemmatizer",
    assigns=["token.lemma"],
    default_config={},
    default_score_weights={"lemma_acc": 1.0},
)
def make_lemmatizer(
    nlp: Language,
    name: str = 'custom_english_lemmatizer',
):
    # the Lemmatizer constructor takes the vocab and an (optional) model
    return CustomEnglishLemmatizer(nlp.vocab, None, name, mode='rule')


class CustomEnglishLemmatizer(EnglishLemmatizer):
    """
    In `en_core_web_sm`, words like "gonna" are getting lemmatized as "gon", "na".
    This custom lemmatizer allows us more control over the lemmatization process.
    Only overrides is_base_form.
    """

    def is_base_form(self, token: Token) -> bool:
        """
        Check whether we're dealing with an uninflected paradigm, so we can
        avoid lemmatization entirely.

        token (Token): The token to check, with its universal part-of-speech
            tag and morphological features.
        """
        # add additional check for slang words that aren't base forms
        # but do match the default base-form conditions,
        # words like "wanna", "gimme"
        if token.norm_ != token.text:
            return False
        return super().is_base_form(token)
```
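To see why comparing `token.norm_` with `token.text` flags these pieces, here's a quick sketch: the tokenizer splits contractions like `gonna` into pieces (`gon`, `na`) whose norms differ from their raw text.

```python
# Sketch: contraction pieces have a norm that differs from their text
for token in nlp("we're gonna have a great day"):
    print(f"{token.text!r:>8} norm={token.norm_!r}")
```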
Now running `nlp` as below outputs the proper lemma for the verb pieces:

```python
# Let's try customizing the lemmatizer
nlp = spacy.load('en_core_web_sm', exclude='lemmatizer')
custom_lemmatizer = nlp.add_pipe("custom_english_lemmatizer",
                                 name='lemmatizer',
                                 last=True)
custom_lemmatizer.initialize()
exceptions = [("gon", "go"),    # `gonna`
              ("gim", "give"),  # `gimme`
              ("wan", "want"),  # `wanna`
              ]
for slang, lemma in exceptions:
    # this table has entries for verb, noun, adjective, adverb,
    # but not part, which is what we need for -na in gonna, wanna
    custom_lemmatizer.lookups.get_table("lemma_exc")['verb'][slang] = [lemma]
print('with custom lemmatizer, we have: ')
print([token.lemma_ for token in nlp(sentence)])
```
Woohoo! We have the proper lemmatization for the verb halves of these contractions. But what about `na`? The `lemma_exc` table has no `part` key, so trying to add an exception under it raises a `KeyError`.
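For illustration, this is the kind of call that fails (a hypothetical reconstruction):

```python
# Raises KeyError: the lemma_exc table has no "part" entry to index into
custom_lemmatizer.lookups.get_table("lemma_exc")['part']['na'] = ['to']
```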
Replies: 1 comment 8 replies
So one thing is that while your level of detail here is helpful for understanding your objectives, it would be really helpful if you would provide a single piece of code we could copy and paste. To actually run your code I have to stick it together, add imports, and figure out the intended order.

Regarding the KeyError you pasted, it's a little hard to understand because of the transformation of the string key to a hash, but basically you're just trying to do something with a key that doesn't exist. The fix is easy:
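A minimal sketch of that kind of fix (assuming the same `lemma_exc` table as above): create the key before assigning into it.

```python
# The table has no "part" entry out of the box; create it first
lemma_exc = nlp.get_pipe("lemmatizer").lookups.get_table("lemma_exc")
lemma_exc["part"] = {"na": ["to"]}
```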
Here is a complete bit of code where the lemma for "na" is "to".
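For example, something along these lines (a sketch, assuming `na` is tagged as a particle so it's looked up under a `part` key):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# The shipped exceptions table only has noun/verb/adj/adv keys, so add a
# "part" entry with the exception for "na"
lemma_exc = nlp.get_pipe("lemmatizer").lookups.get_table("lemma_exc")
if "part" not in lemma_exc:
    lemma_exc["part"] = {}
lemma_exc["part"]["na"] = ["to"]

doc = nlp("we're gonna have a great day")
print([(token.text, token.lemma_) for token in doc])  # "na" -> "to"
```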