Adding a lemma for a new word and the concept of normalization/lemmatization in spaCy #12990
Unanswered
igormorgado asked this question in Help: Coding & Implementations
Replies: 2 comments · 2 replies
-
Got this partial solution:
But IMHO, the …
-
Great question! The issue you are running into is that the rule-based lemmatizer processes the lowercase orthographic forms (the tokens as they appear in the text), not the normalized forms. You can resolve this issue by adding an exception to the tokenizer. See this earlier discussion for an example:
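For illustration (a sketch, not necessarily the example from the linked discussion): one way to attach a lemma to a particular surface form is the attribute_ruler component, which sits before the rule-based lemmatizer in the packaged English pipelines, so a lemma it assigns is kept as long as the lemmatizer's overwrite setting stays at its default of False.

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")

# Tokenizer exception from the question: split "gimme" into "gim" + "me"
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

# Map the surface form "gim" to the lemma "give" with the attribute_ruler,
# which runs before the lemmatizer in this pipeline
ruler = nlp.get_pipe("attribute_ruler")
ruler.add(patterns=[[{"ORTH": "gim"}]], attrs={"LEMMA": "give"})

doc = nlp("gimme that")
print([(t.text, t.lemma_) for t in doc])
# "gim" should now come out with the lemma "give"
```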
-
Following the examples from the documentation regarding tokenization, I have the following code:
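(A sketch of this setup, modeled on the "gimme" special-case example in the spaCy docs; the NORM value "give" is an assumption:)

```python
import spacy
from spacy.symbols import ORTH, NORM

nlp = spacy.load("en_core_web_sm")

# Special case: split "gimme" into "gim" + "me" and give "gim"
# the normalized form "give" (the NORM value here is assumed)
special_case = [{ORTH: "gim", NORM: "give"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("gimme", special_case)

doc = nlp("gimme that")
```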
Then I check the tokenization:
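(Continuing the sketch above:)

```python
print([t.text for t in doc])
# ['gim', 'me', 'that']
```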
But the lemmatizer does not return the same output:
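(Again continuing the sketch; the point is that "gim" keeps the lemma "gim" instead of "give":)

```python
print([t.lemma_ for t in doc])
# "gim" is lemmatized as "gim", not "give"
```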
On the other hand:
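(Presumably the check of the normalized forms, continuing the sketch:)

```python
print([t.norm_ for t in doc])
# ['give', 'me', 'that'], given the assumed NORM value above
```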
What am I doing wrong? How do I "fix" it? In spaCy, what is the difference between normalized tokens and lemmatized tokens? How can I "teach" the lemmatization of a single token (such as the `gim` token in the example)?