Where are the lemmatizer rules defined? #9632
-
I found this file in spacy-lookups-data, which defines the rules for nouns in Norwegian as:
But when I test the lemmatizer, it does not seem to follow these rules:
import spacy
nlp = spacy.load('nb_core_news_lg')
lemmatizer = nlp.get_pipe('lemmatizer')
ex = 'læret på sofaen er ødelagt'
doc = nlp(ex)
token = doc[0]
# Apply the rule-based lemmatizer directly to the first token ("læret")
rule_lemma = lemmatizer.rule_lemmatize(token)
print(rule_lemma)
print(token.text, token.lemma_, token.pos, token.pos_)
Outputs:
It recognises the word as a noun, but it still does not lemmatize it to "lær", which I would expect based on these rules. So the rules above don't seem to define how the lemmatizer works. Where are the rules for the lemmatizer defined? Or is this a bug, and the above rules are supposed to be applied?
And a slightly separate question: if I find mistakes in the lemmatizer, is there a way for me to fix them? E.g. make exceptions for certain words, or have separate lemmatizations for certain words depending on whether they are used as a verb or as a noun. For example, "lærer" in Norwegian can mean either "teacher" or "learns", and should be lemmatized to "lærer" or "lære" respectively, depending on whether it was used as a noun or a verb. Even if the rules defined above did work, "lærer" would be lemmatized to "lære" even when it was used as a noun, which is incorrect.
Replies: 1 comment 1 reply
-
The lemmatizer also includes a table of exceptions that has precedence over the rules:
You can customize any of the tables in lemmatizer.lookups.tables to change the lemmatizer behavior, for example:
lemmatizer.lookups.get_table("lemma_exc")["noun"]["læret"] = ["whatever"]
If you save the model with nlp.to_disk(), your changes will be preserved.
Be aware that there's a lemmatizer cache, so you might not see the changes until you save and reload the model, or manually wipe out the cache:
lemmatizer.cache = {}
If you want these changes in a new model you're training from scratch, you'd want to have a custom install of spacy-…
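To make the lookup order concrete, here is a minimal toy sketch in plain Python of how a POS-keyed exception table can take precedence over suffix rules, and why a cache can hide a later edit to the tables. This is not spaCy's actual implementation: the names lemma_exc, lemma_rules, cache and lemmatize, and the suffix rules themselves, are hypothetical and chosen only to mirror the "læret" and "lærer" examples from the question.

```python
# Toy model only: every name and table entry here is hypothetical,
# not spaCy internals and not the real Norwegian lookup data.
from typing import Dict, List, Tuple

# POS -> word -> lemmas. Checked before any suffix rule is tried.
lemma_exc: Dict[str, Dict[str, List[str]]] = {
    "noun": {"lærer": ["lærer"]},  # "teacher" keeps its form as a noun
}

# POS -> (suffix, replacement) pairs, tried in order; first match wins.
lemma_rules: Dict[str, List[Tuple[str, str]]] = {
    "noun": [("et", ""), ("er", "e")],
    "verb": [("er", "e")],
}

# Results are memoised per (word, POS) pair.
cache: Dict[Tuple[str, str], List[str]] = {}


def lemmatize(word: str, pos: str) -> List[str]:
    """Return lemmas for word: cache first, then exceptions, then rules."""
    key = (word, pos)
    if key in cache:
        return cache[key]
    exceptions = lemma_exc.get(pos, {})
    if word in exceptions:
        lemmas = list(exceptions[word])
    else:
        lemmas = []
        for suffix, replacement in lemma_rules.get(pos, []):
            if word.endswith(suffix):
                lemmas.append(word[: len(word) - len(suffix)] + replacement)
                break
        if not lemmas:
            lemmas = [word]  # no rule matched: fall back to the surface form
    cache[key] = lemmas
    return lemmas


print(lemmatize("lærer", "noun"))  # exception wins: ['lærer']
print(lemmatize("lærer", "verb"))  # rule applies: ['lære']
print(lemmatize("læret", "noun"))  # rule applies: ['lær']

# Editing a table after a lookup has no visible effect until the cache is cleared.
lemma_exc["noun"]["læret"] = ["whatever"]
print(lemmatize("læret", "noun"))  # still ['lær'], served from the cache
cache.clear()
print(lemmatize("læret", "noun"))  # ['whatever']
```

Keying both the exceptions and the cache by part of speech is what lets the same surface form resolve differently as a noun and as a verb, which is the behaviour asked about for "lærer".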