Testing Polish models #5859
Replies: 5 comments
-
Hi, the Polish lemmatizer is a lookup lemmatizer with POS-based tables. The lemmatizer itself is here (there's some lowercasing and handling of frequent affixes): https://github.com/explosion/spaCy/blob/ac14ce7c30c2f6da4a71dee7978f5b765af4d966/spacy/lang/pl/lemmatizer.py
The current data is in the POS-specific lookup tables. The POS tags come from a statistical model, so an incorrect POS tag may lead to an incorrect lookup, or the lookup table itself may contain incorrect entries.
It looks like the lookup table entries are the source of these results, but I'm afraid I don't know enough Polish to tell whether they are outright mistakes or possibly acceptable results in another context or with another meaning. Since the lemmatizer doesn't take context into account, it may not be possible to modify it to get the right results in all cases.
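To make the failure mode concrete, here is a minimal sketch of a POS-keyed lookup lemmatizer. It is not spaCy's implementation: `LOOKUP_TABLES` and `lookup_lemma` are hypothetical names, the table holds only entries mentioned in this thread, and falling back to the lowercased word is an assumption.

```python
from typing import Dict

# Hypothetical, hand-made tables for illustration only.
# The real tables are far larger and are not reproduced here.
LOOKUP_TABLES: Dict[str, Dict[str, str]] = {
    "NOUN": {
        "bandzie": "bando",  # the entry reported in this thread
        "kapela": "kapela",
    },
}


def lookup_lemma(word: str, pos: str) -> str:
    """Return the lemma stored for (pos, lowercased word).

    Assumption: unknown words fall back to their lowercased form.
    """
    table = LOOKUP_TABLES.get(pos, {})
    key = word.lower()
    return table.get(key, key)


# The surrounding sentence is never consulted, so the same form
# always receives the same lemma for a given POS tag.
noun_lemma = lookup_lemma("bandzie", "NOUN")

# A wrong POS tag selects a different table and misses the entry.
mistagged_lemma = lookup_lemma("bandzie", "VERB")
```

In this sketch the only inputs are the word and its POS tag, which is why a single table entry decides the result in every sentence.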
-
Below are Polish nouns whose assigned lemma is wrong, or is confused with the lemma of a different, rare word.
Among others:
-
I know that the original data came from this resource: http://morfeusz.sgjp.pl/download/ in particular the "Słownik (SGJP) (dane tekstowe)" dictionary. There was some further processing to split by POS and handle frequent prefixes like "nie". The examples above look like ambiguous cases in this table that can't be disambiguated by POS alone.
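As a rough illustration of what "can't be disambiguated by POS alone" means, the sketch below groups (form, lemma, POS) triples and reports forms that map to more than one lemma under the same POS. The triples in `SAMPLE` are hypothetical, built from the example in this thread, and do not reflect the actual SGJP file format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Entry = Tuple[str, str, str]  # (inflected form, lemma, POS)


def find_ambiguous(entries: Iterable[Entry]) -> Dict[Tuple[str, str], Set[str]]:
    """Return the (form, POS) keys that have more than one candidate lemma."""
    lemmas_by_key: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for form, lemma, pos in entries:
        lemmas_by_key[(form.lower(), pos)].add(lemma)
    return {key: lemmas for key, lemmas in lemmas_by_key.items() if len(lemmas) > 1}


# Hypothetical sample: the same noun form listed under two lemmas.
SAMPLE = [
    ("bandzie", "banda", "NOUN"),
    ("bandzie", "bando", "NOUN"),
    ("kapela", "kapela", "NOUN"),
]

ambiguous = find_ambiguous(SAMPLE)
```

A flat lookup table can keep only one lemma per (form, POS) key, so for every key reported here one of the candidates has to be dropped when the table is built.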
-
The morfeusz.sgjp.pl resource contains a large number of very rare lemmas.
-
I've found one more similar issue. Word
-
I have begun testing the Polish models and have found a bad word->lemma mapping in pl_core_news_sm/md/lg.
In the sentence "Uległ bandzie łotrów", the lemma of "bandzie" is "bando" instead of "banda".
Second: "Kapela" -> "Kapel" instead of "kapela" in "Kapela miała przerwę i spytałem czy mogę cię odprowadzić"
(whereas lowercase "kapela" is OK).
(But I think 100% correctness is not possible because the models are statistical?)
Also, I think opening an issue for each single bad lemma would be inappropriate; these are only samples.
How to reproduce the behaviour
import spacy
nlpPL = spacy.load("pl_core_news_md")
doc = nlpPL("Uległ bandzie łotrów")
for token in doc:
    print(token.lemma_)
Your Environment