Testing Polish models #5859

andr1972 · 2020-08-02T20:19:32Z

I have started testing the Polish models and found incorrect word -> lemma mappings in pl_core_news_sm/md/lg.
First: in the sentence "Uległ bandzie łotrów", the lemma of "bandzie" is "bando" instead of "banda".
Second: "Kapela" -> "Kapel" instead of "kapela" in "Kapela miała przerwę i spytałem czy mogę cię odprowadzić"
(whereas lowercase "kapela" is lemmatized correctly).
(I suppose 100% correctness is not possible with statistically trained models?)
I also think that opening an issue for each single bad lemma would be inappropriate; these are only samples.

How to reproduce the behaviour

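import spacy

# Load the medium Polish model (sm and lg show the same lemma)
nlpPL = spacy.load("pl_core_news_md")
doc = nlpPL("Uległ bandzie łotrów")
for token in doc:
    # "bandzie" is printed as "bando"; the expected lemma is "banda"
    print(token.lemma_)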

Your Environment

Operating System: Linux Mint 20
Python Version Used: 3.8

adrianeboyd · 2020-08-03T06:38:13Z

Hi, the Polish lemmatizer is a lookup lemmatizer with POS-based tables. The lemmatizer itself is here (there's some lowercasing and handling of frequent affixes): https://github.com/explosion/spaCy/blob/ac14ce7c30c2f6da4a71dee7978f5b765af4d966/spacy/lang/pl/lemmatizer.py

You can see the current data here in the tables that start with pl_: https://github.com/explosion/spacy-lookups-data/tree/24da7d9c42432cefe26261a2b2ebc8ba259ed915/spacy_lookups_data/data

The POS tags are from a statistical model, so it's possible that an incorrect POS tag is leading to an incorrect lookup, or that the lookup table itself contains some incorrect entries. It looks like the lookup table entries are the source of these results, but I'm afraid I don't know enough Polish to know whether they are 100% mistakes or possibly an acceptable result in another context or with another meaning. Since the lemmatizer doesn't take context into account, it may not be possible to modify it to get the right results in all cases.
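
To make the point about context concrete, here is a minimal sketch of a POS-keyed lookup. It is not the spaCy implementation: the table contents, `TOY_TABLES`, and `lookup_lemma` are made up for illustration, and lowercasing every form is a simplification. It only shows that once a form has a single entry in a POS table, the surrounding sentence cannot change the result.

```python
# Toy POS-keyed lookup tables. The entries are illustrative assumptions,
# not copied from spacy-lookups-data.
TOY_TABLES = {
    "noun": {"bandzie": "bando", "łotrów": "łotr"},
    "verb": {"uległ": "ulec"},
}


def lookup_lemma(form, pos):
    """Return the entry for the lowercased form in the table for this POS.

    Falls back to the lowercased form when there is no table or no entry.
    """
    table = TOY_TABLES.get(pos.lower(), {})
    key = form.lower()
    return table.get(key, key)


# The POS tags would normally come from the statistical tagger.
tagged = [("Uległ", "VERB"), ("bandzie", "NOUN"), ("łotrów", "NOUN")]
lemmas = [lookup_lemma(form, pos) for form, pos in tagged]
print(lemmas)
```

With these toy tables, "bandzie" comes out as "bando" in every sentence, because the only inputs to the lookup are the form and its POS tag.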

andr1972 · 2020-08-04T11:50:42Z

The script below lists Polish nouns whose lookup-table lemma is not a real lemma, or is confused with the lemma of another, rare word.
Steps to reproduce:

1. First download the Polish wordnet with nltk.download().
2. Run this code:

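import json
from nltk.corpus import wordnet as wn

# Noun lookup table (word form -> lemma), downloaded locally
with open('/home/andrzej/Downloads/pl_lemma_lookup_noun.json') as json_file:
    data = json.load(json_file)

cnt = 0
for item, val in data.items():
    # Report forms known to the Polish wordnet whose lowercase lemma is
    # unknown to it, although that lemma is itself a key of the table.
    if val[0].islower() and wn.synsets(item, lang='pol') and not wn.synsets(val, lang='pol') \
            and val in data:
        cnt += 1
        print(cnt, item, val)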

Among others, the output includes:
1 akademik akademika
5 antypody antypoda
12 band bando
15 bastard bastarda
42 czarnuch czarnucha
45 delfin delfina
146 kot kota
157 kuzyn kuzyna

adrianeboyd · 2020-08-04T14:12:59Z

I know that the original data came from this resource: http://morfeusz.sgjp.pl/download/ in particular the "Słownik (SGJP) (dane tekstowe)" dictionary. There was some further processing to split by POS and handle frequent prefixes like "nie". The examples above look like ambiguous cases in this table that can't be disambiguated by POS alone.

andr1972 · 2020-08-06T09:20:02Z

The morfeusz.sgjp.pl resource contains a large number of very rare lemmas.
I can believe that "czarnucha" is a feminine form of "czarnuch", but it is hard to believe that a feminine form "kuzyn" -> "kuzyna" exists (the common form is "kuzynka"); the same goes for "delfin" -> "delfina".
Maybe a solution would be to mark lemmas that exist in wordnet as common and the others as rare. spaCy would then prefer common lemmas over rare ones. This would require changes in a future version of spaCy.
Another problem is that a common noun at the beginning of a sentence is confused with a proper noun: "Kapela miała przerwę"
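
The "prefer common lemmas" idea could be sketched roughly as follows. Everything here is hypothetical: `pick_lemma`, the candidate lists, and the `COMMON` set (which stands in for "lemmas found in wordnet") are invented for illustration, and the current lookup tables store only one lemma per form, so the data would have to keep all candidates first.

```python
def pick_lemma(form, candidates, common_lemmas):
    """Pick a lemma for form, preferring candidates marked as common.

    candidates maps a form to a list of possible lemmas. If none of them is
    common, the first candidate is used; if the form is unknown, the form
    itself is returned.
    """
    options = candidates.get(form, [])
    for lemma in options:
        if lemma in common_lemmas:
            return lemma
    return options[0] if options else form


# Hypothetical candidate lists; the rare lemma is listed first on purpose.
CANDIDATES = {
    "kot": ["kota", "kot"],
    "kuzyn": ["kuzyna", "kuzyn"],
    "delfin": ["delfina", "delfin"],
}
# Stand-in for "lemmas that exist in wordnet".
COMMON = {"kot", "kuzyn", "delfin"}

for form in ("kot", "kuzyn", "delfin"):
    print(form, "->", pick_lemma(form, CANDIDATES, COMMON))
```

This only reorders candidates; it does not use sentence context, so truly ambiguous forms would still need something more than a lookup.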

mpsota · 2020-09-04T09:30:50Z

I've found one more similar issue. The word "nieduże" (little) gets the incorrect lemma "duży" (large): the "nie" at the beginning is missing. Interestingly, "nieduży" is correctly lemmatized as "nieduży".
The same issue occurs with "niedobre" and "niemożliwe" (and when you replace the final "e" with "y", the lemma is correct again).
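
One way this pattern could arise, offered as a guess rather than a reading of the actual lemmatizer code: if the "nie" prefix is stripped before the table lookup and the table only lists inflected forms, the two cases behave differently. The table and the function `lemmatize_adj_sketch` below are assumptions made up to illustrate that mechanism.

```python
# Toy adjective table. Assumption: it lists inflected forms but not the
# base forms themselves.
TOY_ADJ_TABLE = {"duże": "duży", "dobre": "dobry"}


def lemmatize_adj_sketch(form, table):
    """Look up an adjective, retrying without a leading "nie" on a miss."""
    form = form.lower()
    if form in table:
        return table[form]
    if form.startswith("nie") and form[3:] in table:
        # The prefix is dropped here, so the negation is lost in the lemma.
        return table[form[3:]]
    return form


for word in ("nieduże", "nieduży", "niedobre"):
    print(word, "->", lemmatize_adj_sketch(word, TOY_ADJ_TABLE))
```

Under these assumptions "nieduże" loses its prefix because "duże" is in the table, while "nieduży" falls through unchanged because "duży" is not.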
