Updating spaCy model for German on automatically generated sentences #10047
-
Hi, in order to train/update the spaCy NER model for German on a list of local (South Tyrol, Italy) placenames, I constructed a series of sentences that follow the pattern "[Name] is a river", "[Name] is a mountain", etc. Here is some example data from our list of placenames:
Here are some examples of the sentences generated automatically for spaCy training:
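Roughly, the generation and annotation step looked along these lines (a minimal sketch; the placenames below are stand-ins rather than our actual list, and the full code is in the Colab notebook linked below):

```python
# Sketch of turning template sentences into spaCy training data
# (stand-in placenames, not our real list).
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("de")

placenames = ["Ahrntal", "Seiser Alm"]                   # stand-in names
templates = ["{} ist ein Fluss.", "{} ist ein Berg."]    # "[Name] is a river/mountain"

doc_bin = DocBin()
for name in placenames:
    for template in templates:
        text = template.format(name)
        doc = nlp.make_doc(text)
        # the placename always sits at the start of these template sentences
        span = doc.char_span(0, len(name), label="LOC")
        if span is not None:
            doc.ents = [span]
            doc_bin.add(doc)

doc_bin.to_disk("./train.spacy")
```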
However, once the model was trained and tested on a text, it sometimes extracted full sentences or random groups of words instead of named entities. For example:
Another test was made with slightly more elaborate sentences (see the following examples). The same sentences were repeated for geographical names with the same number of words (one-word and two-word names being the most common, so sentences containing them were the most frequent).
In this case, the output seemed to improve (shorter fragments of sentences were extracted, as opposed to whole ones), although the NER results seemed worse compared with the original spaCy model.
What am I doing wrong, and how should I fix it?
My Code (Colab)
My Environment
-
Please share your config so we can see what settings you're using.

As a general issue, while creating silver data using templates is an option, it seems like you may have overly limited variation. Do your sentences vary in structure, or do you just have the one sentence with a blank spot for a place name in it? If the latter, I wouldn't expect good performance. (Your initial templates, like "X is a mountain", are just not enough to be useful.)

I would really encourage you to get real training data, even if only a few hundred examples. If you can't, then what you can do is take real data about other places (say, from Wikipedia), run the pretrained model on it, and use those sentences as training data. Before using them, modify them to replace the entities in the text with your entities; even if the sentences are factually wrong ("Berlin is the largest city in Ohio"), they should be structurally good enough to improve on what you have now.
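As a rough sketch of that entity-swapping step (the pipeline name, placenames, and example sentences below are only placeholders):

```python
# Hypothetical sketch: swap LOC entities predicted by a pretrained German
# pipeline for your own placenames, and keep the new character offsets as
# the training annotations.
import random
import spacy
from spacy.tokens import DocBin

nlp = spacy.load("de_core_news_md")        # any pretrained German pipeline
my_places = ["Ahrntal", "Seiser Alm"]      # placeholder placenames

wiki_sentences = [                         # placeholder "real" sentences
    "Berlin ist die größte Stadt Deutschlands.",
    "Der Rhein fließt durch mehrere Bundesländer.",
]

doc_bin = DocBin()
for sent in wiki_sentences:
    doc = nlp(sent)
    new_text = sent
    spans = []      # (start, end) character offsets of the swapped-in names
    offset = 0      # cumulative length change from earlier replacements
    for ent in doc.ents:
        if ent.label_ != "LOC":
            continue
        replacement = random.choice(my_places)
        start = ent.start_char + offset
        new_text = new_text[:start] + replacement + new_text[start + len(ent.text):]
        spans.append((start, start + len(replacement)))
        offset += len(replacement) - len(ent.text)
    new_doc = nlp.make_doc(new_text)
    ents = [new_doc.char_span(s, e, label="LOC") for s, e in spans]
    new_doc.ents = [s for s in ents if s is not None]
    doc_bin.add(new_doc)

doc_bin.to_disk("./silver_train.spacy")
```

Spans that don't line up with token boundaries after re-tokenization are simply dropped, which is usually fine for silver data.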