Updating spaCy model for German on automatically generated sentences #10047
-
Hi, in order to train/update the spaCy NER model for German on a list of local (South Tyrol, Italy) placenames, I constructed a series of sentences that follow the pattern "[Name] is a river", "[Name] is a mountain", etc. Here is some example data from our list of placenames:
Here are some examples of the sentences generated automatically for spaCy training:
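Roughly, the generation and annotation step looked along these lines (a minimal sketch; the placenames below are stand-ins rather than our actual list, and the full code is in the Colab notebook linked below):

```python
# Sketch of turning template sentences into spaCy training data
# (stand-in placenames, not our real list).
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("de")

placenames = ["Ahrntal", "Seiser Alm"]                   # stand-in names
templates = ["{} ist ein Fluss.", "{} ist ein Berg."]    # "[Name] is a river/mountain"

doc_bin = DocBin()
for name in placenames:
    for template in templates:
        text = template.format(name)
        doc = nlp.make_doc(text)
        # the placename always sits at the start of these template sentences
        span = doc.char_span(0, len(name), label="LOC")
        if span is not None:
            doc.ents = [span]
            doc_bin.add(doc)

doc_bin.to_disk("./train.spacy")
```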
However, once the model was trained and tested on a text, it sometimes extracted full sentences or random groups of words instead of named entities. For example:
Another test was made with slightly more elaborate sentences (see the following examples). The same sentences were repeated for geographical names with the same number of words (one-word and two-word names being the most common, so sentences containing them were the most frequent).
In this case, the output seemed to improve (shorter fragments of sentences were extracted, as opposed to whole ones), although the NER results seemed worse compared with the original spaCy model.
What am I doing wrong, and how should I fix it?
My Code (Colab)
My Environment
-
Please share your config so we can see what settings you're using.

As a general issue, while creating silver data using templates is an option, it seems like you may have overly limited variation. Do your sentences vary in structure, or do you just have the one sentence with a blank spot for a place name in it? If the latter, I wouldn't expect good performance. (Your initial templates, like "X is a mountain", are just not enough to be useful.)

I would really encourage you to get real training data, even if only a few hundred examples. If you can't, then what you can do is take real data about other places (say, from Wikipedia), run the pretrained model on it, and use those sentences as training data. Before using them, modify them to replace the entities in the text with your entities; even if the sentences are factually wrong ("Berlin is the largest city in Ohio"), they should be structurally good enough to improve on what you have now.
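As a rough sketch of that entity-swapping step (the pipeline name, placenames, and example sentences below are only placeholders):

```python
# Hypothetical sketch: swap LOC entities predicted by a pretrained German
# pipeline for your own placenames, and keep the new character offsets as
# the training annotations.
import random
import spacy
from spacy.tokens import DocBin

nlp = spacy.load("de_core_news_md")        # any pretrained German pipeline
my_places = ["Ahrntal", "Seiser Alm"]      # placeholder placenames

wiki_sentences = [                         # placeholder "real" sentences
    "Berlin ist die größte Stadt Deutschlands.",
    "Der Rhein fließt durch mehrere Bundesländer.",
]

doc_bin = DocBin()
for sent in wiki_sentences:
    doc = nlp(sent)
    new_text = sent
    spans = []      # (start, end) character offsets of the swapped-in names
    offset = 0      # cumulative length change from earlier replacements
    for ent in doc.ents:
        if ent.label_ != "LOC":
            continue
        replacement = random.choice(my_places)
        start = ent.start_char + offset
        new_text = new_text[:start] + replacement + new_text[start + len(ent.text):]
        spans.append((start, start + len(replacement)))
        offset += len(replacement) - len(ent.text)
    new_doc = nlp.make_doc(new_text)
    ents = [new_doc.char_span(s, e, label="LOC") for s, e in spans]
    new_doc.ents = [s for s in ents if s is not None]
    doc_bin.add(new_doc)

doc_bin.to_disk("./silver_train.spacy")
```

Spans that don't line up with token boundaries after re-tokenization are simply dropped, which is usually fine for silver data.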