The model is not learning #10724
-
I have a question. I understand that the more text a model is trained on, the better it can learn the labels. Take es_core_news_lg as a practical example: it cannot possibly have seen every PERSON entity, and yet when you give it a new text it can still identify one. For example, in "Maria is beautiful", Maria is identified as PERSON. Now, I have a dataset with which I train my own model (holding out part of the data so the model doesn't just overfit). When I test on the held-out set, the model is unable to identify the labels I trained it with; it is not learning, and I don't know why this happens. If you can help me understand this a bit, since I'm very new to the subject, thank you!
-
NER models broadly rely on two kinds of features: token features and context features.

Token features are details of the labeled tokens themselves. This can be the whole token, like how "John" is likely to be a name and "the" isn't, or details of its form: capitalized words are more likely to be names, and words ending in "-son" are likely to be names. Context features are drawn from the surrounding words. So a word after "Mr" or "Miss" is likely to be a name, a word after "the" is not very likely to be one, "my name is" is a big hint, and so on.

Exactly which features are considered varies by model architecture and configuration; how much weight each feature has, and how those weights interact, is what the model learns. For more details on this I recommend the NER chapter (presently chapter 8) in the Jurafsky and Martin book. (Note that while the book is usually very accessible, that's a pretty dense chapter, so I would recommend skimming it for the parts you're interested in.)

If your model isn't working on other data, that data may be too different from your training data. If the text you're testing on has different tokens and different contexts from your training data, it may be something the model has never seen before, and it may be unable to make a prediction, in which case it defaults to predicting nothing. If you give more detail about the data issues you're having we may be able to help, but note that domain adaptation is just a hard problem in general, and usually the answer is simply that you need more training data.
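To make the two feature types concrete, here's a toy sketch in Python. This is not spaCy's actual feature set or implementation, just an illustration of the kinds of signals a statistical NER model might draw on for each word:

```python
# Toy illustration of token features vs. context features for NER.
# NOT spaCy's real feature extraction; a minimal sketch of the idea.

def extract_features(tokens, i):
    """Return illustrative features for tokens[i]."""
    token = tokens[i]
    return {
        # Token features: properties of the word itself
        "lower": token.lower(),
        "is_capitalized": token[:1].isupper(),
        "suffix": token[-3:],
        # Context features: properties of the neighboring words
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }

tokens = "My name is Maria".split()
print(extract_features(tokens, 3))
```

The model learns weights over features like these. If your test text shares almost none of these features with your training data (different vocabulary, different surrounding phrases), the model has very little evidence to base a prediction on, which is one way domain mismatch shows up.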