Is iterating through a doc a bad way to manually set my labels? #9366

Vfgandara · 2021-10-04T19:19:09Z

I couldn't formulate the title in a better way, I'm really sorry.

Basically, I've been generating all the training data for my NER model with a Python script that uses doc.set_ents(List[Span]) to set each doc's entities. The script iterates through every doc in a list of docs, tries to match each of the doc's tokens against some regex/logical rules I've created, and appends the matched Spans to the list that gets passed to doc.set_ents().
After a lot of trial and error I've been able to generate enough data to be able to train a model that I'm really happy with (although some of the data still raises errors when calling doc.set_ents()). But one thing that makes me unsure of what I'm doing is that I'm iterating over a list of docs, so they've already been passed through a model. I did that so that I could use spaCy's doc slicing syntax instead of Python's default way of slicing strings (character by character), and also to make sure that my Spans would always line up with the tokens when passed to my model during training.

The thing is, is there a problem with working like that? All the tutorials I have seen (mostly from spaCy 2.x) used doc.char_span to create the spans, or passed the annotations when creating the docs so that they would already have the correct labels. Are there reasons not to do it the way I'm doing it? I'm mostly worried that my docs will carry too many annotations into my training data (I haven't removed any pipeline component from the base model, "pt_core_news_lg", which I'm using to generate the training docs) and that this may be unnecessary or straight up damaging. On the other hand, I don't know whether leaving them out could also be bad, since my model is using the annotations from all the previous pipes (which are frozen btw) to train the NER pipe (which is the only pipe that's not frozen and is also the last pipe).

I'm still trying to get the hang of things in spaCy, so I'm sorry if the questions don't make that much sense, or if I made a lot of assumptions that don't make sense at all, or misused the technical lingo.

Answered by ljvmiranda921

ljvmiranda921 · 2021-10-05T05:18:01Z

Hi @Vfgandara ,

...After a lot of trial and error I've been able to generate enough data to be able to train a model that I'm really happy with (although some of the data still raises errors when calling doc.set_ents())

I'm curious as to what kind of errors you're getting when calling doc.set_ents(). Just be careful as it sounds like there are overlapping spans in the dataset/annotations. If you can paste a traceback, I'd appreciate that!

...The thing is, is there a problem with working like that?

For set_ents() vs. char_span, there shouldn't be any effect on accuracy, if that's what you meant by damaging. Although if you're just using the tokenizer, I'd suggest starting off from a blank model to get some speed. You can also check out the Sentencizer and see if that works for you.
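
To make the set_ents() vs. char_span comparison concrete, here is a minimal sketch that labels the same text both ways on a blank pipeline. The "en" language code, the example sentence, the ORDER_ID label and the five-digit regex are all made up for illustration; a Portuguese project would presumably use spacy.blank("pt") and its own rules.

```python
import re

import spacy
from spacy.tokens import Span

# A blank pipeline only tokenizes, which is all a rule-based labeling script needs.
nlp = spacy.blank("en")
TEXT = "Order 12345 was shipped to Lisbon."
ORDER_ID = re.compile(r"\d{5}")  # hypothetical labeling rule

# Token-based route: test each token against the rule and build Spans
# from token indices, then hand the whole list to doc.set_ents().
doc_tokens = nlp(TEXT)
token_spans = [
    Span(doc_tokens, token.i, token.i + 1, label="ORDER_ID")
    for token in doc_tokens
    if ORDER_ID.fullmatch(token.text)
]
doc_tokens.set_ents(token_spans)

# Character-based route: match on the raw string and convert the character
# offsets with doc.char_span(), which returns None if the offsets do not
# fall on token boundaries.
doc_chars = nlp(TEXT)
char_spans = []
for match in ORDER_ID.finditer(doc_chars.text):
    span = doc_chars.char_span(match.start(), match.end(), label="ORDER_ID")
    if span is not None:
        char_spans.append(span)
doc_chars.set_ents(char_spans)

print([(ent.text, ent.label_) for ent in doc_tokens.ents])
print([(ent.text, ent.label_) for ent in doc_chars.ents])
```

Both docs end up with the same entity, so the choice is mostly about which offsets are easier to produce in the labeling script.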

...since my model is using the annotations from all the previous pipes (which are frozen btw)

Just a small correction, when pipes are frozen, they're not affecting training (nor being affected by it). You can check more in the training docs.

5 replies

svlandeg Oct 6, 2021

Just a small note on top of what Lj has already explained:

when pipes are frozen, they're not affecting training (nor being affected by it)

Frozen pipes might still set predictions in the training loop (cf https://spacy.io/usage/training#annotating-components) without being trained themselves.
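
For reference, the setting described on that page lives in the [training] block of the config. The fragment below is only a sketch: the component names are assumptions (nlp.pipe_names lists the real ones for a given pipeline), and it is parsed with thinc's Config purely to check that the fragment is well-formed.

```python
from thinc.api import Config

# Hypothetical fragment of a training config.
# frozen_components: components that are not updated during training.
# annotating_components: components whose predictions are written to the
# Doc during training, so that later components can see them.
TRAINING_FRAGMENT = """
[training]
frozen_components = ["morphologizer", "parser"]
annotating_components = ["morphologizer", "parser"]
"""

config = Config().from_str(TRAINING_FRAGMENT)
print(config["training"]["frozen_components"])
print(config["training"]["annotating_components"])
```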

Vfgandara Oct 6, 2021
Author

Thanks for the answer, it helped a lot!

I checked the tracebacks again and the only error is [E1010] Unable to set entity information for token 21 which is included in more than one span in entities, blocked, missing or outside. On top of that, it happens in only one data instance. That's really not a problem, since it comes from my labeling script's inability to deal with these instances of really troublesome data (which I'll end up just removing anyway, since the data is really bad quality).
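
For anyone landing here with the same E1010 error, a minimal sketch of how it arises and one possible way around it other than dropping the example: spacy.util.filter_spans() keeps the longest span from each group of overlapping spans. The sentence and labels below are invented for illustration.

```python
import spacy
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.blank("en")
doc = nlp("Bank of America opened a branch in New York City")

# Two of these candidate spans share the token "America" (token 2).
candidates = [
    Span(doc, 0, 3, label="ORG"),   # "Bank of America"
    Span(doc, 2, 3, label="GPE"),   # "America", overlaps the span above
    Span(doc, 7, 10, label="GPE"),  # "New York City"
]

overlap_error = None
try:
    doc.set_ents(candidates)
except ValueError as err:
    # set_ents() refuses spans that share a token and raises E1010.
    overlap_error = str(err)
print(overlap_error)

# filter_spans() drops the shorter of two overlapping spans,
# so the remaining list can be passed to set_ents() safely.
doc.set_ents(filter_spans(candidates))
print([(ent.text, ent.label_) for ent in doc.ents])
```

Whether keeping the longest span is the right rule depends on the data, so this is a starting point rather than a fix for every case.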

About starting off with a blank model. I have some doubts/things I still haven't understood about the training process in general, and I think they kind of add up to this "blank model" question, so I'll try to understand them first:

If I have frozen some components in my pipeline (as @svlandeg mentioned) they'll still generate predictions for the training loop, but how do I pass these annotations/predictions forward so that my next pipes can use them? Is it by using the annotating_components part of the config file? And on a more 'under the hood' level, does that mean these annotations will get fed into the next prediction model as the pipeline goes forward, to help it get better at predicting the next components? If so, if only my NER pipe is unfrozen, it means that only this pipe will be trained, but it will still have access to the annotations from the previous pipes and will be able to use them to get better predictions, right?
What is the tok2vec listener? If I don't add it to all my other pipes, does it mean that they won't get to use the word2vec way of representing words as vectors when they're making their predictions?
Finally, about the blank model. If I use a blank model to tokenize my raw corpus text in order to use my labelling script, once I save the output doc into a DocBin and then save it to this, since it's a blank model it won't have annotations, will it? And if so, won't it technically generate data whose quality is worse than if it had these annotations? I mean, as far as I can understand, the unfrozen NER component that I'm training will probably rely on some of these annotations, right? Or is it not necessary because during training my data will go through the rest of the pipeline before reaching NER and will by then have its proper annotations?

Sorry for bombarding you with questions, I'm just starting with NLP and I think I'll use spaCy a lot and I'm loving it, so I'm really eager to learn it as much as I can :)

svlandeg Oct 15, 2021

If I have frozen some components in my pipeline (as @svlandeg mentioned) they'll still generate predictions for the training loop, but how do I pass these annotations/predictions forward so that my next pipes can use them? Is it by using the annotating_components part of the config file? And on a more 'under the hood' level, does that mean these annotations will get fed into the next prediction model as the pipeline goes forward, to help it get better at predicting the next components? If so, if only my NER pipe is unfrozen, it means that only this pipe will be trained, but it will still have access to the annotations from the previous pipes and will be able to use them to get better predictions, right?

That's right. Technically, what happens under the hood is that "annotating components", when frozen, do create predictions and store these on the Doc, just like they would when running predictions. The next component in the pipeline can then access these predictions from the Doc.
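
The same hand-off can be seen at prediction time with a small sketch. The sentence_counter component is hypothetical and only exists to show that a later component can read what an earlier one stored on the Doc; here the earlier one is the built-in sentencizer.

```python
import spacy
from spacy.language import Language


@Language.component("sentence_counter")  # hypothetical component name
def sentence_counter(doc):
    # Reads the sentence boundaries that the sentencizer wrote to the Doc
    # earlier in the pipeline and stores a count in doc.user_data.
    doc.user_data["n_sents"] = len(list(doc.sents))
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")       # runs first, sets sentence starts on the Doc
nlp.add_pipe("sentence_counter")  # runs second, consumes them

doc = nlp("First sentence. Second sentence.")
print(nlp.pipe_names)
print(doc.user_data["n_sents"])
```

Note that this only shows that the annotations are available on the Doc. Whether a given trainable component actually uses them as features is a separate question.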

What is the tok2vec listener? If I don't add it to all my other pipes, does it mean that they won't get to use the word2vec way of representing words as vectors when they're making their predictions?

Many of our builtin components have a tok2vec layer in their model. This Tok2Vec layer can be specific to the component, in which case it runs as part of the ML model of that component, and its embeddings aren't shared with other components.

Alternatively, you can have a tok2vec component in the pipeline, and downstream components can "listen" to that to retrieve the embeddings and feed it back gradients for backpropagation. This system allows you to have multiple downstream components all listening to the same tok2vec, thus sharing embeddings. Have a look at the docs here: https://spacy.io/usage/embeddings-transformers#embedding-layers
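
As a rough sketch of what the listener setup looks like in a config, assuming a pipeline with a tok2vec component and an ner component: the fragment is deliberately incomplete (the model blocks are missing most of their required settings and the width of 96 is an assumed value that would have to match the upstream tok2vec), and it is parsed with thinc's Config only to show the nesting.

```python
from thinc.api import Config

# Incomplete, illustrative fragment: the ner model does not get its own
# embedding layer, it listens to the shared tok2vec component instead.
LISTENER_FRAGMENT = """
[components]

[components.tok2vec]
factory = "tok2vec"

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = 96
upstream = "*"
"""

config = Config().from_str(LISTENER_FRAGMENT)
listener = config["components"]["ner"]["model"]["tok2vec"]
print(listener["@architectures"], listener["width"], listener["upstream"])
```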

Finally, about the blank model. If I use a blank model to tokenize my raw corpus text in order to use my labelling script, once I save the output doc into a DocBin and then save it to this, since it's a blank model it won't have annotations, will it? And if so, won't it technically generate data whose quality is worse than if it had these annotations? I mean, as far as I can understand, the unfrozen NER component that I'm training will probably rely on some of these annotations, right? Or is it not necessary because during training my data will go through the rest of the pipeline before reaching NER and will by then have its proper annotations?

I'm afraid I don't fully understand the question. A blank model is a pipeline that just does tokenization. The resulting Doc will be tokenized, but won't have any NER annotations or such.
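
A small sketch of what that means in practice, with an invented sentence and label: the Doc from a blank pipeline has tokens but no tags or entities until the labeling script sets them, and entities set by hand survive a round trip through a DocBin.

```python
import spacy
from spacy.tokens import DocBin, Span

nlp = spacy.blank("en")
doc = nlp("Apple opened an office in Lisbon.")

print(nlp.pipe_names)             # a blank pipeline has no components
print(doc.has_annotation("TAG"))  # so nothing has predicted tags
ents_before = doc.ents
print(ents_before)                # and there are no entities yet

# The labeling script is what adds the gold entities.
doc.set_ents([Span(doc, 5, 6, label="GPE")])  # token 5 is "Lisbon"

# Serialize to a DocBin and read it back (in memory here, instead of a file).
doc_bin = DocBin(docs=[doc])
restored = list(DocBin().from_bytes(doc_bin.to_bytes()).get_docs(nlp.vocab))[0]
print([(ent.text, ent.label_) for ent in restored.ents])
```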

Vfgandara Oct 15, 2021
Author

Thanks a lot for the answers!

Just to be 100% sure: if I add my components to annotating_components, will their predicted annotations not only be passed on downstream but also be used by the next pipes to help them in their predictions? For example, would a POS tagger's annotations help my NER model predict its annotations (if I've added the POS tagger to the annotating components, of course)?

Thanks a lot for the tok2vec explanation, I think I've finally understood it!

About the last part, I totally butchered my explanation, I'm sorry. What I was trying to ask is basically whether I should use a pre-trained model with all its predictions when generating my training data, or a more lightweight model with just the basics I would need for generating the training data. For example, I'm creating data for a NER model: should I just use a tok2vec and an EntityRuler, or would it make more sense to use a pretrained model (like "en_core_web_lg") and swap its NER model with my EntityRuler?
This question is kind of related to the first one actually, because in the end I'm just curious if adding the predictions from the pretrained model to the training data will help my future NER model train better (since technically it would have access to some useful annotations like pos tag for example). I know that it's not guaranteed that this extra info will help it at all, but I'm more curious whether it will be able to use these annotations while training the actual NER pipe from my model.

polm Oct 17, 2021

since technically it would have access to some useful annotations like pos tag for example

POS tags and annotations from other components aren't used by the NER models. They only use the input of the tok2vec.

If you have a POS and NER model sharing a tok2vec that will encourage the tok2vec to learn mutually useful representations, so they'll communicate indirectly, but just having the annotations in your training data doesn't do anything.
