How to use the components of the training pipeline with an NER-annotated dataset? #10882
-
My training and test datasets were annotated for NER, so they contain only sentences with entities. However, I would like to know whether I can still use other components of the training pipeline (e.g., tagger, morphologizer, trainable_lemmatizer, and others) in my evaluation. When I try, I get the error:

> [E143] Labels for component 'morphologizer' not initialized. This can be fixed by calling add_label, or by providing a representative batch of examples to the component's […]

Is there any way to configure a default "initialization" for these components, or won't I be able to do this with my NER-annotated documents? To create my .spacy file I am using this conversion function (just so you can understand what annotation I am using) that I wrote to convert the previous spaCy annotation format to the new one:
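The original conversion function was not included here, but a minimal sketch of such a conversion looks like the following, assuming v2-style `(text, {"entities": [(start, end, label)]})` tuples (the sample text and offsets are illustrative, not from the original post):

```python
import spacy
from spacy.tokens import DocBin

# Hypothetical v2-style NER annotations: (text, {"entities": [(start, end, label)]})
TRAIN_DATA = [
    ("Apple is looking at buying U.K. startup for $1 billion",
     {"entities": [(0, 5, "ORG"), (27, 31, "GPE")]}),
]

nlp = spacy.blank("en")
db = DocBin()
for text, annotations in TRAIN_DATA:
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in annotations["entities"]:
        span = doc.char_span(start, end, label=label)
        if span is not None:  # skip spans that don't align with token boundaries
            ents.append(span)
    doc.ents = ents
    db.add(doc)
db.to_disk("./train.spacy")  # the .spacy file used for training/evaluation
```

Note that such a file contains only entity annotations, which is why components like the morphologizer have nothing to initialize from.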
Is there a way to use some sort of default initialization that enables other training pipeline components to run when evaluating these NER-annotated datasets? Many thanks!
-
Sorry, I don't understand what you want to do here. In evaluation the training pipeline is run as usual and compared to gold data (there is no separate "evaluation pipeline"). Also, if you are evaluating performance on NER data, then changing the tagger, for example, won't affect your evaluation at all, since only NER scores matter. Do you want to build a pipeline for making predictions that includes pretrained components as well as the component you just trained? If so, you can do that by sourcing components. Maybe you could explain what you're trying to do from the start? I notice you've had several connected issues recently; it might help us to have some more perspective on your project.
I'm not sure from your sample code if this describes your situation, but if your JSON data is the spaCy v2 format then …
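For reference, sourcing components is done in the training config. A minimal sketch, assuming a pretrained pipeline such as `en_core_web_sm` for the tagger and a locally trained NER model (the paths and component names are illustrative):

```ini
[nlp]
lang = "en"
pipeline = ["tagger", "ner"]

[components.tagger]
# reuse the tagger from an installed pretrained pipeline
source = "en_core_web_sm"

[components.ner]
# reuse the NER component you trained yourself
source = "./my_ner_model"

[training]
# don't update the sourced tagger during training
frozen_components = ["tagger"]
```

Sourced components come with their weights and labels already initialized, which avoids errors like E143 for components you don't intend to train.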