ent.sent.text in spaCy returns the label instead of the sentence in an NER problem #12639
-
I'm trying to solve a Named Entity Recognition (NER) problem with spaCy on PDF files. I want to extract the modal verbs (will, shall, should, must, etc.) from the PDFs, so I trained a model in spaCy. `ent.sent.text` usually returns the text of the sentence from which the label was extracted, but in my case it returns the label itself. Can anyone help me, please? The code is given below:
Code for data preparation
Training the model
Calling the functions
Predicting using the model (here is where the problem starts)
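Roughly, the prediction step looks like the sketch below; the saved-model path, example text, and label name are placeholders rather than my exact code:

```python
import spacy

# Illustrative sketch: load the fine-tuned pipeline from disk
# ("output_model" is a placeholder path, not the real one).
nlp = spacy.load("output_model")

doc = nlp("The contractor shall submit the report before payment is made.")
for ent in doc.ents:
    # Expected: ent.sent.text is the full sentence containing the entity.
    # Observed: it comes back as just the entity text itself.
    print(ent.text, ent.label_, "->", ent.sent.text)
```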
Train Data
-
Just for my comprehension of the training data: e.g., what does a row like that correspond to visually, i.e., which characters of the text do the entity offsets actually cover?
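For example, assuming the usual `(text, {"entities": [(start, end, label)]})` shape (the row and label name below are made up), you can print exactly which characters an offset pair points at:

```python
# Hypothetical training row in spaCy's (text, {"entities": [(start, end, label)]}) shape
row = ("The contractor shall submit the report.", {"entities": [(15, 20, "MODAL")]})

text, annotations = row
start, end, label = annotations["entities"][0]
print(repr(text[start:end]), label)  # -> 'shall' MODAL: the characters the offsets cover
```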
-
Ah, I see what you mean now. If you print the sentence boundaries that the trained model detects, you will see that it puts each token in a sentence by itself.
The reason is that you are calling `begin_training`, which will also reinitialize all models. As a result, the parser (which performs the sentence splitting) will predict the sentence boundaries using a zeroed-out softmax layer and will start detecting a boundary after every token. So you should remove the line that calls `begin_training`. Then, later, when you update the pipe, you can remove the `sgd` parameter and the pipe will create an optimizer internally (see the sketch below).
By the way, you'll probably want to batch multiple training examples in a single update call rather than updating on one example at a time.
It doesn't pick up "will" yet, but that's probably because I only had two training examples. Also, training an existing NER pipe on a new data set will probably lead to catastrophic forgetting (meaning that the model will forget how to annotate the entities it was originally trained for).
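For illustration, here is a minimal sketch of such an update loop, resuming from `en_core_web_sm` and using a made-up `MODAL` label and two made-up training rows (not the poster's actual data):

```python
import random
import spacy
from spacy.training import Example

# Sketch only: resume from an existing pipeline and do NOT call
# nlp.begin_training()/nlp.initialize(), which would reinitialize the
# parser (the component that does the sentence splitting).
nlp = spacy.load("en_core_web_sm")
ner = nlp.get_pipe("ner")
ner.add_label("MODAL")  # made-up label name for the modal-verb entities

# Made-up training rows in the (text, {"entities": [(start, end, label)]}) shape
TRAIN_DATA = [
    ("The contractor shall submit the report.", {"entities": [(15, 20, "MODAL")]}),
    ("Payments must be made within 30 days.", {"entities": [(9, 13, "MODAL")]}),
]

# Only update the NER weights; leave the parser and tagger untouched
with nlp.select_pipes(enable=["ner"]):
    for _ in range(20):
        random.shuffle(TRAIN_DATA)
        examples = [
            Example.from_dict(nlp.make_doc(text), annots)
            for text, annots in TRAIN_DATA
        ]
        losses = {}
        # All examples batched into one update call; without an `sgd`
        # argument, spaCy creates and reuses an optimizer internally.
        nlp.update(examples, drop=0.2, losses=losses)
```

With only two rows it won't learn much; in practice you'd also mix in examples of the original entity types to reduce the catastrophic forgetting mentioned above.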