ent.sent.text in spaCy returns the label instead of the sentence in an NER problem #12639
-
I'm trying to solve a Named Entity Recognition (NER) problem with spaCy on PDF files. I want to extract the modal verbs (will, shall, should, must, etc.) from the PDFs, so I trained a model in spaCy. `ent.sent.text` usually returns the text of the sentence from which the label was extracted, but in my case it returns the label itself. Can anyone help me, please? The code is given below:
Code for data preparation
Training the model
Calling the functions
Predicting using the model (here is where the problem starts)
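Roughly, the prediction step looks like the sketch below; the saved-model path, example text, and label name are placeholders rather than my exact code:

```python
import spacy

# Illustrative sketch: load the fine-tuned pipeline from disk
# ("output_model" is a placeholder path, not the real one).
nlp = spacy.load("output_model")

doc = nlp("The contractor shall submit the report before payment is made.")
for ent in doc.ents:
    # Expected: ent.sent.text is the full sentence containing the entity.
    # Observed: it comes back as just the entity text itself.
    print(ent.text, ent.label_, "->", ent.sent.text)
```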
Train Data
-
Just for my comprehension of the training data: e.g., what does a row like that correspond to visually, i.e., which characters of the text do the entity offsets actually cover?
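For example, assuming the usual `(text, {"entities": [(start, end, label)]})` shape (the row and label name below are made up), you can print exactly which characters an offset pair points at:

```python
# Hypothetical training row in spaCy's (text, {"entities": [(start, end, label)]}) shape
row = ("The contractor shall submit the report.", {"entities": [(15, 20, "MODAL")]})

text, annotations = row
start, end, label = annotations["entities"][0]
print(repr(text[start:end]), label)  # -> 'shall' MODAL: the characters the offsets cover
```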
-
Ah, I see what you mean now. If you print the sentence boundaries that the trained model detects, you will see that it puts each token in a sentence by itself.
The reason is that you are calling `begin_training`, which will also reinitialize all models. As a result, the parser (which performs the sentence splitting) will predict the sentence boundaries using a zeroed-out softmax layer and will start detecting a boundary after every token. So you should remove the line that calls `begin_training`. Then, later, when you update the pipe, you can remove the `sgd` parameter and the pipe will create an optimizer internally (see the sketch below).
By the way, you'll probably want to batch multiple training examples in a single update call rather than updating on one example at a time.
It doesn't pick up "will" yet, but that's probably because I only had two training examples. Also, training an existing NER pipe on a new data set will probably lead to catastrophic forgetting (meaning that the model will forget how to annotate the entities it was originally trained for).
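For illustration, here is a minimal sketch of such an update loop, resuming from `en_core_web_sm` and using a made-up `MODAL` label and two made-up training rows (not the poster's actual data):

```python
import random
import spacy
from spacy.training import Example

# Sketch only: resume from an existing pipeline and do NOT call
# nlp.begin_training()/nlp.initialize(), which would reinitialize the
# parser (the component that does the sentence splitting).
nlp = spacy.load("en_core_web_sm")
ner = nlp.get_pipe("ner")
ner.add_label("MODAL")  # made-up label name for the modal-verb entities

# Made-up training rows in the (text, {"entities": [(start, end, label)]}) shape
TRAIN_DATA = [
    ("The contractor shall submit the report.", {"entities": [(15, 20, "MODAL")]}),
    ("Payments must be made within 30 days.", {"entities": [(9, 13, "MODAL")]}),
]

# Only update the NER weights; leave the parser and tagger untouched
with nlp.select_pipes(enable=["ner"]):
    for _ in range(20):
        random.shuffle(TRAIN_DATA)
        examples = [
            Example.from_dict(nlp.make_doc(text), annots)
            for text, annots in TRAIN_DATA
        ]
        losses = {}
        # All examples batched into one update call; without an `sgd`
        # argument, spaCy creates and reuses an optimizer internally.
        nlp.update(examples, drop=0.2, losses=losses)
```

With only two rows it won't learn much; in practice you'd also mix in examples of the original entity types to reduce the catastrophic forgetting mentioned above.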