NER Spacy custom tokenisation #11168
-
I want to detect dimensions within sentences with a custom NER model. I have examples like (tokenized with a blank English spaCy tokenizer):
1. '15.6 inch(39.6 cm) laptop' ->>
2. 'Mobile phone 6GB+128GB storage' ->>
When I try to build a DocBin from both of these sentences, I get an error.
This error usually appears when one token is assigned multiple entities (overlapping spans), but that is clearly not the case here. I have the entities and their annotations as (text, start, end). In the first example, "15.6 inch" is an entity and "39.6 cm" is an entity. But we do have characters that spaCy does not usually split on, e.g. "(" and "+". So does it make sense to just prepend them to the list of infixes?
One more thing I would have to do is remove some suffixes from spaCy's list. spaCy has 'GB' as a built-in suffix. If I remove 'GB' from the list, '128GB' would no longer be split. A question I still have no answer to: given that I can provide annotations that are not conflicting, why can't spaCy tokenise everything accordingly for the BIO/BILOU format? Do the entities I input necessarily need to be tokenised as a whole TOKEN? Can they not be split into TOKEN and SUFFIX by the tokeniser? I feel this discussion is closely related: #10331
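For reference, a minimal sketch of the infix change I have in mind, assuming spaCy v3 and a blank English pipeline (the exact token output may vary with the spaCy version):

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")

# Add "(" and "+" to the default infix patterns so they split mid-token.
# Prefix/suffix rules already handle these characters at token edges;
# infixes are needed for cases like "inch(39.6" and "6GB+128GB".
infixes = list(nlp.Defaults.infixes) + [r"\(", r"\+"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("15.6 inch(39.6 cm) laptop")])
```

Removing 'GB' is less direct, since it is part of the units-based suffix pattern rather than a standalone entry, so that would mean rebuilding the suffix regex without it.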
-
It should be fine to add those characters as infixes, or otherwise modify the tokenizer to get the tokens you need.

It's important to understand that the tokenizer doesn't know anything about your entity annotations. Entity annotations are applied after the tokenizer has already done its work, which is why you run into issues if your annotations aren't whole tokens. For the same reason, adding EntityRulers or Matchers will not change tokenization and will not fix your problem.

Consider the case where you have new data coming in. For new data you don't have entity annotations yet. If you expect spaCy to use your annotations at training time, how would it be able to get the same tokenization without those annotations, as on raw data?

Entity annotations have to apply to whole tokens because the NER component predicts an entity label for each token - it can't predict a label for half a token.
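One way to see the alignment problem concretely is `Doc.char_span`, which returns `None` when character offsets don't land on token boundaries (the `DIM` label below is just a placeholder, not a real label from this project):

```python
import spacy

nlp = spacy.blank("en")
text = "15.6 inch(39.6 cm) laptop"
doc = nlp(text)

# With the default tokenizer, "(" is not an infix, so "inch(39.6"
# stays a single token and the offsets of "39.6 cm" fall mid-token.
print([t.text for t in doc])

span = doc.char_span(10, 17, label="DIM")  # text[10:17] == "39.6 cm"
print(span)  # None -> the annotation does not align with token boundaries
```

Once the tokenizer is adjusted so that "(" splits off its own token, the same offsets yield a valid span, which is exactly what training-data alignment requires.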