NER Spacy custom tokenisation #11168
-
I want to detect dimensions within sentences with a custom NER model. I have examples like (tokenized with a blank English spaCy tokenizer):
1. '15.6 inch(39.6 cm) laptop' ->>
2. 'Mobile phone 6GB+128GB storage' ->>
When I try to build a DocBin from both of these sentences, I get an error.
This error usually appears when one token is assigned multiple entities (overlapping spans), but that is clearly not the case here. I have the entities and their annotations as (text, start, end). In the first example, "15.6 inch" is an entity and "39.6 cm" is an entity. But we do have characters that spaCy does not usually split on, e.g. "(" and "+". So does it make sense to just prepend them to the list of infixes?
One more thing I would have to do is remove some suffixes from spaCy's list. spaCy has 'GB' as a built-in suffix. If I remove 'GB' from the list, '128GB' would no longer be split. A question I still have no answer to: given that I can provide annotations that are not conflicting, why can't spaCy tokenise everything accordingly for the BIO/BILOU format? Do the entities I input necessarily need to be tokenised as a whole TOKEN? Can they not be split into TOKEN and SUFFIX by the tokeniser? I feel this discussion is closely related: #10331
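For reference, a minimal sketch of the infix change I have in mind, assuming spaCy v3 and a blank English pipeline (the exact token output may vary with the spaCy version):

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")

# Add "(" and "+" to the default infix patterns so they split mid-token.
# Prefix/suffix rules already handle these characters at token edges;
# infixes are needed for cases like "inch(39.6" and "6GB+128GB".
infixes = list(nlp.Defaults.infixes) + [r"\(", r"\+"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("15.6 inch(39.6 cm) laptop")])
```

Removing 'GB' is less direct, since it is part of the units-based suffix pattern rather than a standalone entry, so that would mean rebuilding the suffix regex without it.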
-
It should be fine to add those characters as infixes, or otherwise modify the tokenizer to get the tokens you need.

It's important to understand that the tokenizer doesn't know anything about your entity annotations. Entity annotations are applied after the tokenizer has already done its work, which is why you run into issues if your annotations aren't whole tokens. For the same reason, adding EntityRulers or Matchers will not change tokenization and will not fix your problem.

Consider the case where you have new data coming in. For new data you don't have entity annotations yet. If you expect spaCy to use your annotations at training time, how would it be able to get the same tokenization without those annotations, as on raw data?

Entity annotations have to apply to whole tokens because the NER component predicts an entity label for each token - it can't predict a label for half a token.
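One way to see the alignment problem concretely is `Doc.char_span`, which returns `None` when character offsets don't land on token boundaries (the `DIM` label below is just a placeholder, not a real label from this project):

```python
import spacy

nlp = spacy.blank("en")
text = "15.6 inch(39.6 cm) laptop"
doc = nlp(text)

# With the default tokenizer, "(" is not an infix, so "inch(39.6"
# stays a single token and the offsets of "39.6 cm" fall mid-token.
print([t.text for t in doc])

span = doc.char_span(10, 17, label="DIM")  # text[10:17] == "39.6 cm"
print(span)  # None -> the annotation does not align with token boundaries
```

Once the tokenizer is adjusted so that "(" splits off its own token, the same offsets yield a valid span, which is exactly what training-data alignment requires.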