Change the tokenizer of data already annotated with Prodigy. #11282
-
I'm not sure how you have integrated this, but note that this code doesn't appear to be a spaCy tokenizer. A spaCy tokenizer takes a text string as input and returns a Doc object.

In terms of your annotated data, the important part for NER is that the start char and the end char of each entity span fall on a token boundary. As long as that's true for your data, the tokenizer used while training spaCy can be different from the tokenizer used while annotating.

You can check whether the NER annotation aligns with token boundaries with something like this:

    import spacy
    from spacy.tokens import DocBin

    # Pipeline whose tokenizer has been replaced with the new one
    nlp = spacy.load("/path/to/model_with_modified_tokenizer")
    doc_bin = DocBin().from_disk("train.spacy")
    for doc in doc_bin.get_docs(nlp.vocab):
        # Re-tokenize the raw text with the new tokenizer only
        retok_doc = nlp.make_doc(doc.text)
        for ent in doc.ents:
            # char_span returns None if the offsets don't match token boundaries
            span = retok_doc.char_span(ent.start_char, ent.end_char)
            if span is None:
                print("misaligned:", ent.text, "--", doc.text)

If only a few entities are misaligned, it's usually not a huge concern, since the NER component can automatically ignore these cases. But if it's a large percentage of the entities, then it doesn't make sense to train with this data and this tokenizer.
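If you'd rather test a candidate tokenizer against the raw annotations before wiring it into a pipeline, the same boundary idea can be sketched in plain Python. This is only an illustration under a few assumptions: each annotated example is a dict with a "text" string and a "spans" list of character offsets ("start", "end"), the tokenizer is a function that returns token strings in order, and every token appears verbatim in the text. `simple_tokenize` is a hypothetical stand-in, not the chemistry tokenizer from the linked repo.

```python
import re


def token_boundaries(text, tokenize):
    """Collect the character offsets at which tokens start and end."""
    starts, ends = set(), set()
    pos = 0
    for tok in tokenize(text):
        if not tok:
            continue  # ignore empty tokens
        # Assumption: tokens are verbatim substrings of the text, in order.
        start = text.index(tok, pos)
        pos = start + len(tok)
        starts.add(start)
        ends.add(pos)
    return starts, ends


def misaligned_spans(example, tokenize):
    """Return the spans whose start or end is not on a token boundary."""
    starts, ends = token_boundaries(example["text"], tokenize)
    return [
        span
        for span in example.get("spans", [])
        if span["start"] not in starts or span["end"] not in ends
    ]


def simple_tokenize(text):
    """Hypothetical stand-in tokenizer: word runs and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


example = {
    "text": "We added 2-propanol to the flask.",
    "spans": [
        {"start": 9, "end": 19, "label": "CHEM"},   # "2-propanol": on boundaries
        {"start": 11, "end": 17, "label": "CHEM"},  # "propan": ends inside a token
    ],
}

for span in misaligned_spans(example, simple_tokenize):
    print("misaligned:", example["text"][span["start"]:span["end"]])
```

Running this over every annotated example and counting the misaligned spans gives a rough idea of how much of the data would be ignored with the new tokenizer.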
-
I used Prodigy to annotate my own data for NER and used the standard tokenizer. Is it possible to change the tokenizer afterwards and adapt the annotated data to it?
I would like to use this tokenizer: https://github.com/ti250/cde2.1-ner-supplementary/blob/master/cde_removed/alt_tokenizers.py
This tokenizer is designed specifically for chemical data. I hope someone can help and has an idea of how to adapt the annotated data to the other tokenizer.