Change the tokenizer of data already annotated with Prodigy. #11282

venti07 · 2022-08-09T07:25:31Z

I used Prodigy to annotate my own data for NER and used the standard tokenizer. Is it possible to change the tokenizer afterwards and adapt the annotated data to it?

I would like to use this tokenizer: https://github.com/ti250/cde2.1-ner-supplementary/blob/master/cde_removed/alt_tokenizers.py

This tokenizer is specifically for chemical data. I hope someone can help me and maybe has an idea of how to adapt the annotated data to the other tokenizer.

adrianeboyd · 2022-08-09T10:31:35Z

I'm not sure how you have integrated this, but note that this code doesn't appear to be a spacy tokenizer. A spacy tokenizer takes a str and returns a Doc.
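
As a rough sketch of what such a wrapper has to do (this is not the linked repository's code; `toy_split`, `words_and_spaces` and `_add_gap` are hypothetical names used only for illustration): an external tokenizer that returns plain strings needs its tokens aligned back to the text, so that the resulting `words` and `spaces` lists reproduce the original string exactly. Those two lists are what the `Doc` constructor accepts.

```python
import re


def toy_split(text):
    # Hypothetical stand-in for an external tokenizer that returns plain strings.
    return re.findall(r"\w+|[^\w\s]", text)


def words_and_spaces(text, tokens):
    """Align token strings to `text` and return (words, spaces) that rebuild it."""
    words, spaces = [], []
    pos = 0
    for tok in tokens:
        # Assumes every token is a verbatim substring of the text;
        # str.index raises ValueError if the tokenizer normalized it.
        start = text.index(tok, pos)
        _add_gap(words, spaces, text[pos:start])
        words.append(tok)
        spaces.append(False)
        pos = start + len(tok)
    _add_gap(words, spaces, text[pos:])
    return words, spaces


def _add_gap(words, spaces, gap):
    # A single space right after a token becomes that token's trailing space.
    if gap.startswith(" ") and words and not spaces[-1]:
        spaces[-1] = True
        gap = gap[1:]
    # Anything left over is kept as its own token so no characters are lost.
    if gap:
        words.append(gap)
        spaces.append(False)


text = "CuSO4 (aq) reacts."
words, spaces = words_and_spaces(text, toy_split(text))
rebuilt = "".join(w + (" " if s else "") for w, s in zip(words, spaces))

# With spaCy installed, a custom tokenizer's __call__ could then return:
#     Doc(nlp.vocab, words=words, spaces=spaces)
```

Keeping `rebuilt` identical to `text` is the point of the exercise: the character offsets of your existing annotations only stay valid if the tokenizer doesn't change the text.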

In terms of your annotated data, the important part for NER is that the start char and the end char of the entity span fall on a token boundary. As long as that's true for your data, the tokenizer used while training spacy can be different from the tokenizer used while annotating.

You can check whether the NER annotation aligns with token boundaries with something like this:

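import spacy
from spacy.tokens import DocBin

# Load the pipeline whose tokenizer you want to train with
nlp = spacy.load("/path/to/model_with_modified_tokenizer")
doc_bin = DocBin().from_disk("train.spacy")
for doc in doc_bin.get_docs(nlp.vocab):
    # Re-tokenize the raw text with the modified tokenizer
    retok_doc = nlp.make_doc(doc.text)
    for ent in doc.ents:
        # char_span returns None if the offsets don't fall on token boundaries
        span = retok_doc.char_span(ent.start_char, ent.end_char)
        if span is None:
            print("misaligned:", ent.text, "--", doc.text)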

If it's only a few entities, it's usually not a huge concern, since the NER component can automatically ignore these cases, but if it's a large percentage of the entities, then it doesn't make sense to train with this data + this tokenizer.
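
To put a number on "a large percentage", the same check can be turned into a ratio. The sketch below is plain Python and assumes Prodigy-style records with a `"text"` key and `"spans"` carrying `"start"` and `"end"` character offsets; `misaligned_rate` and `punct_split` are hypothetical helpers, and the two example records are made up for illustration.

```python
import re


def misaligned_rate(examples, tokenize):
    """Fraction of annotated spans whose start or end offset does not
    coincide with a token boundary produced by `tokenize`."""
    total = bad = 0
    for eg in examples:
        text = eg["text"]
        starts, ends, pos = set(), set(), 0
        for tok in tokenize(text):
            # Assumes tokens are verbatim substrings of the text.
            start = text.index(tok, pos)
            pos = start + len(tok)
            starts.add(start)
            ends.add(pos)
        for span in eg.get("spans", []):
            total += 1
            if span["start"] not in starts or span["end"] not in ends:
                bad += 1
    return bad / total if total else 0.0


def punct_split(text):
    # Toy tokenizer: splits punctuation off from word characters.
    return re.findall(r"\w+|[^\w\s]", text)


# Made-up records in the assumed Prodigy-style layout.
examples = [
    {"text": "Add NaCl(aq) slowly", "spans": [{"start": 4, "end": 8, "label": "CHEM"}]},
    {"text": "Stir in ethanol", "spans": [{"start": 8, "end": 15, "label": "CHEM"}]},
]

whitespace_rate = misaligned_rate(examples, str.split)
punct_rate = misaligned_rate(examples, punct_split)
```

With whitespace splitting, "NaCl" is stuck inside the token "NaCl(aq)", so half of the spans are misaligned; with the punctuation-aware toy tokenizer, none are.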

2 replies

venti07 Aug 9, 2022
Author

Thanks for your feedback.

How would it be possible to retokenize the dataset then? This would probably require me to write custom code and then use your code snippet to check the entity boundaries?

Maybe I can ask a more general question about this: if I take an NER dataset from the internet and train a model with spaCy, does it matter which tokenization rules were used to split the tokens? BERT models, for example, also use a different tokenization for learning, and I am still able to train a BERT model with a normally annotated NER dataset.

adrianeboyd Aug 9, 2022

For the NER model it mainly matters that spacy can align the automatic tokenization from the spacy tokenizer in the config with the starts and ends of the annotated entities in the training data.

Splitting or merging tokens that don't cross entity boundaries shouldn't matter for the alignment. (At some point the token-based tok2vec features are going to not be useful for very short or very long tokens, but the alignment part during training should continue to work.)

You could have cases like the following, where any of the lines could be the training data or the predicted spacy tokenization. You should be able to train fine as long as the automatic spacy tokenization is relatively predictable.

[Emma/B-PER] [went to] [New York City/B-LOC]
[Emma/B-PER] [went] [to] [New/B-LOC] [York/I-LOC] [City/I-LOC]
[E/B-PER] [mma/I-PER] [we] [nt] [to] [New York City/B-LOC]

In contrast, this example would not work as training data because there's no way to tag partial tokens:

[Emma went] [to New] [York City]

And if the automatic spacy tokenization looked like this and you had an annotated training doc like one of those in the first block of examples, then the annotation on these misaligned tokens would be ignored during training.
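
To make the cases above concrete, here is a small pure-Python sketch. It is not spaCy's actual alignment code, and `token_offsets` and `bio_tags` are hypothetical helpers. It projects character-level entity spans onto a given tokenization and marks tokens with `-` (no usable annotation) when an entity boundary falls inside a token.

```python
def token_offsets(text, words):
    """Return (start, end) character offsets for each word, in order."""
    offsets, pos = [], 0
    for word in words:
        start = text.index(word, pos)
        pos = start + len(word)
        offsets.append((start, pos))
    return offsets


def bio_tags(text, words, entities):
    """entities: list of (start_char, end_char, label) tuples."""
    offsets = token_offsets(text, words)
    starts = {s for s, _ in offsets}
    ends = {e for _, e in offsets}
    tags = ["O"] * len(words)
    for ent_start, ent_end, label in entities:
        # The entity is only usable if both ends sit on token boundaries.
        aligned = ent_start in starts and ent_end in ends
        for i, (s, e) in enumerate(offsets):
            if e <= ent_start or s >= ent_end:
                continue  # token does not overlap this entity
            if not aligned:
                tags[i] = "-"
            elif tags[i] != "-":
                tags[i] = ("B-" if s == ent_start else "I-") + label
    return tags


text = "Emma went to New York City"
entities = [(0, 4, "PER"), (13, 26, "LOC")]

fine = bio_tags(text, ["E", "mma", "we", "nt", "to", "New York City"], entities)
broken = bio_tags(text, ["Emma went", "to New", "York City"], entities)
```

Here `fine` keeps both entities even though the tokens are split oddly, while every token in `broken` ends up as `-` because neither entity starts and ends on a token boundary.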
