Tokenization Issue when loading conll #10104
When I am loading a conllpp dataset using a modified version of the snippet below, spaCy's tokenization differs from the dataset's tokenization. Is there a way to retokenize an existing Doc using the model and recalculate the ents spans without predicting the labels?

Additional Context:

```python
import spacy
from tqdm import tqdm

# model creation
nlp = spacy.load("en_core_web_trf")
nlp.add_pipe("sentencizer")

# text extraction (docs are the Doc objects built from the conllpp dataset)
texts = [doc.text for doc in docs]

# predicting
predictions = list(
    tqdm(nlp.pipe(texts), total=len(texts), desc="Predicting")
)
```

This results in a different tokenization. E.g. the dataset contains `[FREESTYLE, SKIING-WORLD, CUP, ...]`, but spaCy splits the hyphenated word. I think this happens almost all the time with hyphenated words. |
Replies: 1 comment 6 replies
You might start by looking at the docs for the tokenizer; there's an example in there of how to remove hyphens as infix operators. |
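Along the lines of that docs example, one way to stop the tokenizer from splitting on hyphens is to rebuild the infix rules without the hyphen pattern. This is a sketch, not the exact snippet from the docs: it assumes the default hyphen infix rule is the pattern containing `-|–|—`, and uses a blank pipeline for illustration (the same change applies to `en_core_web_trf`):

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")

# drop the default infix pattern that splits on hyphens between letters
# (assumption: it is the only default infix containing this alternation)
infixes = [pattern for pattern in nlp.Defaults.infixes if "-|–|—" not in pattern]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

doc = nlp("FREESTYLE SKIING-WORLD CUP")
print([t.text for t in doc])  # ['FREESTYLE', 'SKIING-WORLD', 'CUP']
```

Note that changing the tokenizer this way alters the input to every downstream component, so a pretrained pipeline's predictions may degrade on tokens it was not trained to see.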