Tokenization Issue when loading conll #10104
When I am loading a conllpp dataset using a modified version of the snippet below, spaCy's tokenization differs from the dataset's tokenization. Is there a way to retokenize an existing Doc using the model and recalculate the ents spans without predicting the labels?

Additional Context:

```python
import spacy
from tqdm import tqdm

# model creation
nlp = spacy.load("en_core_web_trf")
nlp.add_pipe("sentencizer")

# text extraction (docs are the Doc objects built from the conllpp dataset)
texts = [doc.text for doc in docs]

# predicting
predictions = list(
    tqdm(nlp.pipe(texts), total=len(texts), desc="Predicting")
)
```

This results in a different tokenization. E.g. the dataset contains `[FREESTYLE, SKIING-WORLD, CUP, ...]`, but spaCy splits the hyphenated word. I think this happens almost all the time with hyphenated words. |
Replies: 1 comment 6 replies
You might start by looking at the docs for the tokenizer; there's an example in there of how to remove hyphens as infix operators. |
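Along the lines of that docs example, one way to stop the tokenizer from splitting on hyphens is to rebuild the infix rules without the hyphen pattern. This is a sketch, not the exact snippet from the docs: it assumes the default hyphen infix rule is the pattern containing `-|–|—`, and uses a blank pipeline for illustration (the same change applies to `en_core_web_trf`):

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")

# drop the default infix pattern that splits on hyphens between letters
# (assumption: it is the only default infix containing this alternation)
infixes = [pattern for pattern in nlp.Defaults.infixes if "-|–|—" not in pattern]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

doc = nlp("FREESTYLE SKIING-WORLD CUP")
print([t.text for t in doc])  # ['FREESTYLE', 'SKIING-WORLD', 'CUP']
```

Note that changing the tokenizer this way alters the input to every downstream component, so a pretrained pipeline's predictions may degrade on tokens it was not trained to see.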