Warning of misaligned tokens in 'spacy debug data' #8843
Hello, when I use 'spacy debug data' to check the data quality, I get warnings like
I have added a suffix rule to separate '-' at the end of words by using a callback in the training config file, but my training data (the DocBin object) is tokenized in the same way, so this warning shouldn't appear. Here is my callback function (modified for the demo) and how I use it to generate the DocBin object.

The function callback:

from spacy.util import registry, compile_suffix_regex
def update_tokenizer(nlp):
    # Append a suffix pattern that splits a trailing '-' or '+' off a word.
    custom_suffixes = r'[-\+]$'
    suffix_re = compile_suffix_regex(tuple(list(nlp.Defaults.suffixes) + [custom_suffixes]))
    nlp.tokenizer.suffix_search = suffix_re.search

@registry.callbacks("custom_tokenizer")
def create_custom_tokenizer():
    return update_tokenizer

How I create DocBin:

import spacy
from spacy.tokens import DocBin

text = "This is a test doc- for the demo+"
output_path = 'output/path'
nlp = spacy.blank('fr')
update_tokenizer(nlp)
doc_bin = DocBin()
doc = nlp.make_doc(text)
doc.cats = {"CatName": True}
doc_bin.add(doc)
doc_bin.to_disk(output_path)

When I call I have checked in the source code of spaCy, and this is caused by the alignment check. For words like 'doc-' and 'demo+' in the test text above, However, when I check the tokens in Is this a bug, or am I missing something important?

Thanks in advance for your response.

[Environment info]
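To make concrete what a misaligned token means in this context, here is a minimal sketch in plain Python. It is not spaCy's alignment implementation: the helpers `token_spans` and `misaligned_tokens` are hypothetical names, and the two token lists are assumed tokenizations of the demo text (one without and one with the custom suffix rule), not output captured from spaCy.

```python
def token_spans(text, tokens):
    """Map each token to its (start, end) character offsets in text."""
    spans = []
    pos = 0
    for tok in tokens:
        start = text.index(tok, pos)  # raises ValueError if tok is not found
        end = start + len(tok)
        spans.append((start, end))
        pos = end
    return spans


def misaligned_tokens(text, predicted, reference):
    """Return predicted tokens whose character span has no exact match in reference."""
    ref_spans = set(token_spans(text, reference))
    pred_spans = token_spans(text, predicted)
    return [tok for tok, span in zip(predicted, pred_spans) if span not in ref_spans]


text = "This is a test doc- for the demo+"
# Assumed tokenization WITHOUT the custom suffix rule (hypothetical).
default_tokens = ["This", "is", "a", "test", "doc-", "for", "the", "demo+"]
# Assumed tokenization WITH the custom suffix rule (hypothetical).
custom_tokens = ["This", "is", "a", "test", "doc", "-", "for", "the", "demo", "+"]

print(misaligned_tokens(text, default_tokens, custom_tokens))
# → ['doc-', 'demo+']
```

If both sides are tokenized with the same rules, the function returns an empty list, which is the situation the question expects when the training data and the pipeline share the custom tokenizer.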
Replies: 1 comment 3 replies
This should have been fixed in #8776. Let's see, that was just after v3.1.1 so I don't think it's in a released version yet.