Warning of misaligned tokens in 'spacy debug data' #8843

Pandalei97 · 2021-07-29T08:57:39Z

Hello,

When I use 'spacy debug data' to check data quality, I get warnings like:

============================== Vocab & Vectors ==============================
ℹ 26076 total word(s) in the data (4682 unique)
⚠ 67 misaligned tokens in the training data
⚠ 44 misaligned tokens in the dev data

I did add a suffix rule to split off '-' at the end of words, using the callback in the training config file, but my training data (the DocBin object) is tokenized in the same way, so this warning shouldn't appear.

Here is my callback function (modified for the demo) and how I use it to generate the DocBin object.

The callback function:

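from spacy.util import registry, compile_suffix_regex

def update_tokenizer(nlp):
    # Extra suffix rule: split off a trailing '-' or '+'
    custom_suffixes = r'[-\+]$'
    # Rebuild the suffix regex from the language defaults plus the custom rule
    suffix_re = compile_suffix_regex(tuple(list(nlp.Defaults.suffixes) + [custom_suffixes]))
    nlp.tokenizer.suffix_search = suffix_re.search


# Registered so the training config can refer to it as "custom_tokenizer"
@registry.callbacks("custom_tokenizer")
def create_custom_tokenizer():
    return update_tokenizer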
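
To make the suffix rule concrete, here is a minimal sketch using only the standard `re` module. It checks which whitespace-delimited chunks of the demo text end in a character matched by the custom pattern. The helper name `trailing_suffix` is hypothetical and only for illustration; it does not reproduce spaCy's tokenizer, which combines this rule with the language defaults.

```python
import re

# Same pattern as custom_suffixes in the callback above.
CUSTOM_SUFFIX = re.compile(r"[-\+]$")


def trailing_suffix(chunk):
    """Return the trailing '-' or '+' of a chunk, or None if there is none."""
    match = CUSTOM_SUFFIX.search(chunk)
    return match.group(0) if match else None


# Show which chunks of the demo text the custom rule would apply to.
for chunk in "This is a test doc- for the demo+".split():
    print(chunk, "->", trailing_suffix(chunk))
```

Only 'doc-' and 'demo+' end in a matching character, so those are the two words the custom rule affects in this text.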

How I create the DocBin:

import spacy
from spacy.tokens import DocBin

text = "This is a test doc- for the demo+"
output_path = 'output/path'

nlp = spacy.blank('fr')
update_tokenizer(nlp)  # apply the same suffix rule as the training callback
doc_bin = DocBin()
doc = nlp.make_doc(text)
doc.cats = {"CatName": True}
doc_bin.add(doc)
doc_bin.to_disk(output_path)

When I call python -m spacy debug data -V config_textcat.cfg --code ./custom_tokenizer.py, the warning occurs.

I checked the spaCy source code, and this is caused by the alignment check. For words like 'doc-' and 'demo+' in the test text above, align.x2y.lengths[token.i] equals 2.

However, when I check, the tokens in eg.reference and eg.predicted are exactly the same and behave as I defined. I don't know what causes the alignment issue.
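
For readers unfamiliar with the alignment lengths, here is a simplified pure-Python sketch of what a value of 2 means: one token on the x side overlaps two tokens on the y side. This is an illustration based on character offsets, not spaCy's actual alignment algorithm, and the helper names `char_spans` and `x2y_lengths` are hypothetical.

```python
def char_spans(tokens):
    """Return (start, end) character offsets of each token, ignoring whitespace."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def x2y_lengths(x_tokens, y_tokens):
    """For each x token, count how many y tokens overlap it in the shared text."""
    if "".join(x_tokens) != "".join(y_tokens):
        raise ValueError("both tokenizations must cover the same text")
    y_spans = char_spans(y_tokens)
    return [
        sum(1 for ys, ye in y_spans if ys < xe and xs < ye)
        for xs, xe in char_spans(x_tokens)
    ]


# Tokenization without and with the custom suffix rule (written out by hand).
without_rule = ["This", "is", "a", "test", "doc-", "for", "the", "demo+"]
with_rule = ["This", "is", "a", "test", "doc", "-", "for", "the", "demo", "+"]

same = x2y_lengths(with_rule, with_rule)      # identical tokenizations
mixed = x2y_lengths(without_rule, with_rule)  # differing tokenizations
print(same)
print(mixed)
```

In this sketch, two identical tokenizations give a length of 1 for every token, and a length of 2 appears for 'doc-' and 'demo+' only when one side has not split off the suffix.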

Is this a bug, or am I missing something important?

Thanks in advance for your response.

[Environment info]
spaCy version: 3.0.6
Python version: 3.6

Answered by adrianeboyd

adrianeboyd · 2021-07-29T09:10:22Z

This should have been fixed in #8776. Let's see, that was just after v3.1.1 so I don't think it's in a released version yet.

3 replies

Pandalei97 Jul 29, 2021
Author

Thanks for your quick response! Will this issue cause negative effects during the training process?

adrianeboyd Jul 29, 2021

No, it's just a problem with the output in debug data.

adrianeboyd Jul 29, 2021

The changes in debug_data.py are really minimal, so you can edit locally and double-check if you'd like.
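
If you want to try a local edit, one possible way to find the installed copy of the file is sketched below. It assumes the module path is `spacy.cli.debug_data`; the helper `locate_module_file` is hypothetical and simply returns None when the module or its parent package cannot be found.

```python
import importlib.util


def locate_module_file(module_name):
    """Return the source path of an importable module, or None if not found."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ModuleNotFoundError:
        # Raised when a parent package (for example spacy itself) is missing.
        return None
    return spec.origin if spec is not None else None


# Assumed module path; prints None if spaCy is not installed.
print(locate_module_file("spacy.cli.debug_data"))
```

Changes made this way are overwritten the next time the package is reinstalled or upgraded.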
