"Empty space" is assigned a "meaningless" dep_ attribute in spaCy v3 #8710
I noticed that in spaCy v2, whitespace tokens ("empty space") are not assigned any dep_ value, which I think is the correct behavior. In v3, however, a whitespace token gets a dep_ value, and the value does not make sense to me. For example, under "en_core_web_sm=3.1.0", the space in front of "We" in the sentence " We are going to school." gets the dep_ value "npadvmod" and points to the head "going". With "en_core_web_sm=2.3.0", the same space does NOT get any dep_ value and only points to its head "We". Since whitespace carries no real meaning, I suspect that every dep_ value assigned to it is incorrect. Note that the issue also affects other kinds of whitespace tagged "_SP", such as "\n". Can somebody kindly explain why a dep_ value is assigned to whitespace in v3?
Replies: 2 comments 2 replies
Thanks for the report! I suspect that this is just an oversight because the training data doesn't contain extra spaces like that - we had a similar issue with German in May where newlines were masculine. We fixed it by adding a rule to the pipeline, so we could do the same thing here. It's possible we made a decision at some point not to have blank dependencies in the default output, but I haven't heard of it.
If the doc is fully parsed, it's better not to have blank dependencies (for technical reasons, since heads don't have an "unset" value), but it would make sense for these to be treated differently. It looks like v2 had some hard-coded behavior for space tokens that was removed in v3. We usually use the placeholder label "dep" for cases like this, but you could also use whatever label you'd like.
In v3 these kinds of rule-based exceptions have been moved into the attribute_ruler. You can modify the rule that applies to all whitespace tokens in the attribute_ruler to do this: just add "DEP": "dep" (or whichever label you'd like) to the assigned attrs of the existing rule:
{'patterns': [[{'IS_SPACE': True}]], 'attrs': {'TAG': '_SP', 'POS': 'SPACE', 'MORPH': '_'}, 'index': 0}
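The rule change suggested in this reply could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names (is_space_rule, with_space_dep, override_space_dep) are hypothetical and not part of spaCy, the rule dict is copied from the reply, and the last function assumes a loaded v3 pipeline whose attribute_ruler runs after the parser.

```python
from copy import deepcopy

# The whitespace rule as quoted in the reply (format used by the attribute_ruler).
SPACE_RULE = {
    "patterns": [[{"IS_SPACE": True}]],
    "attrs": {"TAG": "_SP", "POS": "SPACE", "MORPH": "_"},
    "index": 0,
}


def is_space_rule(rule):
    """Return True if any token spec in the rule's patterns matches on IS_SPACE."""
    return any(
        token_spec.get("IS_SPACE") is True
        for pattern in rule["patterns"]
        for token_spec in pattern
    )


def with_space_dep(rules, label="dep"):
    """Return a copy of the rules with DEP added to every whitespace rule."""
    updated = deepcopy(rules)
    for rule in updated:
        if is_space_rule(rule):
            rule["attrs"]["DEP"] = label
    return updated


new_rules = with_space_dep([SPACE_RULE])
print(new_rules[0]["attrs"])


def override_space_dep(nlp, label="dep"):
    """Assumed wiring (not run here): add an extra whitespace rule that sets DEP
    on an already loaded spaCy v3 pipeline, e.g. spacy.load("en_core_web_sm")."""
    ruler = nlp.get_pipe("attribute_ruler")
    ruler.add(patterns=[[{"IS_SPACE": True}]], attrs={"DEP": label})
    return nlp
```

With a loaded pipeline, calling override_space_dep(nlp) and re-parsing " We are going to school." should let you check token.dep_ on the leading space; any label other than "dep" can be passed if you prefer a different placeholder.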