Creating Doc with deps breaks sentences segmentation #8861
I am trying to create a Doc manually (that is, not with 'nlp'), with predetermined dependencies, sentence segmentation, and so on. I was using the French vocab when I realized there was a problem: each token is considered to be a sentence.

From creating a Doc, in the documentation it is written that

How to reproduce the behaviour

Here is an example with a dummy sentence I was using in a test in my code:

import spacy
from spacy.tokens import Doc

doc = Doc(
    spacy.blank("fr").vocab,
    words=['Jean', 'Benoit', '«', "j'", 'aime', 'faire', 'des', 'tests', '»', 'indique', '-t', '-il', '.'],
    sent_starts=[True, False, True, False, False, False, False, False, False, False, False, False, False],
    spaces=[True, True, True, False, True, True, True, True, True, False, False, False, False],
    deps=['vocative', 'flat:name', 'punct', 'nsubj', 'ROOT', 'xcomp', 'det', 'obj', 'punct', 'parataxis', 'part', 'nsubj', 'punct'],
)
print(list(doc.sents))

Expected output (and the output I get without deps), one line per sentence:

Actual output, one line per sentence:
Your Environment
Replies: 1 comment
You can have an unlabeled dependency tree with heads without deps*, but deps don't really make sense without heads. A token is only `nsubj` in relation to some other token, not on its own. You do need `heads` to go along with `deps` in a `Doc`, or you will get the default behavior that each token is the root of its own independent tree, which leads to each token being its own sentence. When creating a `Doc`, `heads` will always override `sent_starts` if both are set.

(*For technical reasons in spaCy you should use a placeholder label like `dep` instead of an empty dep label if you create an unlabeled dependency tree.)
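
To make the difference concrete, here is a minimal sketch that builds the same tokens twice, once with `deps` only and once with `deps` plus `heads`. The two-sentence example, its labels, and its head indices are invented for illustration and do not come from the original question. It also assumes spaCy v3, where `heads` are given as absolute token indices and a root token points at itself.

```python
import spacy
from spacy.tokens import Doc

vocab = spacy.blank("fr").vocab

# Invented two-sentence example; words, deps and heads are illustrative only.
words = ["Je", "teste", ".", "Tu", "lis", "."]
spaces = [True, False, True, True, False, False]
deps = ["nsubj", "ROOT", "punct", "nsubj", "ROOT", "punct"]
# Assumption (spaCy v3): absolute index of each token's head; a root points at itself.
heads = [1, 1, 1, 4, 4, 4]

# deps without heads: every token falls back to being the root of its own tree,
# so each token ends up as its own sentence.
doc_no_heads = Doc(vocab, words=words, spaces=spaces, deps=deps)
print(len(list(doc_no_heads.sents)))

# deps together with heads: sentence boundaries follow the two dependency trees.
doc_with_heads = Doc(vocab, words=words, spaces=spaces, heads=heads, deps=deps)
sentences = [sent.text for sent in doc_with_heads.sents]
print(sentences)
```

Note that no `sent_starts` are passed in the second call: as described above, the boundaries are derived from `heads`, so a separate `sent_starts` list would be overridden anyway.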