Creating Doc with deps breaks sentences segmentation #8861

dorianve · 2021-08-01T16:31:45Z

I am trying to create a Doc manually (that is, without calling nlp), with predetermined dependencies, sentence segmentation, and so on. I noticed the problem while using the French vocab.

Each token is treated as its own sentence.

The documentation for creating a Doc says that heads will override sent_starts. Do deps actually require heads to be set instead of sent_starts?

How to reproduce the behaviour

Here is an example with a dummy sentence I was using in a test in my code:

import spacy
from spacy.tokens import Doc

doc = Doc(
    spacy.blank("fr").vocab,
    words=['Jean', 'Benoit', '«', "j'", 'aime', 'faire', 'des', 'tests', '»', 'indique', '-t', '-il', '.'],
    sent_starts=[True, False, True, False, False, False, False, False, False, False, False, False, False],
    spaces=[True, True, True, False, True, True, True, True, True, False, False, False, False],
    deps=['vocative', 'flat:name', 'punct', 'nsubj', 'ROOT', 'xcomp', 'det', 'obj', 'punct', 'parataxis', 'part', 'nsubj', 'punct'],
)
print(list(doc.sents))

Expected output (also the output I get without deps), one line per sentence:

[Jean Benoit, 
« j'aime faire des tests » indique-t-il.]

Actual output, one line per sentence:

[Jean, 
Benoit, 
«,
j', 
aime, 
faire, 
des, 
tests, 
», 
indique, 
-t, 
-il, 
.]

Your Environment

Operating System: Manjaro Linux
Python Version Used: 3.9.5
spaCy Version Used: 3.0.6
Environment Information: Linux-5.4.131-1-MANJARO-x86_64-with-glibc2.33

Answered by adrianeboyd · 2021-08-02T07:07:41Z

You can have an unlabeled dependency tree that has heads but no deps*, but deps don't really make sense without heads: a token is only nsubj in relation to some other token, not on its own. You do need heads to go along with deps in a Doc; otherwise you get the default behavior, where each token is the root of its own independent tree, which makes each token its own sentence. When creating a Doc, heads will always override sent_starts if both are set.

(*For technical reasons, in spaCy you should use a placeholder label like dep instead of an empty dep label if you create an unlabeled dependency tree.)
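
To make this concrete, here is a minimal sketch that reuses the example from the question and adds heads. The heads list is an assumption made up for illustration; it does not come from the original post. Each entry is the absolute index of that token's head, and the two tokens that point to themselves (Jean and aime) serve as the roots of the two intended sentences. sent_starts is left out, since heads would override it anyway.

```python
import spacy
from spacy.tokens import Doc

words = ['Jean', 'Benoit', '«', "j'", 'aime', 'faire', 'des', 'tests', '»',
         'indique', '-t', '-il', '.']
spaces = [True, True, True, False, True, True, True, True, True,
          False, False, False, False]
# Labels copied unchanged from the question.
deps = ['vocative', 'flat:name', 'punct', 'nsubj', 'ROOT', 'xcomp', 'det',
        'obj', 'punct', 'parataxis', 'part', 'nsubj', 'punct']
# Hypothetical heads (an assumption for this sketch): absolute index of each
# token's head. Tokens 0 (Jean) and 4 (aime) are their own heads, so each one
# is the root of a separate tree.
heads = [0, 0, 4, 4, 4, 4, 7, 5, 4, 4, 9, 9, 4]

doc = Doc(
    spacy.blank("fr").vocab,
    words=words,
    spaces=spaces,
    heads=heads,
    deps=deps,
)

# Sentence boundaries are now derived from the two trees defined by heads.
sentences = [sent.text for sent in doc.sents]
for text in sentences:
    print(text)
```

With these heads, the Doc should split into the two sentences expected in the question, one per tree. Any other heads assignment would give a different segmentation, because the sentence boundaries follow the trees.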

