Example.get_aligned_parse can return multiple dependency labels for a single token #9718

pb-jeff-oneill · 2021-11-20T13:46:29Z

The documentation for Example.get_aligned_parse is here:
https://spacy.io/api/example#get_aligned_parse

With the default option of projectivize=True, get_aligned_parse can return multiple dependency labels for a single token like this:

relcl||nsubjpass

I think most people would assume that you would get a single valid dependency label for each token. This causes much extra debugging work and headaches down the road.

Seems worth mentioning in the documentation!

Answered by adrianeboyd

adrianeboyd · 2021-11-22T07:39:46Z

This kind of dependency label is the expected output for the pseudo-projective dependency parsing algorithm (Nivre and Nilsson, 2005) cited in the method description for projectivize=True. To be more specific, this method uses the "Head" decoration scheme. This is only an internal representation used within the parser, which converts it back to non-projective trees containing only the original labels for the final Doc annotation.

The idea is that the pseudo-projective version can be converted back to the original non-projective dependency tree in nearly all cases. If you strip the label after the || separator, you could have a dependency tree that's projective and that contains the same labels as in the original tree, but that can't be converted back to the original tree.
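To make the decorated label format concrete, here is a minimal sketch of splitting a label such as relcl||nsubjpass into its parts. The helper names are hypothetical and this is not spaCy's own code. It assumes, as I understand the "Head" scheme, that the part before || is the token's own label and the part after it is the label of the token's original head.

```python
from typing import Optional, Tuple

# Separator seen in decorated labels such as "relcl||nsubjpass".
DELIMITER = "||"


def split_decorated_label(label: str) -> Tuple[str, Optional[str]]:
    """Split a label into (own_label, head_label).

    head_label is None when the label carries no decoration.
    Hypothetical helper for illustration only.
    """
    own_label, sep, head_label = label.partition(DELIMITER)
    return own_label, (head_label if sep else None)


def strip_decoration(label: str) -> str:
    """Keep only the token's own label.

    Note: this discards the information needed to restore the
    original non-projective tree.
    """
    return split_decorated_label(label)[0]


print(split_decorated_label("relcl||nsubjpass"))
print(split_decorated_label("nsubj"))
print(strip_decoration("relcl||nsubjpass"))
```

Stripping is fine if you only need one plain label per token for inspection, but for training data the decoration should be kept intact.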

There are no built-in options for other projectivization algorithms in spaCy, but as long as you end up with projective trees, you can preprocess your data however you'd like before training the parser.
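If you do preprocess the data yourself, a quick check that the result really is projective can save some debugging. The following is a rough sketch, not spaCy's implementation: the function names are hypothetical, and it assumes heads is a list of absolute token indices in which a root token points to itself. An arc counts as non-projective when some token between the head and the dependent is not a descendant of that head.

```python
from typing import List, Tuple


def is_ancestor(heads: List[int], ancestor: int, token: int) -> bool:
    """Return True if `ancestor` is reached by walking up from `token`."""
    current = token
    steps = 0
    while heads[current] != current:  # a root points to itself
        current = heads[current]
        if current == ancestor:
            return True
        steps += 1
        if steps > len(heads):
            raise ValueError("cycle detected in heads")
    return False


def nonprojective_arcs(heads: List[int]) -> List[Tuple[int, int]]:
    """List (head, dependent) arcs that span a token not dominated by the head."""
    arcs = []
    for dep, head in enumerate(heads):
        if head == dep:
            continue  # root token, no incoming arc
        low, high = sorted((head, dep))
        for between in range(low + 1, high):
            if not is_ancestor(heads, head, between):
                arcs.append((head, dep))
                break
    return arcs


def is_projective(heads: List[int]) -> bool:
    return not nonprojective_arcs(heads)


print(is_projective([1, 1, 1]))            # simple projective tree
print(nonprojective_arcs([2, 3, 2, 2]))    # arc 3 -> 1 spans the root at 2
```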

0 replies

pb-jeff-oneill · 2021-11-22T12:29:52Z

@adrianeboyd, thank you, your explanation is very helpful.

An etiquette question. The GitHub new issue process invites us to submit suggestions for improving the documentation and that was the purpose of the issue I submitted.

Is this kind of request to update the documentation helpful? I appreciate that you are all very busy and I don't want to needlessly take your time.

1 reply

adrianeboyd Nov 22, 2021

Sure! When you brought it up, I actually expected the docs to be missing the details, since a lot of the parser internals aren't documented that well in the docs (there's more documentation directly in the code, but there's still a lot of room for improvement).

Were you using Example outside of spaCy training? I ask because I think most users typically wouldn't notice the projectivization / deprojectivization steps that happen inside the parser.

pb-jeff-oneill · 2021-11-22T13:25:25Z

It is part of training. This is my specific situation:

1. Used en_core_web_trf to label my training corpus for training a custom tagger/parser.
2. Used Example to update the labels to match my custom tokenizer.
3. Now we are experimenting with reducing the number of labels. Because we didn't expect the multiple labels, our code was crashing. It was challenging to figure out where and how the multiple labels were introduced.
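For the label-reduction step, one way to avoid crashing on decorated labels is to map each side of the || separator on its own and then rejoin the parts. This is only a sketch under assumptions: the LABEL_MAP contents and the function names are made up for illustration, and labels missing from the map are passed through unchanged.

```python
from typing import Dict, List

# Separator used in decorated labels such as "relcl||nsubjpass".
DELIMITER = "||"

# Hypothetical reduction map; replace with your own label scheme.
LABEL_MAP: Dict[str, str] = {"nsubjpass": "nsubj", "auxpass": "aux"}


def reduce_label(label: str, mapping: Dict[str, str]) -> str:
    """Map every part of a (possibly decorated) label, keeping the decoration."""
    parts = label.split(DELIMITER)
    return DELIMITER.join(mapping.get(part, part) for part in parts)


def reduce_labels(labels: List[str], mapping: Dict[str, str]) -> List[str]:
    return [reduce_label(label, mapping) for label in labels]


print(reduce_labels(["nsubjpass", "relcl||nsubjpass", "ROOT"], LABEL_MAP))
```

Keeping the decoration in place means the reduced labels still have the shape the parser expects for its projectivized input.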

I did delve into the parser code and was pleasantly surprised to see that the parser properly handles the || notation in the dependency labels. It was just really hard to find since you can't search on "||" in GitHub!

1 reply

adrianeboyd Nov 22, 2021

Ah, that makes sense. When I've worked on similar problems I've used the retokenizer to handle adjusting the annotation, but only really for merging tokens since that's a lot easier than splitting tokens or dealing with many-to-many alignments. As an example, we have an option to merge multi-word tokens in the UD corpus converter:

spaCy/spacy/training/converters/conllu_to_docs.py

Lines 254 to 296 in 52b8c2d

    
           def merge_conllu_subtokens(lines, doc): 
        
               # identify and process all subtoken spans to prepare attrs for merging 
        
               subtok_spans = [] 
        
               for line in lines: 
        
                   parts = line.split("\t") 
        
                   id_, word, lemma, pos, tag, morph, head, dep, _1, misc = parts 
        
                   if "-" in id_: 
        
                       subtok_start, subtok_end = id_.split("-") 
        
                       subtok_span = doc[int(subtok_start) - 1 : int(subtok_end)] 
        
                       subtok_spans.append(subtok_span) 
        
                       # create merged tag, morph, and lemma values 
        
                       tags = [] 
        
                       morphs = {} 
        
                       lemmas = [] 
        
                       for token in subtok_span: 
        
                           tags.append(token.tag_) 
        
                           lemmas.append(token.lemma_) 
        
                           if token._.merged_morph: 
        
                               for feature in token._.merged_morph.split("|"): 
        
                                   field, values = feature.split("=", 1) 
        
                                   if field not in morphs: 
        
                                       morphs[field] = set() 
        
                                   for value in values.split(","): 
        
                                       morphs[field].add(value) 
        
                       # create merged features for each morph field 
        
                       for field, values in morphs.items(): 
        
                           morphs[field] = field + "=" + ",".join(sorted(values)) 
        
                       # set the same attrs on all subtok tokens so that whatever head the 
        
                       # retokenizer chooses, the final attrs are available on that token 
        
                       for token in subtok_span: 
        
                           token._.merged_orth = token.orth_ 
        
                           token._.merged_lemma = " ".join(lemmas) 
        
                           token.tag_ = "_".join(tags) 
        
                           token._.merged_morph = "|".join(sorted(morphs.values())) 
        
                           token._.merged_spaceafter = ( 
        
                               True if subtok_span[-1].whitespace_ else False 
        
                           ) 
        
               with doc.retokenize() as retokenizer: 
        
                   for span in subtok_spans: 
        
                       retokenizer.merge(span) 
        
               return doc
