How do I retrieve spacy's compound words? #11969

AnnemarieWittig · 2022-12-13T13:33:15Z

Hi!

I am currently trying to retrieve any compound words from my sentence. For example, for a sentence such as "This is Angela Merkel", simply printing the token attributes gives me output such as:

This this PRON DT nsubj Xxxx True True
is be AUX VBZ ROOT xx True True
Angela Angela PROPN NNP compound Xxxxx True False
Merkel Merkel PROPN NNP attr Xxxxx True False
. . PUNCT . punct . False False

Now I know and can see using the visualizer that Angela is compound to Merkel, but I don't quite understand yet how I find this information using just Python. Could someone point me there?

My test script:

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is Angela Merkel.")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

Answered by polm

polm · 2022-12-14T04:48:33Z

The compound relation is a dependency relation and is set by the (dependency) parser. You can access it with token.dep_.

There's not a special function for getting words marked with the compound relation, but merge_noun_chunks should merge compound words (in addition to other chunks).

3 replies

AnnemarieWittig Dec 19, 2022
Author

Thank you for your answer! However, it is not quite what I need.

Yes, accessing token.dep_ is how I know that there are words marked as compounds (as seen in my output listed above), and from the visualization I know that the compound arc runs from "Merkel" to "Angela".

What I need to find out is how I can access that information in code.
Something like (pseudo code):

if STRING is compound:
   return all STRING2 that STRING is compound to

Obviously I might have more complex sentences in the actual situation, possibly containing multiple compound words that do not directly follow each other. Merging the words wouldn't really work for me, as I don't want or need them merged and would prefer to keep them all separate.

polm Dec 20, 2022

Ah, OK, thanks for explaining - that shouldn't be very hard then. You can do something like this:

for tok in doc:
    for child in tok.children:
        if child.dep_ == "compound":
            print(f"{child} is compound to {tok}")

You could also use the DependencyMatcher.
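
As a rough illustration of the DependencyMatcher route, the sketch below matches any token that is the immediate head of a token labelled "compound". To keep it runnable without a trained pipeline, the parse is written by hand from the output earlier in the thread (an assumption for illustration only); with a real model you would pass `nlp.vocab` and the parsed `doc` instead. The node names "head" and "modifier" and the key "COMPOUND" are arbitrary labels chosen here.

```python
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-annotated parse mirroring the output shown in the question, so that
# no trained model is needed. With a model: doc = nlp("This is Angela Merkel.")
vocab = Vocab()
words = ["This", "is", "Angela", "Merkel", "."]
heads = [1, 1, 3, 1, 1]  # absolute index of each token's head (root points to itself)
deps = ["nsubj", "ROOT", "compound", "attr", "punct"]
doc = Doc(vocab, words=words, heads=heads, deps=deps)

matcher = DependencyMatcher(vocab)
pattern = [
    # Anchor node: any token (empty attributes act as a wildcard).
    {"RIGHT_ID": "head", "RIGHT_ATTRS": {}},
    # "head > modifier": the anchor is the immediate head of a compound token.
    {
        "LEFT_ID": "head",
        "REL_OP": ">",
        "RIGHT_ID": "modifier",
        "RIGHT_ATTRS": {"DEP": "compound"},
    },
]
matcher.add("COMPOUND", [pattern])

pairs = []
for match_id, token_ids in matcher(doc):
    # token_ids follow the order of the pattern: [head, modifier]
    head_i, modifier_i = token_ids
    pairs.append((doc[modifier_i].text, doc[head_i].text))

for modifier, head in pairs:
    print(f"{modifier} is compound to {head}")
```

For this simple case the loop over tok.children above is shorter; the matcher probably only pays off once the pattern involves more than one relation.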

AnnemarieWittig Dec 20, 2022
Author

Your snippet already did it! Thank you!
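
For readers who want the lookup in the shape of the pseudo code above (given a word, return what it is compound to), here is a minimal sketch built on token.head and token.children. The helper names `compound_heads` and `compound_modifiers` are hypothetical, not part of spaCy, and the parse is again written by hand so that the example runs without a trained model.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab


def compound_heads(doc, text):
    """Heads of every token whose text equals `text` and whose label is 'compound'."""
    return [tok.head for tok in doc if tok.text == text and tok.dep_ == "compound"]


def compound_modifiers(token):
    """Direct children of `token` that are attached with the 'compound' label."""
    return [child for child in token.children if child.dep_ == "compound"]


# Hand-annotated parse (assumption for illustration); with a model you would
# use doc = nlp("This is Angela Merkel.") instead.
words = ["This", "is", "Angela", "Merkel", "."]
heads = [1, 1, 3, 1, 1]  # absolute index of each token's head
deps = ["nsubj", "ROOT", "compound", "attr", "punct"]
doc = Doc(Vocab(), words=words, heads=heads, deps=deps)

print(compound_heads(doc, "Angela"))  # heads that "Angela" is compound to
print(compound_modifiers(doc[3]))     # compound modifiers of "Merkel"
```

Both helpers return Token objects, so the tokens stay separate and nothing is merged. They only look one step along the tree; in longer compounds the head may itself carry the compound label, in which case you would follow .head again.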
