NER vs dep parser vs phrase matching #8691

vahuja4 · 2021-07-12T05:37:44Z

Hello,

I have a corpus which consists of sentences describing apparel. Here are a couple of examples:
This sweeping floor length beauty features a kimono tie wrap front and free flow sleeves, and is finished with flat bind piping and crystal embellishments
Also featuring a ruched waist with silver detail finished with sheer 1/2 length sleeves

From the above two sentences, as far as sleeves are concerned, I want to capture the italicized parts (everything related to sleeves). Should I use NER, dependency parsing, or phrase matching? I tried dependency parsing on the first sentence and it didn't work well: it did not capture the word 'free' as a dependent of 'sleeves'. I would like to understand how to decide on a technique for this, please.

Answered by polm

polm · 2021-07-12T06:34:47Z

Is this related to #8645, or is the corpus you're working with public?

Have you tried using noun chunks? spaCy has a built-in noun chunks feature that can capture phrases of the pattern ADJ* NOUN+, which will capture your italicized phrases. It will also capture other phrases ("floor length beauty", "kimono tie wrap front", "ruched waist"), but you can filter those out in post-processing (for example, by keeping only phrases that include "sleeves").
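
The post-processing step can be sketched as a small filter over chunk strings. This is a minimal sketch, not part of spaCy: `keep_chunks_with` is a hypothetical helper name, and the example list is hand-written from the phrases mentioned above rather than actual model output.

```python
def keep_chunks_with(chunk_texts, keyword="sleeves"):
    """Keep only the chunk strings that contain `keyword` as a whole word."""
    keyword = keyword.lower()
    return [text for text in chunk_texts if keyword in text.lower().split()]


# Typical use with spaCy (needs a pipeline with a parser, e.g. en_core_web_sm):
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   doc = nlp("Also featuring a ruched waist with sheer 1/2 length sleeves")
#   keep_chunks_with(chunk.text for chunk in doc.noun_chunks)

# Hand-written chunk strings (an assumption for illustration, not model output).
example = [
    "floor length beauty",
    "kimono tie wrap front",
    "free flow sleeves",
    "ruched waist",
]
print(keep_chunks_with(example))  # ['free flow sleeves']
```

Matching on whole words keeps the filter from firing on unrelated substrings. What the model actually returns as chunks will vary with the pipeline you load.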

Dependency parsing can be useful for adjective-noun constructions, but since their structure isn't very complicated in English, matching flat tag sequences (as the Matcher can do) is also effective.
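
A flat tag pattern of this kind might look like the following with spaCy v3's `Matcher`. This is a sketch under assumptions: the pattern name `SLEEVES`, the helper `find_sleeve_phrases`, and the choice to allow `NUM` tokens (for modifiers such as "1/2") are illustrative additions, and the results depend on how the loaded model tags each token.

```python
# Any run of adjectives, nouns, or numbers, followed by the literal token "sleeves".
SLEEVE_PATTERN = [
    {"POS": {"IN": ["ADJ", "NOUN", "NUM"]}, "OP": "*"},
    {"LOWER": "sleeves"},
]


def find_sleeve_phrases(nlp, text):
    """Return the longest non-overlapping spans in `text` matching SLEEVE_PATTERN."""
    # Imported here so the pattern above can be read and reused without spaCy loaded.
    from spacy.matcher import Matcher
    from spacy.util import filter_spans

    matcher = Matcher(nlp.vocab)
    matcher.add("SLEEVES", [SLEEVE_PATTERN])
    doc = nlp(text)
    # The "*" operator yields every partial match; filter_spans keeps the longest.
    spans = filter_spans(matcher(doc, as_spans=True))
    return [span.text for span in spans]


# Usage, assuming a pipeline with a tagger such as en_core_web_sm is installed:
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   find_sleeve_phrases(nlp, "finished with sheer 1/2 length sleeves")
```

Because the pattern only looks at part-of-speech tags and token text, it does not rely on the dependency parse being right.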

Phrase matching (as opposed to the plain Matcher) is restricted to matching things of the same token length, so it's not what you want here.

12 replies

polm Jul 12, 2021

The Jurafsky and Martin book is the best general resource, and it's written so you can skip around chapters quite freely. I still refer to it and find surprises sometimes.

vahuja4 Jul 12, 2021
Author

Thank you @polm for your replies! Much appreciated!

vahuja4 Jul 15, 2021
Author

@polm, can you please take a look at the sentence below and the corresponding noun chunks? I am not able to make sense of them:

sentence:
with on-trend puff sleeves and a flattering square neckline
chunks:
[{'chunk': 'trend', 'root': 'trend'}, {'chunk': 'a flattering square neckline', 'root': 'neckline'}]

Would you know why "sleeves" is not being considered a noun? Can you please shed some light?

vahuja4 Jul 15, 2021
Author

Another example:
Designed in a relaxed fabric, it comes in a relaxed shape that flows with the body.

chunks:
[{'chunk': 'a relaxed fabric', 'root': 'fabric'},
 {'chunk': 'it', 'root': 'it'},
 {'chunk': 'a relaxed shape', 'root': 'shape'},
 {'chunk': 'the body', 'root': 'body'}]

Here, I don't understand why "it" is being considered the root of a chunk, because it is a pronoun, not a noun.

polm Jul 15, 2021

Good questions. First, a note: the models aren't perfect. They will make mistakes, and sometimes the output will just be weird, so you should have a strategy for failing gracefully when the noun chunks are wrong.

Another question - what model are you using? I'm using the small model for testing purposes.

On the specific issues...

with on-trend puff sleeves and a flattering square neckline

One thing going on here is that there are different ways to analyze "on-trend". Is the whole thing an adjective, or is it an embedded prepositional phrase? spaCy's training data takes the latter view, so "trend" is a noun. The part-of-speech tagging here is fine: "puff sleeves" is correctly detected as a compound.

If you look at the source of the noun chunks code, you can see that it avoids generating nested chunks. A side effect is that once "trend" is detected as a chunk, "sleeves" can't be a chunk head, because "trend" is inside the subtree of "sleeves".
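
The effect of that nesting check can be illustrated with a small pure-Python sketch. This is a simplified illustration, not spaCy's actual implementation: the `Tok` class, the `is_head` flag, and the hand-assigned `left_edge` indices are all hypothetical stand-ins for what the parser would provide.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    i: int           # position of the token in the sentence
    text: str
    is_head: bool    # hypothetical flag: token could head a noun chunk
    left_edge: int   # index of the leftmost token in this token's subtree


def non_nested_chunks(tokens):
    """Emit chunks left to right, skipping heads that would contain an earlier chunk."""
    chunks = []
    prev_end = -1
    for tok in tokens:
        if not tok.is_head:
            continue
        # A head whose subtree starts at or before the last emitted chunk
        # would produce a nested (overlapping) chunk, so it is skipped.
        if tok.left_edge <= prev_end:
            continue
        prev_end = tok.i
        chunks.append(" ".join(t.text for t in tokens[tok.left_edge:tok.i + 1]))
    return chunks


# Hand-assigned analysis of "with on-trend puff sleeves" (an assumption):
# "on" attaches under "sleeves", so the subtree of "sleeves" starts at index 1.
tokens = [
    Tok(0, "with", False, 0),
    Tok(1, "on", False, 1),
    Tok(2, "-", False, 2),
    Tok(3, "trend", True, 3),
    Tok(4, "puff", False, 4),
    Tok(5, "sleeves", True, 1),
]
print(non_nested_chunks(tokens))  # ['trend']
```

Under this analysis "trend" is emitted first, and "sleeves" is then rejected because its subtree would swallow the chunk already produced.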

It might be possible to improve the way this works in noun chunks. However, if constructions like this are really important to you, you should also be able to write a Matcher rule to find them based on part of speech patterns, without reference to the dependency parse.

Here, I don't understand why "it" is being considered the root of a chunk, because it is a pronoun, not a noun.

The noun chunks code considers nouns, proper nouns, and pronouns to be valid noun chunk heads. You could consider that wrong, but it seems reasonable to me. If you don't want pronouns, they should be easy to discard in post-processing.
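
Discarding pronoun-headed chunks could look like the sketch below. The helper name `drop_pronoun_chunks` is hypothetical, and the demo uses `SimpleNamespace` stand-ins with hand-assigned tags instead of real spaCy spans, so it runs without a model. With spaCy you would pass `doc.noun_chunks` directly, since each chunk exposes `root.pos_`.

```python
from types import SimpleNamespace


def drop_pronoun_chunks(chunks):
    """Keep only chunks whose root token is not tagged as a pronoun."""
    return [chunk for chunk in chunks if chunk.root.pos_ != "PRON"]


def fake_chunk(text, root_pos):
    """Stand-in for a spaCy Span, carrying only the attributes used above."""
    return SimpleNamespace(text=text, root=SimpleNamespace(pos_=root_pos))


# Chunks from the example above, with hand-assigned root tags (an assumption).
demo = [
    fake_chunk("a relaxed fabric", "NOUN"),
    fake_chunk("it", "PRON"),
    fake_chunk("a relaxed shape", "NOUN"),
    fake_chunk("the body", "NOUN"),
]
kept = [chunk.text for chunk in drop_pronoun_chunks(demo)]
print(kept)  # ['a relaxed fabric', 'a relaxed shape', 'the body']
```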
