Proposal: Non-NER span categorizer #3961
Replies: 18 comments
-
Don't know if this will help, but I've come to understand that one of the key factors is that the non-NER span detection task is vastly different in terms of deciding whether to start a span versus whether to end it. Usually the decision to start tagging a sequence is relatively easy: there's a sequence of words that brings us to a place in the text which becomes a placeholder to express something; think of looking for a topic-introduction phrase in a limited-domain context. On the other hand, the decision whether to close the span is extremely hard. Has enough information been said? Is the meaning of the text shifting? This is especially true in compound sentences, or even multi-sentence spans. Example: if you are running a zoning process in an urban municipal office, you will get a lot of input from people living in the area. You can imagine that there is a need to detect spans of text containing information about the requirements and needs that people have for an area. You will usually get the easy task of finding where they start:
-
Well wait a second here! Yeah, an annotator might initially want to do that, but you definitely can't let them. That's hopelessly broken. Even aside from the recognition challenge, if the span's scope and content are so undefined, how do you even use it in the rest of your application? You'll have this chunk of text, and you'll have to, like, make another bag-of-words model or something. You're setting yourself an unsolvable NLP problem that will take ages to annotate (as your spans are really long and uncertain), only to queue up a second difficult NLP problem downstream.
-
Yes, the task does require an extra model in many cases, but it's actually a reasonable expectation in a non-NER information extraction scenario, given the list of use cases in the proposal. Don't chatbots require extra models to handle non-NER arguments to their intents? I agree that this is a complicated task; ideally the annotator would have a wider ontology that allows splitting THE_NEED into multiple specific cases. But I do think that this is a fair general use case / expectation for a non-NER span marking model: to use it as a tier-one approach for extracting areas of interest in text. This area of interest might be quite complicated to define, and many users might want to define it naively, by just tagging the maximum sequence that conveys the desired meaning. My example is an extreme case of the
-
I wouldn't normally be so direct, but since we got to know each other a bit IRL: No, no, that's all wrong. You definitely shouldn't want this. If you need to extract a general area of interest, you should work at the sentence or paragraph level and just do text classification. Trying to extract a long, vaguely-defined region of text as a span is a terrible, bad, no good idea. You may as well align it to a sentence or paragraph boundary, because you're not getting any value out of the subsentence information, and it makes the task basically impossible, both to annotate and to recognise. This is also totally different from what's called a "multi-word expression". An MWE is basically a "word with spaces" --- a term like "artificial intelligence" is best understood as a single concept, rather than as a sum of its parts.
-
I think it might also be useful to have a variant of the loss that works on vector similarity rather than % string overlap. For instance, in free-text queries, multi-token names can be considered as perturbed versions of a canonical name. In such cases, a loss that works with the vector representations of the predicted and gold-standard spans might be better than a loss that works with surface string overlap.
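For illustration, here's a rough sketch of what such a loss term could look like, using cosine distance between spaCy Span vectors (the function name and details are just an assumption, not an existing API):

```python
import numpy

def span_vector_loss(pred_span, gold_span):
    """Loss from cosine distance between span vectors (spaCy's Span.vector
    is the average of the token vectors), rather than string overlap."""
    pred_vec, gold_vec = pred_span.vector, gold_span.vector
    norm = numpy.linalg.norm(pred_vec) * numpy.linalg.norm(gold_vec)
    if norm == 0.0:
        return 1.0  # no vectors available: maximal loss
    cosine = float(pred_vec.dot(gold_vec)) / norm
    return 1.0 - cosine  # 0.0 for identical directions, up to 2.0 for opposite
```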
-
@ramji-c that's a very interesting idea. I'll think about that.
-
You may have a look at my chunker / entity extractor here (using spaCy): More tools here (spell corrector, semantic grouping, ...): Coming very soon: a Cross-lingual Semantic Memory --- feed in sentences in any language, then search for sentences with a similar meaning to a given sentence in any other language.
-
I am working on a comparable issue and I am stuck on a solution for extracting the phrases. AFAIK, most existing solutions expect a noun to be the central aspect of the phrase. In my current domain (service reviews), this does not help: noun chunking is not doing the job, and neither is dependency or constituency parsing. What approach do you have in mind? (I mean, do you extract all phrases, even overlapping ones?)
-
@kommerzienrat, can you provide some of your sentences to test?
-
Indeed, e.g. the following sentence is a good example:
-
Are you looking for something like this?
-
It does relate to my problem, but it doesn't seem to be a perfect match for such spans.
-
@kommerzienrat you could check out the Flair framework, which does this kind of sequence labelling and is easy to get working: https://github.com/zalandoresearch/flair/blob/master/README.md On a separate note, do you have any recommendations for guidelines on constructing the kind of ontology needed to decide on useful spans?
-
Yay, thanks for this 💯. For my use case, the span indices are pre-determined using some rule, so I would think a loss function that's just cross-entropy would be more apt. It would be nice to have that as an option in the config. If I have time, I'll test an example. The default for
-
Thanks for the link to Flair. It is good to see more great stuff coming from Europe/Germany. I suspect it might not help in my case, but I am still giving it a try. What's more, I cannot determine span boundaries, due to the nature of my data. I would love to see a solution as proposed in the first post above, where one missing word does not prevent a whole span from being used.
-
I think that a decent use case (and a data set) would be these kinds of de-identification efforts, where the model aims at detecting spans containing NER-like personal information: https://ai.google/tools/datasets/audio-ner-annotations/ Some are single words that are clearly NER, like a person's name; others are relevant items with personal meaning, like a health insurance identifier or the name of the hospital a person goes to.
-
@honnibal Is there something we could try to use yet? I am working on a comparable issue (aspect-based sentiment analysis) and still have not found a working solution for identifying spans in texts. Thanks in advance!
-
More recent work-in-progress: #6747. Update: this will be released as experimental functionality in spaCy 3.1!
-
People often want to train spaCy's EntityRecognizer component on sequence labeling tasks that aren't "traditional" Named Entity Recognition (NER). A named entity actually has a fairly specific linguistic definition: it's either a definite descriptor or a numeric expression. Software designed for NER may not perform well on other sequence labelling tasks. This is especially true of spaCy's NER system, which makes fairly aggressive trade-offs. What we need is another system that's better suited to these non-NER sequence tasks.

WIP can be found here: https://github.com/explosion/spaCy/blob/feature/span-categorizer/spacy/pipeline/spancat.py
Examples of non-NER span categorization
(This section needs to be filled out with actual examples. For now I've just noted some general categories of problem.)
Chatbot tutorials and software have taken to calling all intent arguments "entities". I wish they wouldn't, but they do --- so people want to recognize and label these phrases with an "entity recognizer". These phrases could be anything. They're whatever argument of the "intent" the author has defined, even if they don't really make good linguistic sense. The boundaries of these "entities" can also be quite inconsistently defined.
In aspect-oriented sentiment analysis, one often has a set of attributes to recognize, like "price", "quality", "customer service", etc. Sometimes these can be recognized as topics in a sentence, but often people want them localised to specific spans.
In information or relation extraction, it's often important to recognize predicate words or phrases. For verbs, this is often a single word, but sometimes you want nouns as well, which can be multi-word expressions.
Tasks such as disfluency detection, detection of spelling errors, detection of grammatical errors, etc. are all quite sensible to encode as sequence tagging problems.
Problems with IOB/BILUO
NER systems typically make one decision per word. Usually that decision corresponds to a tag (such as an IOB or BILUO tag). spaCy does things only slightly differently: on each word we predict an action, where the actions correspond to the BILUO tags. Encoding the task as transition-based parsing has some subtle advantages I won't go into here.
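For a concrete picture of the encoding, spaCy v3's offsets_to_biluo_tags helper shows how a single gold span comes out as one decision per word (the label here is a made-up non-NER category, just for illustration):

```python
import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")
doc = nlp("The customer service at this place was awful")
# One gold span, given as (start_char, end_char, label):
tags = offsets_to_biluo_tags(doc, [(4, 20, "ASPECT")])  # "customer service"
print(tags)
# ['O', 'B-ASPECT', 'L-ASPECT', 'O', 'O', 'O', 'O', 'O']
```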
The problem is that the BILUO-style approach implies a loss where a boundary error means the model's prediction is completely wrong. This is the right way to think about NER, and probably chunking, but it doesn't match up well with what people often want on non-NER problems. Often people don't care much about the exact boundaries of their target spans: they would much rather have a boundary off by one than miss the entity entirely.
Suggested solution
The SpanCategorizer will consist of three components: a function that proposes candidate spans from the Doc, a token-to-vector model that encodes the document, and a model that classifies each candidate span.

For instance, we might set our get_spans function to get the span offsets for all unigrams, bigrams, noun chunks and named entities. No matter how many candidate spans we have, we only need to run the token-to-vector model once -- and that model should be able to do most of the "heavy lifting". For each span, we need to perform some sort of weighted sum, and then run a feed-forward network. Hopefully this can be quite cheap.

There's an additional trick we could play when encoding the problem, to reduce the number of spans. If we have relatively few classes (especially if we only have one class), we could have extra classes that denote span offsets. This is what object detection systems like YOLO do. Instead of generating all the overlapping spans, we can generate non-overlapping spans, and have classes that tell us to adjust the border. This might make things easier for the model as well, if it means the spans are of more uniform length.
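As a rough illustration of the candidate-proposal step described above (the get_spans name comes from the proposal, but the body here is only an assumption about how it might work):

```python
from typing import List
from spacy.tokens import Doc, Span

def get_spans(doc: Doc, max_ngram: int = 2) -> List[Span]:
    """Propose candidate spans: all n-grams up to max_ngram,
    plus noun chunks and named entities where available."""
    spans = []
    for n in range(1, max_ngram + 1):
        for start in range(len(doc) - n + 1):
            spans.append(doc[start : start + n])
    # Noun chunks need a parser, entities an NER component; both are optional.
    if doc.has_annotation("DEP"):
        spans.extend(doc.noun_chunks)
    spans.extend(doc.ents)
    return spans
```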
If we do have spans of non-uniform length, we should still be able to process them efficiently. We need to be able to handle cases where there are some long spans, and a lot of short ones, without blowing up the space by padding out into a huge square. Batching the spans by length internally should take care of this.
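A minimal sketch of that bucketing idea, assuming each span arrives as a (length, width) array of token vectors (the names are made up for illustration):

```python
from collections import defaultdict

import numpy

def batch_by_length(span_arrays):
    """Group variable-length spans so that each batch stacks into a dense
    (n_spans, length, width) array, instead of padding everything out
    to the longest span in the data."""
    buckets = defaultdict(list)
    for arr in span_arrays:
        buckets[arr.shape[0]].append(arr)
    return [numpy.stack(group) for group in buckets.values()]
```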
I think it would be good to have a loss function that assigns partial credit to spans according to how many tokens of overlap they have with the gold-standard. For instance, if we extract a span with 90% overlap with a gold-standard phrase, we can make the loss small.
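The partial credit might be computed from token overlap, for example like this (one possible formulation, not the actual implementation):

```python
def overlap_credit(pred, gold):
    """Token-level overlap between two (start, end) spans, as a fraction
    of the union (Jaccard). An exact match scores 1.0; a disjoint span 0.0."""
    intersection = max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = max(pred[1], gold[1]) - min(pred[0], gold[0])
    return intersection / union if union else 0.0

# A prediction one token off from a 10-token gold span still gets most
# of the credit, so the loss can scale with (1.0 - overlap_credit(...))
# instead of treating the prediction as entirely wrong:
assert abs(overlap_credit((0, 10), (1, 11)) - 9 / 11) < 1e-9
```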
Current progress
…TARGET, or three word sequences with TARGET in the middle.