Using parsed dependencies in SpanCategorizer suggester with transformers #10139

phlobo · 2022-01-25T19:10:24Z

Hello!

I am implementing a custom suggester that uses dependency parse information (e.g., noun chunks) to suggest spans to the SpanCategorizer.

I basically follow the config suggested here:
#10059

and added:

[nlp]
lang = "de"
pipeline = ["transformer", "parser", "spancat"]

[components.parser]
source = "de_dep_news_trf"
replace_listeners = ["model.tok2vec"]

as well as

[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
patience = 1600
max_epochs = 100
max_steps = 20000
eval_frequency = 200
frozen_components = ["parser"]
annotating_components = ["parser"]
before_to_disk = null

In my custom suggester, I still get an error:
ValueError: [E029] noun_chunks requires the dependency parse, which requires a statistical model to be installed and loaded.

It has been suggested here:
#9201

that you need to add attrs to the embed section of tok2vec, but that section doesn't seem to exist for

[components.spancat.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "transformer"

Answered by adrianeboyd

adrianeboyd · 2022-01-26T08:14:39Z

Noun chunks require both token.pos and a parse, and POS usually comes from either tagger+attribute_ruler or morphologizer in the provided trained pipelines. (Sorry, the error message looks like it's gotten a bit out-of-date.)

For German, it's: tok2vec/transformer, morphologizer, parser. Both the morphologizer and the parser listen to the same transformer, and you don't want to duplicate the transformer component that many times with replace_listeners (it would be both huge and slow). Instead, it would be better to use a custom name for the spancat's transformer component and use that as upstream for the spancat listener. Put all the new components after the existing frozen components. So:

pipeline = ["transformer", "morphologizer", "parser", "transformer_spancat", "spancat"]

...

[components.spancat.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "transformer_spancat"

You don't need to add DEP anywhere with the transformer, or to use noun chunks with spancat.

This is not anything we published officially, and it's messy and inefficient, but I implemented a noun chunk suggester as a proof of concept. It suggests each noun chunk +/- two tokens as spans; look in scripts/code.py here:

https://github.com/adrianeboyd/projects/tree/feature/ner-bootstrapped-confidence/experimental/ner_confidence
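To make the "noun chunk +/- two tokens" idea concrete, here is a minimal sketch of the offset arithmetic only. It is not the code from the linked branch: the function name expand_chunks, the window parameter, and the reading that both boundaries may move independently by up to two tokens are all assumptions made for illustration. It works on plain (start, end) token offsets so it can be tried without loading a pipeline.

```python
from typing import Iterable, List, Tuple


def expand_chunks(
    chunks: Iterable[Tuple[int, int]], doc_len: int, window: int = 2
) -> List[Tuple[int, int]]:
    """Return candidate (start, end) token spans around each chunk.

    Hypothetical helper: each boundary of a chunk may shift by up to
    `window` tokens in either direction, clipped to the document.
    """
    candidates = set()
    for start, end in chunks:
        # Allowed positions for the left and right boundary.
        starts = range(max(0, start - window), min(doc_len, start + window) + 1)
        ends = range(max(0, end - window), min(doc_len, end + window) + 1)
        for s in starts:
            for e in ends:
                if s < e:  # skip empty or inverted spans
                    candidates.add((s, e))
    return sorted(candidates)
```

In a real suggester, the chunks would presumably come from (chunk.start, chunk.end) for each chunk in doc.noun_chunks, and the resulting offsets would still have to be packed into the Ragged array that spancat expects. The set removes duplicates when neighbouring chunks produce the same candidate.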

8 replies

phlobo Jan 26, 2022
Author

Looking again at your suggester example, I added a check for doc.has_annotation("DEP") and some print statements. It looks like noun chunks are actually assigned for some of the documents, but I do not understand why they are missing for others, especially since those others definitely contain noun phrases that the de_dep_news_trf model is able to detect.

phlobo Jan 26, 2022
Author

After a bit more debugging, I noticed that it's only a batch of 10 documents at the beginning that lacks the annotation! After these, all docs seem to have noun chunks. So I guess my problem is solved, but the behaviour is a bit puzzling.

adrianeboyd Jan 26, 2022

That sounds like it might be a bug, but it's kind of hard to tell without a complete example. I ran into a number of bugs related to empty docs or docs without suggestions while working on this, but I don't think that's directly related to the noun chunk error.

phlobo Jan 26, 2022
Author

I'm happy to share the project, but can only share the data privately - if that's an option, please let me know.

adrianeboyd Jan 26, 2022

Ah, I think the problem is that the initialize step runs the suggester but the reference docs don't necessarily contain that annotation. I suspect there's no need to run the actual suggester and that running any simple suggester that suggests some spans would work just as well in that function.
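One way to cope with this inside a custom suggester is to guard the noun chunk access and fall back to something trivial when the annotation is missing. The sketch below is an illustration under assumptions, not an official recommendation: the function name chunk_offsets_or_fallback and the choice of single-token spans as the fallback are hypothetical. It relies only on doc.has_annotation, len(doc), and doc.noun_chunks, and it returns plain offsets rather than the Ragged array a registered suggester must produce.

```python
from typing import Any, List, Tuple

# Noun chunks need both POS tags and a dependency parse.
REQUIRED_ATTRS = ("POS", "DEP")


def chunk_offsets_or_fallback(doc: Any) -> List[Tuple[int, int]]:
    """Return noun chunk offsets, or single-token spans if unannotated.

    Hypothetical helper: `doc` is expected to behave like a spaCy Doc.
    """
    if len(doc) == 0:
        return []  # nothing to suggest for an empty doc
    if all(doc.has_annotation(attr) for attr in REQUIRED_ATTRS):
        return [(chunk.start, chunk.end) for chunk in doc.noun_chunks]
    # Assumed fallback: one span per token, so that steps which only need
    # "some spans" (such as initialization) do not raise E029.
    return [(i, i + 1) for i in range(len(doc))]
```

Checking before touching doc.noun_chunks matters because the property raises the E029 error on docs without a parse. Whether single-token spans are a sensible stand-in during initialization is a judgment call that depends on the data.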
