Setting up training file & config to allow SpanCat to use transformer embeddings from adjacent sentences #11341
-
Hi all, I am fascinated by the recent addition of SpanCat, and I have already played around with it for some use cases (it works nicely)! I am planning to train a SpanCat model that considers context beyond the immediate sentence when predicting categories. When training a SpanCat model so that predictions are made using Transformer embeddings that span across sentence boundaries, how should I set up the IOB dataset and the training configuration? Here are some more details on the project. I appreciate any insights from the community!

Dataset
The following is an example sentence for how the annotation scheme looks in the IOB format.
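(The original example is not shown here; the following is a hypothetical illustration of the IOB scheme, with placeholder labels.)

```
The        O
patient    O
reported   O
a          O
severe     B-SYMPTOM
headache   I-SYMPTOM
yesterday  O
.          O
```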
Question regarding the IOB dataset
Intended behavior during training
Does batch setting help to tailor to my needs?
Here is the config file I have, which worked fine when I used a single-sentence dataset (i.e., the basic config works as intended).
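(The full config is not reproduced; below is a minimal sketch of the sections most relevant here, with placeholder values.)

```ini
[nlp]
lang = "en"
pipeline = ["transformer", "spancat"]

[components.transformer]
factory = "transformer"

[components.spancat]
factory = "spancat"
spans_key = "sc"

[components.spancat.suggester]
@misc = "spacy.ngram_suggester.v1"
sizes = [1, 2, 3]
```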
I really appreciate any insights into how to make this possible! Thank you so much in advance!
-
The most important thing for your data is that your categories are well designed and that the annotations reflect the output you want consistently. The architecture used to get features from your raw text will not change that.

It sounds like you have a lot of questions about how spaCy works internally. It's good to understand what your tools are doing, but to answer these it's probably easiest if you try the default settings first, just to have a complete functional pipeline, and then tweak it from there so you can measure iterative improvements.

Regarding "document boundaries": spaCy just uses lists of Docs in training. It doesn't have any other conception of a division within a Doc, and many components, including Transformers, don't consider sentence boundaries. In this case it sounds like you should just treat each of your three-sentence fragments as a Doc, so this would be set up exactly the same as your single-sentence dataset. You could also reassemble your fragments into larger docs if you have the metadata to do that, but it's not clear that either approach would be superior, so I would start with the simpler approach (using the fragments directly) to get a baseline. (It's not clear to me what resource constraints caused you to split the docs into three-sentence segments. Was it limited hardware, limited annotation resources, or...?)

To attach the metadata to the Doc you can use underscore attributes.

Regarding how the Transformer handles long documents, see the docs on span getters: basically, long docs are sliced up before being passed to the Transformer, and then the results of the slices are combined to get the representation of each token. Exactly how this is done is configurable.
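As a concrete sketch of treating each three-sentence fragment as its own Doc and attaching metadata via underscore attributes, here is a minimal example (the text, character offsets, label, and `source_id` attribute are all hypothetical; `"sc"` is spancat's default spans key):

```python
import spacy
from spacy.tokens import Doc, DocBin

# Register a custom extension once so fragment metadata lives on the Doc.
Doc.set_extension("source_id", default=None)

nlp = spacy.blank("en")
db = DocBin(store_user_data=True)  # needed so underscore data is serialized

# One three-sentence fragment becomes one training Doc.
text = "First sentence. Second sentence with a target span. Third sentence."
doc = nlp(text)
doc.spans["sc"] = [doc.char_span(39, 50, label="CATEGORY")]  # "target span"
doc._.source_id = "doc-17-fragment-2"  # hypothetical provenance metadata

db.add(doc)
db.to_disk("./train.spacy")
```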
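For the slicing itself, the span getter is configured on the transformer component; the stock strided-spans getter, for instance, cuts long Docs into overlapping windows (the values below are the ones used in spaCy's stock transformer configs):

```ini
[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96
```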