Parser not annotating sentence boundaries during training #11369

Larsdegroot · 2022-08-24T08:37:07Z

I'm trying to create a pipeline for relation extraction. For this I've modified the relation extractor in this spaCy project to use different features. The features I've chosen are the words in a sentence between the two entities, which is why I'm using doc.sents in the get_instances() function.

I've added a parser to the pipeline and added annotating_components = ["parser"] to the [training] block in my config.

However, I'm getting this error after running spacy train configs/config_shared_trf.cfg --output training --paths.train data/train.spacy --paths.dev data/dev.spacy -c ./scripts/custom_functions.py --gpu-id 0:

ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: `nlp.add_pipe('sentencizer')`. Alternatively, add the dependency parser or sentence recognizer, or set sentence boundaries by setting `doc[i].is_sent_start`.

I suspect it might have something to do with the custom data reader, but I don't see how it interferes. My understanding is that the reader also loads the entities defined in the training data (gold data), because the relation component needs them to make predictions. However, that doesn't explain why the parser is not annotating the examples being yielded here.

@spacy.registry.readers("Gold_ents_Corpus.v1")
def create_docbin_reader(file: Path) -> Callable[["Language"], Iterable[Example]]:
    return partial(read_files, file)


def read_files(file: Path, nlp: "Language") -> Iterable[Example]:
    """Custom reader that keeps the tokenization of the gold data,
    and also adds the gold GGP annotations as we do not attempt to predict these."""
    doc_bin = DocBin().from_disk(file)
    docs = doc_bin.get_docs(nlp.vocab)
    
    for gold in docs:
        
        pred = Doc(
            nlp.vocab,
            words=[t.text for t in gold],
            spaces=[t.whitespace_ for t in gold]
        )
        
        pred.ents = gold.ents
        yield Example(pred, gold)

Entire config:

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
seed = 342
gpu_allocator = "pytorch"

[nlp]
lang = "en"
pipeline = ["transformer","parser","ner", "relation_extractor"]
batch_size = 128
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.relation_extractor]
factory = "relation_extractor"
threshold = 0.5

[components.relation_extractor.model]
@architectures = "rel_model.v1"

[components.relation_extractor.model.create_instance_tensor]
@architectures = "rel_instance_tensor.v1"

[components.relation_extractor.model.create_instance_tensor.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0

[components.relation_extractor.model.create_instance_tensor.tok2vec.pooling]
@layers = "reduce_mean.v1"

[components.relation_extractor.model.create_instance_tensor.pooling]
@layers = "reduce_mean.v1"

[components.relation_extractor.model.create_instance_tensor.get_instances]
@misc = "rel_instance_generator.v1"

[components.relation_extractor.model.classification_layer]
@architectures = "rel_classification_layer.v1"
nI = null
nO = null

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.parser]
factory = "parser"
learn_tokens = false
min_action_freq = 30
moves = null
scorer = {"@scorers":"spacy.parser_scorer.v1"}
update_with_oracle_cut_size = 100

[components.parser.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "parser"
extra_state_tokens = false
hidden_width = 128
maxout_pieces = 3
use_upper = false
nO = null

[components.parser.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.transformer]
factory = "transformer"
max_batch_items = 4096
set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
mixed_precision = false

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.transformer.model.grad_scaler_config]

[components.transformer.model.tokenizer_config]
use_fast = true

[components.transformer.model.transformer_config]

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = ["parser"]
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256
get_length = null

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.00005

[training.score_weights]
dep_uas = 0.25
dep_las = 0.25
dep_las_per_type = null
sents_p = null
sents_r = null
sents_f = 0.0
ents_f = 0.5
ents_p = 0.0
ents_r = 0.0
ents_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

Entire error:

================================= train_gpu =================================
Running command: /home/ldegroot/anaconda3/envs/re/bin/python3 -m spacy train configs/config_shared_trf.cfg --output training --paths.train data/train.spacy --paths.dev data/dev.spacy -c ./scripts/custom_functions.py --gpu-id 0
/home/ldegroot/anaconda3/envs/re/lib/python3.10/site-packages/cupy/_environment.py:437: UserWarning: 
--------------------------------------------------------------------------------

  CuPy may not function correctly because multiple CuPy packages are installed
  in your environment:

    cupy, cupy-cuda111

  Follow these steps to resolve this issue:

    1. For all packages listed above, run the following command to remove all
       existing CuPy installations:

         $ pip uninstall <package_name>

      If you previously installed CuPy via conda, also run the following:

         $ conda uninstall cupy

    2. Install the appropriate CuPy package.
       Refer to the Installation Guide for detailed instructions.

         https://docs.cupy.dev/en/stable/install.html

--------------------------------------------------------------------------------

  warnings.warn(f'''
ℹ Saving to output directory: training
ℹ Using GPU: 0

=========================== Initializing pipeline ===========================
[2022-08-24 08:21:16,566] [INFO] Set up nlp object from config
[2022-08-24 08:21:16,576] [INFO] Pipeline: ['transformer', 'parser', 'ner', 'relation_extractor']
[2022-08-24 08:21:16,580] [INFO] Created vocabulary
[2022-08-24 08:21:16,581] [INFO] Finished initializing nlp object
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Traceback (most recent call last):
  File "/home/ldegroot/anaconda3/envs/re/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ldegroot/anaconda3/envs/re/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/ldegroot/anaconda3/envs/re/lib/python3.10/site-packages/spacy/__main__.py", line 4, in <module>
    setup_cli()
  File "/home/ldegroot/anaconda3/envs/re/lib/python3.10/site-packages/spacy/cli/_util.py", line 71, in setup_cli
    command(prog_name=COMMAND)
  File "/home/ldegroot/anaconda3/envs/re/lib/python3.10/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/ldegroot/anaconda3/envs/re/lib/python3.10/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/ldegroot/anaconda3/envs/re/lib/python3.10/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ldegroot/anaconda3/envs/re/lib/python3.10/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ldegroot/anaconda3/envs/re/lib/python3.10/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/ldegroot/anaconda3/envs/re/lib/python3.10/site-packages/typer/main.py", line 500, in wrapper
    return callback(**use_params)  # type: ignore
  File "/home/ldegroot/anaconda3/envs/re/lib/python3.10/site-packages/spacy/cli/train.py", line 45, in train_cli
    train(config_path, output_path, use_gpu=use_gpu, overrides=overrides)
  File "/home/ldegroot/anaconda3/envs/re/lib/python3.10/site-packages/spacy/cli/train.py", line 72, in train
    nlp = init_nlp(config, use_gpu=use_gpu)
  File "/home/ldegroot/anaconda3/envs/re/lib/python3.10/site-packages/spacy/training/initialize.py", line 84, in init_nlp
    nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer)
  File "/home/ldegroot/anaconda3/envs/re/lib/python3.10/site-packages/spacy/language.py", line 1308, in initialize
    proc.initialize(get_examples, nlp=self, **p_settings)
  File "/home/ldegroot/BERT_RelationshipExtraction/scripts/rel_pipe.py", line 173, in initialize
    label_sample = self._examples_to_truth(subbatch)
  File "/home/ldegroot/BERT_RelationshipExtraction/scripts/rel_pipe.py", line 183, in _examples_to_truth
    nr_instances += len(self.model.attrs["get_instances"](eg.reference))
  File "/home/ldegroot/BERT_RelationshipExtraction/scripts/rel_model.py", line 40, in get_instances
    for sentence in doc.sents:
  File "spacy/tokens/doc.pyx", line 875, in sents
ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: `nlp.add_pipe('sentencizer')`. Alternatively, add the dependency parser or sentence recognizer, or set sentence boundaries by setting `doc[i].is_sent_start`.

Larsdegroot · 2022-08-25T07:20:04Z

I've also tried training with just the sentencizer, and with both the parser and the sentencizer. Both result in the same error.

Larsdegroot · 2022-08-25T07:34:57Z

I tried to code a workaround by adding the sentence boundaries in the file loader. I made these changes to custom_functions.py:

@spacy.registry.readers("Gold_ents_Corpus.v1")
def create_docbin_reader(file: Path) -> Callable[["Language"], Iterable[Example]]:
    return partial(read_files, file)


def read_files(file: Path, nlp: "Language") -> Iterable[Example]:
    """Custom reader that keeps the tokenization of the gold data,
    and also adds the gold GGP annotations as we do not attempt to predict these."""
    doc_bin = DocBin().from_disk(file)
    docs = doc_bin.get_docs(nlp.vocab)
    
    sentencizer = spacy.load("blank:en") #CHANGE HERE
    sentencizer.add_pipe("sentencizer") #CHANGE HERE
    
    for gold in docs:
        
        temp_doc = sentencizer(gold.text) #CHANGE HERE
        sent_starts = [token.is_sent_start for token in temp_doc] #CHANGE HERE
        
        pred = Doc(
            nlp.vocab,
            words=[t.text for t in gold],
            spaces=[t.whitespace_ for t in gold],
            sent_starts=sent_starts #CHANGE HERE
        )
        
        pred.ents = gold.ents
        yield Example(pred, gold)

But this results in the same error.

adrianeboyd · 2022-08-31T09:22:57Z

annotating_components only sets annotation in the predicted doc, not in the reference docs, so if you need sentence boundaries in get_instances for the reference docs, you have to set them separately before training, either directly in the saved .spacy annotation or with a custom corpus reader.

For the custom reader, the tokenization for blank:en may not match the saved tokenization, so it would be better to process the gold Doc object with the sentencizer rather than gold.text.
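A minimal sketch of the corpus-reader approach described here, assuming the sentencizer's rule-based boundaries are acceptable as gold sentence boundaries. The helper name `make_example` is hypothetical and does not appear in the thread. It runs spaCy's `Sentencizer` directly on the gold `Doc`, so the saved tokenization is preserved, and copies the resulting boundaries to the predicted doc:

```python
from spacy.pipeline import Sentencizer
from spacy.tokens import Doc
from spacy.training import Example

# Rule-based sentence splitter. It is applied to an existing Doc,
# so the gold text is never re-tokenized.
sentencizer = Sentencizer()


def make_example(gold: Doc) -> Example:
    """Hypothetical helper: set sentence boundaries on gold and pred."""
    # Sets token.is_sent_start on the reference doc in place.
    gold = sentencizer(gold)

    pred = Doc(
        gold.vocab,
        words=[t.text for t in gold],
        spaces=[bool(t.whitespace_) for t in gold],
        # Copy the boundaries so doc.sents also works on the predicted doc.
        sent_starts=[t.is_sent_start for t in gold],
    )
    # Keep the gold entities on the predicted doc, as in the original reader.
    pred.ents = gold.ents
    return Example(pred, gold)
```

Inside `read_files`, the loop body would then reduce to `yield make_example(gold)`. With this in place, `doc.sents` should be available on `eg.reference`, which is the doc that `get_instances` receives during `initialize` in the traceback above.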

For testing, you can also have the corpus reader add the sentence boundaries to the predicted docs, but in practice you would want a component in the pipeline that adds this or you wouldn't be able to run the component on new data, and then this is where you could use annotating_components instead of setting annotation on the pred Doc in the corpus reader for a realistic training configuration based on the available pipeline components.

For simplicity, a sentencizer or a sourced senter would be easier than the parser for sentence boundaries in annotating_components, since you don't have to deal with the fact that the parser typically listens to a tok2vec/transformer component. (You can use replace_listeners with the parser, but this is going to be very large/slow for a transformer pipeline.)
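As a rough illustration of the sentencizer variant, the relevant parts of the config might look like the fragment below. This is a sketch under assumptions, not a tested configuration: the component order is a guess, and the remaining blocks of the original config are assumed to stay unchanged.

```ini
[nlp]
pipeline = ["sentencizer","transformer","ner","relation_extractor"]

[components.sentencizer]
factory = "sentencizer"

[training]
annotating_components = ["sentencizer"]
```

Note that this only sets sentence boundaries on the predicted docs during training. The reference docs still need boundaries from the saved .spacy data or from the corpus reader.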

9 replies

Larsdegroot Oct 20, 2022
Author

So the training data consists of sentences with NER annotations and relation annotations. The only thing being trained is the relation extractor (with the transformer it's listening to), so the gold docs not having sentence data should not interfere with training, right?

Would it be easier to get a new training dataset that has parser annotations? Since I'm sourcing the parser and not updating it, it should produce the same annotations when preparing the training data as at runtime.

With this setup of setting parser annotations on the training data, I could just train the relation extractor and later use spacy assemble to make a ['transformer', 'parser', 'transformer_ner', 'ner', 'transformer_rel', 'rel'] pipeline (I know using three transformers is not efficient at all, but speed is not needed for my use case).

adrianeboyd Oct 20, 2022

Why/how are you using get_lca_matrix in the relation extractor?

Larsdegroot Oct 20, 2022
Author

I've added a function to the model that gets the shortest dependency path (SDP). I want to add the token vectors of the SDP as a feature:

@spacy.registry.misc("rel_sdp_extracter.v1")
def create_sdp_extracter() -> Callable[[Doc, Tuple],  List]:
    def shortest_dependency_path(doc, instance) -> List:
        subj = instance[0].root
        obj = instance[1].root

        lca = doc.get_lca_matrix()[subj.i, obj.i]
        assert lca != -1, "No common ancestor."

        # get the path from the subject up to the lowest common ancestor
        subj_path = []
        node = doc[subj.i]
        while node.i != lca:
            node = node.head
            subj_path.append(node)

        # get the path from the object up to the lowest common ancestor
        obj_path = []
        node = doc[obj.i]
        while node.i != lca:
            node = node.head
            obj_path.append(node)

        return subj_path[:-1] + obj_path[::-1]
    return shortest_dependency_path

adrianeboyd Oct 20, 2022

If you don't have gold parses on your gold data, then you would need to run annotating_nlp on both the gold and pred docs in the corpus reader instead, and skip the sentencizer. Just be careful that you're really only running the parser and not modifying your NER annotation.

You could also run the parser and NER in advance and save it as part of the gold data and just copy it over in the corpus reader. When you create the pred doc, you would copy over the parse with Doc(..., deps=[], heads=[]).
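A small sketch of the "copy it over" step, assuming the gold docs already carry a dependency parse. The helper name `copy_parse` is hypothetical. It relies on the `heads` and `deps` arguments of the `Doc` constructor, where in spaCy v3 `heads` holds absolute token indices:

```python
from spacy.tokens import Doc


def copy_parse(gold: Doc) -> Doc:
    """Hypothetical helper: build a pred doc that carries over the gold parse."""
    return Doc(
        gold.vocab,
        words=[t.text for t in gold],
        spaces=[bool(t.whitespace_) for t in gold],
        # Absolute index of each token's head (a root token points to itself).
        heads=[t.head.i for t in gold],
        deps=[t.dep_ for t in gold],
    )
```

Because sentence boundaries are derived from the parse, `doc.sents` and `doc.get_lca_matrix()` should then be usable on the predicted doc without a sentencizer. Entities can still be copied afterwards with `pred.ents = gold.ents`, as in the original reader.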

Larsdegroot Oct 20, 2022
Author

Great, thank you so much for all the help! I'm never disappointed when asking for help here :).
