ValueError during first epoch #6349
-
How to reproduce the behaviour
Your Environment
Hi, I'm trying to train a NER model from a pre-trained transformers model, and I encounter the following exception:

ℹ Using GPU: 0
=========================== Initializing pipeline ===========================
============================= Training pipeline =============================
0   0   0.00   0.00   0.03   0.05   0.02   0.00
…
Replies: 10 comments 1 reply
-
I'm not 100% sure, but that looks like the kind of error you could get when you don't have enough training data. What does debug data show? …
-
✘ Config validation error
{'lang': 'en', 'pipeline': ['transformer', 'ner'], 'tokenizer': {'@tokenizers': 'spacy.Tokenizer.v1'}}
If your config contains missing values, you can run the 'init fill-config' command:
python -m spacy init fill-config ner-biobert.cfg ner-biobert.cfg
-
Please follow the instructions to run init fill-config …
-
Ok so after running debug data, here is the output:

============================ Data file validation ============================
=============================== Training stats ===============================
============================== Vocab & Vectors ==============================
========================== Named Entity Recognition ==========================
================================== Summary ==================================

See, I have some entities like "... Acropora  Seriatopora" with two whitespaces between the two tokens, which are annotated as follows: [(38, 46, 'B-LIVB'), (47, 48, 'I-LIVB'), (48, 59, 'I-LIVB')]. The problem is that spaCy does not support entities consisting of, or starting/ending with, whitespace, so this creates an error. But I cannot trim the whitespace from my annotations, otherwise this will split the entity into two different entities, which is not correct... How can I solve this problem? Is there any alternative to getting rid of these problematic entities?
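For what it's worth, trimming only the span boundaries (and leaving interior whitespace alone) can be done directly on the character offsets before the data ever reaches spaCy. This is a minimal sketch with a hypothetical helper name, not a spaCy API:

```python
def trim_entity_span(text, start, end, label):
    """Shrink a (start, end, label) character span so it neither starts
    nor ends on whitespace; interior whitespace is left untouched."""
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return (start, end, label)

text = "coral species Acropora  Seriatopora observed"
# A span that accidentally ends on the whitespace run after "Acropora":
print(trim_entity_span(text, 14, 24, "LIVB"))
# → (14, 22, 'LIVB')
```

Trimming this way only moves the span edges, so an entity like "Acropora  Seriatopora" annotated as one span (14, 35) would be returned unchanged, not split in two.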
-
Whitespace in the middle of an entity shouldn't be a problem, but the entity shouldn't start or end with a whitespace token. Trimming the whitespace tokens from the beginning/end of the span shouldn't split the entity into two separate entities, should it? Or where do you see that happening? Can you run …
-
I preprocessed my annotations to remove leading and trailing whitespace, but I confirm that spaCy does not support whitespace-only entities like in my example, which happens when you have two whitespaces between two words (Acropora\s\sSeriatopora). Everything works if I replace my whitespace entities with some dummy character like "_", but I think it's best if I simply remove them from my dataset. Anyway, here is the result of debug data after removing all whitespace-related problems:

============================ Data file validation ============================
=============================== Training stats ===============================
============================== Vocab & Vectors ==============================
========================== Named Entity Recognition ==========================
================================== Summary ==================================
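Dropping the whitespace-only spans from the dataset, as described above, is a one-line filter over the character offsets. A minimal sketch (hypothetical helper name, assumes (start, end, label) tuples):

```python
def drop_whitespace_entities(text, spans):
    """Remove spans whose covered text is entirely whitespace,
    e.g. a span covering only the gap between two tokens."""
    return [(s, e, label) for s, e, label in spans if text[s:e].strip()]

text = "Acropora  Seriatopora"
spans = [(0, 8, "B-LIVB"), (8, 9, "I-LIVB"), (9, 21, "I-LIVB")]
print(drop_whitespace_entities(text, spans))
# The (8, 9) span covers only a space, so it is dropped:
# [(0, 8, 'B-LIVB'), (9, 21, 'I-LIVB')]
```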
-
Hmm, I'll look into the errors related to whitespace, because I thought they could be in the middle of entities (so …). It looks like …
-
Ah, looking again at the debug data output, I think you've provided your entity labels in an incorrect format. You want the character spans to cover the whole entity, and you don't include the B-/I- when you specify the entity span as above.
When you're converting from character offsets, you don't provide the IOB or BILUO tags, you just provide the top-level label for the whole span as one unit. With what you have, it's trying to learn I-LIVB as one entity type and B-LIVB as another entity type, which isn't what you want. That would explain why it's not handling the whitespace like I …
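The conversion described above — collapsing consecutive B-/I- fragments into a single span carrying the top-level label — can be sketched in plain Python. The helper name is hypothetical, not a spaCy API:

```python
def merge_biluo_fragments(spans):
    """Collapse consecutive (start, end, 'B-X') / (start, end, 'I-X')
    fragments into one (start, end, 'X') span per entity, the format
    expected when annotating with character offsets."""
    merged = []
    for start, end, tag in spans:
        prefix, _, label = tag.partition("-")
        if prefix == "I" and merged and merged[-1][2] == label:
            # Extend the previous span; any gap (e.g. whitespace between
            # fragments) is absorbed into the entity.
            prev_start, _, _ = merged[-1]
            merged[-1] = (prev_start, end, label)
        else:
            merged.append((start, end, label))
    return merged

print(merge_biluo_fragments(
    [(38, 46, "B-LIVB"), (47, 48, "I-LIVB"), (48, 59, "I-LIVB")]
))
# One top-level span covering the whole entity:
# [(38, 59, 'LIVB')]
```

With one span per entity, the whitespace between the tokens is simply interior to the span, so the whitespace-entity problem disappears.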
-
I was just wondering how spaCy handled multi-word entities. I didn't find much information about this in the docs, and I don't know why it didn't occur to me that spaCy could just handle them natively. It makes spaCy even more awesome than I already thought. So with top-level labels and whitespace trimming, everything seems to work just fine; training is currently running and all signals are green 👍 Thank you again @adrianeboyd for your support.
-
Closing this, as the issue seems resolved (and our GH bot seems a little out of sync ;-)) |