ValueError during first epoch #6349
-
How to reproduce the behaviour
Your Environment
Hi, I'm trying to train a NER model from a pre-trained transformers model, and I encounter the following exception:

ℹ Using GPU: 0
=========================== Initializing pipeline ===========================
============================= Training pipeline =============================
0   0   0.00   0.00   0.03   0.05   0.02   0.00
…
Replies: 10 comments 1 reply
-
I'm not 100% sure, but that looks like the kind of error you could get when you don't have enough training data. What does debug data show? …
-
✘ Config validation error
{'lang': 'en', 'pipeline': ['transformer', 'ner'], 'tokenizer': {'@tokenizers': 'spacy.Tokenizer.v1'}}
If your config contains missing values, you can run the 'init fill-config' command:
python -m spacy init fill-config ner-biobert.cfg ner-biobert.cfg
-
Please follow the instructions to run init fill-config …
-
Ok so after running debug data, here is the output:

============================ Data file validation ============================
=============================== Training stats ===============================
============================== Vocab & Vectors ==============================
========================== Named Entity Recognition ==========================
================================== Summary ==================================

See, I have some entities like "... Acropora  Seriatopora" with two whitespaces between the two tokens, which are annotated as follows: [(38, 46, 'B-LIVB'), (47, 48, 'I-LIVB'), (48, 59, 'I-LIVB')]. The problem is that spaCy does not support entities consisting of, or starting/ending with, whitespace, so this creates an error. But I cannot trim the whitespace from my annotations, otherwise this will split the entity into two different entities, which is not correct... How can I solve this problem? Is there any alternative to getting rid of these problematic entities?
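For what it's worth, trimming only the span boundaries (and leaving interior whitespace alone) can be done directly on the character offsets before the data ever reaches spaCy. This is a minimal sketch with a hypothetical helper name, not a spaCy API:

```python
def trim_entity_span(text, start, end, label):
    """Shrink a (start, end, label) character span so it neither starts
    nor ends on whitespace; interior whitespace is left untouched."""
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return (start, end, label)

text = "coral species Acropora  Seriatopora observed"
# A span that accidentally ends on the whitespace run after "Acropora":
print(trim_entity_span(text, 14, 24, "LIVB"))
# → (14, 22, 'LIVB')
```

Trimming this way only moves the span edges, so an entity like "Acropora  Seriatopora" annotated as one span (14, 35) would be returned unchanged, not split in two.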
-
Whitespace in the middle of an entity shouldn't be a problem, but the entity shouldn't start or end with a whitespace token. Trimming the whitespace tokens from the beginning/end of the span shouldn't split the entity into two separate entities, should it? Or where do you see that happening? Can you run …
-
I preprocessed my annotations to remove leading and trailing whitespace, but I confirm that spaCy does not support whitespace-only entities like in my example, which happens when you have two whitespaces between two words (Acropora\s\sSeriatopora). Everything works if I replace my whitespace entities with some dummy character like "_", but I think it's best if I simply remove them from my dataset. Anyway, here is the result of debug data after removing all whitespace-related problems:

============================ Data file validation ============================
=============================== Training stats ===============================
============================== Vocab & Vectors ==============================
========================== Named Entity Recognition ==========================
================================== Summary ==================================
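Dropping the whitespace-only spans from the dataset, as described above, is a one-line filter over the character offsets. A minimal sketch (hypothetical helper name, assumes (start, end, label) tuples):

```python
def drop_whitespace_entities(text, spans):
    """Remove spans whose covered text is entirely whitespace,
    e.g. a span covering only the gap between two tokens."""
    return [(s, e, label) for s, e, label in spans if text[s:e].strip()]

text = "Acropora  Seriatopora"
spans = [(0, 8, "B-LIVB"), (8, 9, "I-LIVB"), (9, 21, "I-LIVB")]
print(drop_whitespace_entities(text, spans))
# The (8, 9) span covers only a space, so it is dropped:
# [(0, 8, 'B-LIVB'), (9, 21, 'I-LIVB')]
```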
-
Hmm, I'll look into the errors related to whitespace, because I thought they could be in the middle of entities (so …). It looks like …
-
Ah, looking again at the debug data output, I think you've provided your entity labels in an incorrect format. You want the character spans to cover the whole entity, and you don't include the B-/I- when you specify the entity span as above.
When you're converting from character offsets, you don't provide the IOB or BILUO tags, you just provide the top-level label for the whole span as one unit. With what you have, it's trying to learn I-LIVB as one entity type and B-LIVB as another entity type, which isn't what you want. That would explain why it's not handling the whitespace like I …
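The conversion described above — collapsing consecutive B-/I- fragments into a single span carrying the top-level label — can be sketched in plain Python. The helper name is hypothetical, not a spaCy API:

```python
def merge_biluo_fragments(spans):
    """Collapse consecutive (start, end, 'B-X') / (start, end, 'I-X')
    fragments into one (start, end, 'X') span per entity, the format
    expected when annotating with character offsets."""
    merged = []
    for start, end, tag in spans:
        prefix, _, label = tag.partition("-")
        if prefix == "I" and merged and merged[-1][2] == label:
            # Extend the previous span; any gap (e.g. whitespace between
            # fragments) is absorbed into the entity.
            prev_start, _, _ = merged[-1]
            merged[-1] = (prev_start, end, label)
        else:
            merged.append((start, end, label))
    return merged

print(merge_biluo_fragments(
    [(38, 46, "B-LIVB"), (47, 48, "I-LIVB"), (48, 59, "I-LIVB")]
))
# One top-level span covering the whole entity:
# [(38, 59, 'LIVB')]
```

With one span per entity, the whitespace between the tokens is simply interior to the span, so the whitespace-entity problem disappears.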
-
I was just wondering how spaCy handled multi-word entities. I didn't find much information about this in the docs, and I don't know why it didn't occur to me that spaCy could just handle them natively. It makes spaCy even more awesome than I already thought. So with top-level labels and whitespace trimming, everything seems to work just fine; training is currently running and all signals are green 👍 Thank you again @adrianeboyd for your support.
-
Closing this, as the issue seems resolved (and our GH bot seems a little out of sync ;-)) |