Errors in spaCy train (spacy-nightly) #6472
I encounter the following error when running the command:
My config.cfg file looks like this:
So what is a corpus, and what does it do when training a spaCy model?
Replies: 7 comments 2 replies
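A corpus is a dataset containing gold-standard annotations to train (and evaluate) an ML model. Typically, a Corpus registered function is used, as seems to be the case in your config (as I can deduce from the error message). It looks like your config probably contains a section like this:
This means that you'll need to define the variable dev in the [paths] section. You can keep this at null and override it on the CLI, exactly as you've done for the variable train. So you'll need something like this: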
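As a rough illustration of the point above, here is a minimal sketch, assuming a config laid out in the usual spaCy v3 way. The config excerpt is hypothetical and is not the poster's actual file, and the helper `unresolved_paths` is made up for this example; it only mimics, with the standard library, the check that every `${paths.X}` reference has a value either in `[paths]` or as a CLI-style override. spaCy does its own resolution internally.

```python
import configparser
import re

# Hypothetical config excerpt (an assumption, NOT the poster's real file).
# Section and key names follow the common spaCy v3 layout.
CONFIG_TEXT = """
[paths]
train = null
dev = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
"""


def unresolved_paths(config_text, overrides):
    """Return the [paths] variables that are referenced but still unset."""
    # interpolation=None keeps the raw "${paths.dev}" strings untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    paths = dict(parser["paths"]) if parser.has_section("paths") else {}

    # Collect every variable referenced as ${paths.<name>} anywhere in the config.
    referenced = set(re.findall(r"\$\{paths\.(\w+)\}", config_text))

    missing = []
    for name in sorted(referenced):
        # An override such as {"paths.dev": "./dev.spacy"} wins over the file value.
        value = overrides.get(f"paths.{name}", paths.get(name))
        if value is None or value == "null":
            missing.append(name)
    return missing
```

With no overrides, both `train` and `dev` are reported as unset. Passing both, which corresponds to a command along the lines of `python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy` (the file paths are placeholders), leaves nothing unresolved.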
If you don't have a separate dev set, you can reuse the training set. Just be aware that it'll be more difficult to spot overfitting. Overfitting generally occurs when the ML model learns to replicate the training dataset "too well", and starts to lose its generalization capability. You can typically see this during training when the training loss keeps improving (i.e. decreasing) while the dev performance starts getting worse. You won't be able to see that if you don't have a separate dev set. But technically, the training will still run properly.
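To make the overfitting signal concrete, here is a small sketch in plain Python. The function name, the `patience` parameter, and the numbers in the usage note are all made up for illustration; this is not something spaCy exposes. It flags the epoch where the dev score peaked if the score then fails to improve for a few epochs while the training loss keeps going down.

```python
def first_overfitting_epoch(train_losses, dev_scores, patience=2):
    """Return the index of the best dev epoch if overfitting is suspected, else None.

    Overfitting is suspected when the dev score has not improved for
    `patience` epochs while the training loss is lower than it was at the
    best dev epoch.
    """
    if len(train_losses) != len(dev_scores):
        raise ValueError("train_losses and dev_scores must have the same length")

    best_idx = 0
    for i in range(1, len(dev_scores)):
        if dev_scores[i] > dev_scores[best_idx]:
            # Dev performance is still improving: remember this epoch.
            best_idx = i
        elif i - best_idx >= patience and train_losses[i] < train_losses[best_idx]:
            # Loss keeps decreasing, but dev performance has stalled or dropped.
            return best_idx
    return None
```

For example, with invented losses `[10, 8, 6, 5, 4, 3]` and dev scores `[0.60, 0.70, 0.75, 0.74, 0.72, 0.70]`, the function points at epoch index 2. If the "dev" scores were computed on the training set itself, they would usually keep rising along with the falling loss, and nothing would be flagged.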
Thanks. I added --paths.dev to the command and the above issue is resolved. However, I now encounter another error:
Can you provide the full config file?
This is my config file with CPU preferred:
This is my config file with GPU preferred, and the error is as follows:
Also, I would like to train on top of the en_core_web_lg model instead of a blank en model. Where can I add that in the config file? Thanks.
Hi! As I told you on the issue tracker, we converted this issue to a thread on the discussion board, as this is more of a place where the community can come together, help each other, discuss best practices, and so on :-)
Anyway, I selected my answer to your first question as the correct answer, as that solved your original problem. Going forward, with this discussion board and the ability to select "correct answers", it might make sense to open new threads for new issues, as that also helps others quickly understand what the current problem is and whether they can help.
With respect to your second question: I copy-pasted your entire config file and ran it with the command
and this works fine on my end:
I assume the problem is with your data. Are you sure there is valid data in your .spacy files? Can you try reading them in with the DocBin class and see what's there?
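One way to do that check is sketched below, assuming spaCy v3 is installed. The function name `summarize_docbin` and the particular counts it returns are choices made for this example; `DocBin().from_disk(...)` and `get_docs(vocab)` are the spaCy calls doing the actual work.

```python
def summarize_docbin(path, lang="en"):
    """Load a .spacy file and return a few basic counts about its contents."""
    # Imported inside the function so it can be defined without spaCy installed.
    import spacy
    from spacy.tokens import DocBin

    # A blank pipeline is enough: only its vocab is needed to rebuild the docs.
    nlp = spacy.blank(lang)
    doc_bin = DocBin().from_disk(path)
    docs = list(doc_bin.get_docs(nlp.vocab))

    return {
        "n_docs": len(docs),
        "n_tokens": sum(len(doc) for doc in docs),
        # Docs that carry at least one entity annotation (relevant for NER).
        "n_docs_with_ents": sum(1 for doc in docs if doc.ents),
        # Docs that carry text classification labels (relevant for textcat).
        "n_docs_with_cats": sum(1 for doc in docs if doc.cats),
    }
```

Calling it on each file, for instance `summarize_docbin("./train.spacy")` with your own path in place of this placeholder, should show quickly whether a file is empty or lacks the annotations the pipeline is supposed to learn from.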
AssertionError: [E923] It looks like there is no proper sample data to initialize the Model of component 'tok2vec'. To check your input data paths and annotation, run: python -m spacy debug data config.cfg
I'm getting this error while running the train command. And when I run the debug command, I get an error as follows:
Please help.