Training one's own model with Prodigy + misc questions #11029
Replies: 1 comment
To address just some points...
The Spanish models do not use rule-based tagging. The post you link to states they have a rule-based tokenizer and lemmatizer. The Spanish models use a Morphologizer instead of a Tagger, so they predict Universal Dependencies coarse POS tags (aka UPOS tags), and use an attribute ruler to set fine-grained POS tags from that. I'm not specifically familiar with the data for the Spanish pipeline, but this is usually done when the training data only has UPOS tags. It may be helpful to keep in mind that the UPOS tags the Morphologizer sets correspond to The issue with Prodigy saying you have no labels and generally not working is probably because the training data has nothing in the
The Morphologizer can be trained like any component, by setting the values on your training data. For lemmas you can use the EditTreeLemmatizer.
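As a rough illustration of "setting the values on your training data", here is a minimal sketch that turns one annotated example into per-token UPOS labels. The JSON shape (a `tokens` list plus `spans` with `token_start`, an inclusive `token_end`, and `label`) is an assumption about what Prodigy's manual span interfaces export, and `upos_per_token` and the sample sentence are hypothetical names made up for this example.

```python
import json

# Hypothetical annotated example. The field names below are an assumption
# about the exported annotation format, not something stated in this thread.
RAW = """
{"text": "María come manzanas",
 "tokens": [{"text": "María", "start": 0, "end": 5, "id": 0},
            {"text": "come", "start": 6, "end": 10, "id": 1},
            {"text": "manzanas", "start": 11, "end": 19, "id": 2}],
 "spans": [{"start": 0, "end": 5, "token_start": 0, "token_end": 0, "label": "PROPN"},
           {"start": 11, "end": 19, "token_start": 2, "token_end": 2, "label": "NOUN"}]}
"""


def upos_per_token(example):
    """Return (words, tags) for one example; unannotated tokens get ''."""
    words = [tok["text"] for tok in example["tokens"]]
    tags = [""] * len(words)
    for span in example.get("spans", []):
        # token_end is treated as inclusive here (assumption).
        for i in range(span["token_start"], span["token_end"] + 1):
            tags[i] = span["label"]
    return words, tags


example = json.loads(RAW)
words, tags = upos_per_token(example)
print(list(zip(words, tags)))
```

With spaCy installed, lists like these could then be passed to the `Doc` constructor (for example `Doc(nlp.vocab, words=words, pos=tags)`) and collected in a `DocBin` to train a Morphologizer. Treat that as a sketch to check against the spaCy docs, not as the exact workflow described above.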
NER models don't directly use POS predictions. The only way they can interact is through a shared tok2vec. Sharing it is usually not beneficial, and it doesn't have a large effect in either direction. There are a couple of previous threads on this, like #9641.
It seems that for all of the Spanish pretrained models, there is no tagger element, so it cannot be trained using Prodigy's pos.teach feature. Does it have to do with the fact that Spanish models have rule-based tagging?
The behavior of pos.correct is also a little odd to me. When specifying a label with
prodigy pos.correct POS_correct es_core_news_lg ./data.json -l PROPN,NOUN,INTJ -U
I was only able to find examples of INTJ and PROPN highlighted in the dataset. Obvious examples of nouns that should have been highlighted were not. I think this is due to having the morphology appended to the POS tag for the fine-grained information.
This command
prodigy pos.correct POS_correct es_core_news_lg ./data.json -U
resulted in an error
✘ No --label argument set and no labels found in model
So,
How would one go about improving the POS tagging for these models?
pos.correct as normal?
pos.teach on these models?
Is there any way to train morphology predictions? Lemma predictions?
Does the POS tagging of words influence the predictions that an NER model makes after training? How much can incorrect POS tagging affect the accuracy of a model?
When you run prodigy train, is this just the default train settings for spaCy in a different wrapping?
And a more open-ended question: I'm starting to get to the point where it's time to fine-tune my training. Word vectors, transformers, etc., are all beyond my knowledge. A post here suggested a few resources, but I found the linked guides, for example this one, still above my level.
What resources are recommended in the community to demystify this?