Deeper Understanding on POS-Tags and their differences between models #12599
Hi! I am trying to gain a deeper understanding of how POS tags are used, specifically in the German de_core_news_sm and English en_core_web_sm models. The only explanation of the POS tags I have found so far is this part here: https://github.com/explosion/spaCy/blob/8f058e39bd95da1f14d0071452b4d58103014dc7/spacy/glossary.py Is there a better explanation somewhere? Can I also find a "mapping" or some insight into which tags are used in the different models, when, and why? I'd also be interested in how the tags "compare" to each other. For example, the "compound" tag used in en_core_web_sm seems to describe a similar relation to both the pnc and nmc tags in the German model. I couldn't find much on my own, so I'd appreciate some pointers!
Replies: 1 comment 2 replies
Each model lists sources in `nlp.meta` and `meta.json` and on the pages under https://spacy.io/models. The fine-grained tags in `token.tag` are usually language-specific and frequently also corpus-specific, so you can find more information in the corpus documentation. English uses the PTB tagset and German uses the STTS tagset.

In contrast, `token.pos` uses universal POS tags from the Universal Dependencies project, which are the same across all languages. They're not 100% 1-to-1 for every single corpus/language in every possible detailed case, but they're used relatively consistently across languages.
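
To make the difference concrete, here is a minimal sketch that puts the coarse universal tag and the fine-grained tag side by side for each token. The helper names `tag_rows` and `describe` are hypothetical, and the sketch assumes the pipeline's `meta.json` has a `sources` entry (it falls back to printing nothing if not) and that the model has already been downloaded. It uses the string attributes `token.pos_` and `token.tag_`, plus `spacy.explain` for the glossary description.

```python
def tag_rows(doc, explain):
    """Collect (text, universal POS, fine-grained tag, tag description) per token.

    `doc` is any iterable of tokens exposing `text`, `pos_` and `tag_`
    (a spaCy Doc qualifies). `explain` maps a label to a description or
    None, which is the shape of spacy.explain.
    """
    return [(t.text, t.pos_, t.tag_, explain(t.tag_) or "?") for t in doc]


def describe(model_name, text):
    """Print the sources a pipeline lists, then return its tag rows for `text`."""
    import spacy  # imported lazily so tag_rows stays usable without spaCy

    nlp = spacy.load(model_name)  # assumes the pipeline is already installed
    # Assumption: meta.json carries a "sources" list; it may be missing.
    for source in nlp.meta.get("sources") or []:
        print(source)
    return tag_rows(nlp(text), spacy.explain)
```

Calling `describe("en_core_web_sm", ...)` and `describe("de_core_news_sm", ...)` on comparable sentences should let you compare the second column (universal tags) directly, while the third column shows the PTB and STTS labels respectively.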