Deeper Understanding on POS-Tags and their differences between models #12599

AnnemarieWittig · 2023-05-06T10:22:38Z

Hi!

I am trying to gain a deeper understanding of how POS tags are used, specifically in the German de_core_news_sm and English en_core_web_sm models. The only explanation of the tags I have found so far is this file: https://github.com/explosion/spaCy/blob/8f058e39bd95da1f14d0071452b4d58103014dc7/spacy/glossary.py

Is there a better explanation somewhere? Is there also a "mapping", or some insight into which tags are used in which model and why? I'd also be interested in how the tags compare to each other. For example, the "compound" tag used in en_core_web_sm seems to describe a relation similar to both the pnc and nmc tags in the German model. I couldn't find much on my own, so I'd appreciate some pointers!

Answered by adrianeboyd


adrianeboyd · 2023-05-08T06:09:41Z

Each model lists sources in nlp.meta and meta.json and on the pages under https://spacy.io/models. The fine-grained tags in token.tag are usually language-specific and frequently also corpus-specific, so you can find more information in the corpus documentation. English uses the PTB tagset and German uses the STTS tagset.
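As a rough sketch of how to look these things up from Python: `spacy.explain` returns the glossary description for a label, and `nlp.meta` holds the pipeline metadata. The helper name `source_names` is made up for this example, and the assumption that each entry under `"sources"` is a dict with a `"name"` key should be checked against the pipeline you actually have installed.

```python
import spacy


def source_names(nlp):
    """List the names of the training sources recorded in a pipeline's meta.

    Hypothetical helper: assumes nlp.meta may contain a "sources" list whose
    entries are dicts with a "name" key; anything else is converted with str().
    """
    sources = nlp.meta.get("sources") or []
    return [s.get("name", "") if isinstance(s, dict) else str(s) for s in sources]


# spacy.explain() looks a label up in spaCy's glossary and returns None
# if the label is unknown. The first two tags are PTB, the last two STTS.
for tag in ["VBZ", "NNS", "VVFIN", "APPR"]:
    print(tag, "->", spacy.explain(tag))

# With an installed trained pipeline (assumption: en_core_web_sm is available):
# nlp = spacy.load("en_core_web_sm")
# print(source_names(nlp))               # corpora the pipeline was trained on
# print(nlp.get_pipe("tagger").labels)   # fine-grained tags the tagger can predict
```

The glossary only gives one-line descriptions, so the corpus documentation named in the sources remains the place to look for the actual annotation guidelines.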

In contrast, token.pos uses universal POS tags from the Universal Dependencies project, which are the same across all languages. They're not 100% 1-to-1 for every single corpus/language in every possible detailed case, but they're used relatively consistently across languages.
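To make the contrast concrete, here is a small sketch that builds two `Doc` objects by hand instead of running a trained pipeline. The sentences and all annotations are assigned manually for illustration (they are not model output); with a real pipeline you would get the same attributes from `nlp(text)`.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-assigned annotations (assumption: chosen for illustration only).
# tags = fine-grained, corpus-specific tagset; pos = universal POS tags.
en_doc = Doc(
    Vocab(),
    words=["The", "dog", "runs"],
    tags=["DT", "NN", "VBZ"],        # PTB-style tags
    pos=["DET", "NOUN", "VERB"],
)
de_doc = Doc(
    Vocab(),
    words=["Der", "Hund", "läuft"],
    tags=["ART", "NN", "VVFIN"],     # STTS-style tags
    pos=["DET", "NOUN", "VERB"],
)

for doc in (en_doc, de_doc):
    for token in doc:
        # token.tag_ differs between the two tagsets, token.pos_ lines up
        print(f"{token.text:6} tag={token.tag_:6} pos={token.pos_}")
```

If you want to compare English and German output, `token.pos_` is therefore usually the better attribute to compare, and `token.tag_` the one to use when you need language-specific detail.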

2 replies

AnnemarieWittig May 9, 2023
Author

Thank you for the reply and the clarification!
So just to be clear, if I iterate over the children of my tokens like this:

for child in token.children:

am I using the language/corpus-specific labels?

adrianeboyd May 9, 2023

For English and German, both the structure of the dependency trees and the dependency labels are corpus-specific.

Most other trained pipelines use UD v2 dependency parses and POS tags from UD treebanks, although they still vary a small amount between languages and corpora, too.
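A minimal sketch of what iterating over `token.children` gives you: the children come from the parse tree, and the relation of each child to its head is in `child.dep_`. The tree below is written by hand (heads are absolute token indices) using English-style labels such as "compound"; it is an illustrative assumption, not the output of en_core_web_sm.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-written parse (assumption: labels follow the English label scheme).
words = ["I", "like", "apple", "pie"]
heads = [1, 1, 3, 1]                       # index of each token's head; ROOT points to itself
deps = ["nsubj", "ROOT", "compound", "dobj"]
doc = Doc(Vocab(), words=words, heads=heads, deps=deps)

for token in doc:
    for child in token.children:
        # child.dep_ is the label of the arc from token (head) to child
        print(f"{token.text} --{child.dep_}--> {child.text}")
```

With a German pipeline the same loop works unchanged, but both the shape of the tree and the values of `child.dep_` follow the German corpus conventions, so labels should not be compared across the two models without consulting the respective annotation guidelines.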
