Training data for English language models #9533

thiippal · 2021-10-24T17:44:03Z

Hey!

I'm extending some of my learning materials for spaCy and came across a question that I couldn't find an answer to.

According to the spaCy docs, the English language models are trained on the OntoNotes 5 corpus, ClearNLP (for converting the treebanks) and WordNet. But which data is the Morphologizer component of the English pipelines trained on? Some Universal Dependencies dataset?

adrianeboyd · 2021-10-25T06:17:38Z

Hi, the pretrained English pipelines don't actually contain a morphologizer. The token.pos and token.morph values come from hand-written rules that map from token.tag and other values (mainly orth and lemma) in the attribute_ruler. For the most part these rules come from the v2 tag map and morph rules, but we've made a few adjustments for v3, mainly adding some dep-based rules if a parse is available.
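As a rough illustration of this kind of mapping, here is a minimal sketch that uses a blank pipeline with an attribute_ruler and two made-up rules. The rules and the example sentence are assumptions for demonstration only; they are not the actual rules shipped with the en_core pipelines, and the fine-grained tags are supplied by hand because a blank pipeline has no tagger.

```python
import spacy
from spacy.tokens import Doc

# Blank English pipeline: no tagger, parser or pretrained weights needed.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("attribute_ruler")

# Hypothetical rules mapping a fine-grained tag (token.tag) to
# token.pos and token.morph. Illustrative only.
ruler.add(
    patterns=[[{"TAG": "VBZ"}]],
    attrs={"POS": "VERB", "MORPH": "Number=Sing|Person=3|Tense=Pres|VerbForm=Fin"},
)
ruler.add(
    patterns=[[{"TAG": "PRP"}]],
    attrs={"POS": "PRON", "MORPH": "PronType=Prs"},
)

# Provide the tags directly, standing in for a tagger's output.
doc = Doc(nlp.vocab, words=["She", "walks"], tags=["PRP", "VBZ"])
doc = ruler(doc)

for token in doc:
    print(token.text, token.tag_, token.pos_, token.morph)
```

In a pretrained pipeline the tagger would run before the attribute_ruler, so the rules see token.tag values predicted by the model rather than hand-assigned ones.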

Not all of the UD categories can be mapped easily from the OntoNotes annotation, so there may be some errors in the results, e.g. #8856 (reply in thread). We're working on updating some of the AUX/VERB and token.morph rules to be closer to UD English treebanks for the upcoming v3.2.0 models.

2 replies

thiippal Oct 26, 2021
Author

Hi @adrianeboyd, thanks for the informative answer! I never knew that the "morphologizer" is based on hand-written rules.

Are there any plans to train UD models for English in the future, or will you stick with OntoNotes for future releases?

adrianeboyd Oct 26, 2021

To clarify, the morphologizer is a statistical component, but pretrained pipelines that do not include a morphologizer can use rules in attribute_ruler to assign token.pos and token.morph.

For licensing reasons we'd prefer to stick with OntoNotes. We'd like to train on a UD conversion of OntoNotes, but the conversion is pretty involved. I'm not aware of any stable PTB conversions for UD v2 (just UD v1) and OntoNotes has some modified NP structure plus entity information that should factor into the conversion.

If you'd like and the licenses work for your purposes, you can train a morphologizer or parser on a UD English corpus and use them instead of the existing components in the en_core pipelines. You would need to remove most of the attribute_ruler rules if you did this.
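A minimal sketch of that swap, under heavy assumptions: a blank pipeline stands in for a pretrained en_core pipeline, and two hand-written toy sentences stand in for a UD English corpus. In practice you would train with a config file and a real corpus; this only shows the shape of replacing the rule-based component with a statistical morphologizer.

```python
import spacy
from spacy.training import Example

# Stand-in for a pretrained pipeline that assigns pos/morph by rules.
nlp = spacy.blank("en")
nlp.add_pipe("attribute_ruler")

# Drop the rule-based component and add a trainable morphologizer instead.
nlp.remove_pipe("attribute_ruler")
nlp.add_pipe("morphologizer")

# Toy annotations (hypothetical); a real setup would read a UD treebank.
train_data = [
    ("Dogs bark", {"words": ["Dogs", "bark"],
                   "pos": ["NOUN", "VERB"],
                   "morphs": ["Number=Plur", "Tense=Pres|VerbForm=Fin"]}),
    ("Cats sleep", {"words": ["Cats", "sleep"],
                    "pos": ["NOUN", "VERB"],
                    "morphs": ["Number=Plur", "Tense=Pres|VerbForm=Fin"]}),
]
examples = [Example.from_dict(nlp.make_doc(text), annots)
            for text, annots in train_data]

# Initialize labels and weights from the examples, then run a few updates.
nlp.initialize(lambda: examples)
for _ in range(20):
    losses = nlp.update(examples)

doc = nlp("Dogs bark")
for token in doc:
    print(token.text, token.pos_, token.morph)
```

The morphologizer predicts token.pos together with token.morph, which is why most of the attribute_ruler rules would become redundant after such a swap.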
