@@ -30,10 +30,16 @@ into three components:
    tagging, parsing, lemmatization and named entity recognition, or `dep` for
    only tagging, parsing and lemmatization).
 2. **Genre:** Type of text the pipeline is trained on, e.g. `web` or `news`.
-3. **Size:** Package size indicator, `sm`, `md`, `lg` or `trf` (`sm`: no word
-   vectors, `md`: reduced word vector table with 20k unique vectors for ~500k
-   words, `lg`: large word vector table with ~500k entries, `trf`: transformer
-   pipeline without static word vectors)
+3. **Size:** Package size indicator, `sm`, `md`, `lg` or `trf`.
+
+   `sm` and `trf` pipelines have no static word vectors.
+
+   For pipelines with default vectors, `md` has a reduced word vector table with
+   20k unique vectors for ~500k words and `lg` has a large word vector table
+   with ~500k entries.
+
+   For pipelines with floret vectors, `md` vector tables have 50k entries and
+   `lg` vector tables have 200k entries.
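+
+   To see what this means in practice, here's a small sketch (the package name
+   and the printed shape are illustrative) that inspects the static vector
+   table of a loaded pipeline:
+
+   ```python
+   import spacy
+
+   # An `md` pipeline with default vectors has a reduced table of ~20k vectors
+   nlp = spacy.load("en_core_web_md")
+   # (number of vectors, vector width), e.g. (20000, 300)
+   print(nlp.vocab.vectors.shape)
+   ```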
 
 For example, [`en_core_web_sm`](/models/en#en_core_web_sm) is a small English
 pipeline trained on written web text (blogs, news, comments), that includes
@@ -90,19 +96,42 @@ Main changes from spaCy v2 models:
 In the `sm`/`md`/`lg` models:
 
 - The `tagger`, `morphologizer` and `parser` components listen to the `tok2vec`
-  component.
+  component. If the lemmatizer is trainable (v3.3+), `lemmatizer` also listens
+  to `tok2vec`.
 - The `attribute_ruler` maps `token.tag` to `token.pos` if there is no
   `morphologizer`. The `attribute_ruler` additionally makes sure whitespace is
   tagged consistently and copies `token.pos` to `token.tag` if there is no
   tagger. For English, the attribute ruler can improve its mapping from
   `token.tag` to `token.pos` if dependency parses from a `parser` are present,
   but the parser is not required.
-- The `lemmatizer` component for many languages (Catalan, Dutch, English,
-  French, Greek, Italian, Macedonian, Norwegian, Polish and Spanish) requires
-  `token.pos` annotation from either `tagger`+`attribute_ruler` or
-  `morphologizer`.
+- The `lemmatizer` component for many languages requires `token.pos` annotation
+  from either `tagger`+`attribute_ruler` or `morphologizer`.
 - The `ner` component is independent with its own internal tok2vec layer.
 
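+As a rough sketch (assuming one of the CNN/CPU pipelines described above), you
+can list the components of a loaded pipeline and print an overview of what each
+one assigns and requires:
+
+```python
+import spacy
+
+nlp = spacy.load("en_core_web_sm")
+# e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
+print(nlp.pipe_names)
+
+# Print a table of the attributes each component assigns and requires
+nlp.analyze_pipes(pretty=True)
+```
+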
+#### CNN/CPU pipelines with floret vectors
+
+The Finnish, Korean and Swedish `md` and `lg` pipelines use
+[floret vectors](/usage/v3-2#vectors) instead of default vectors. If you're
+running a trained pipeline on texts and working with [`Doc`](/api/doc) objects,
+you shouldn't notice any difference with floret vectors. With floret vectors no
+tokens are out-of-vocabulary, so [`Token.is_oov`](/api/token#attributes) will
+return `False` for all tokens.
+
+If you access vectors directly for similarity comparisons, there are a few
+differences because floret vectors don't include a fixed word list like the
+vector keys for default vectors.
+
+- If your workflow iterates over the vector keys, you need to use an external
+  word list instead:
+
+  ```diff
+  - lexemes = [nlp.vocab[orth] for orth in nlp.vocab.vectors]
+  + lexemes = [nlp.vocab[word] for word in external_word_list]
+  ```
+
+- [`Vectors.most_similar`](/api/vectors#most_similar) is not supported because
+  there's no fixed list of vectors to compare your vectors to.
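+
+Conversely, because floret builds vectors from subwords, you can look up a
+vector for any string, including words that never occurred in the training
+data. A minimal sketch (the pipeline name and example word are illustrative):
+
+```python
+import spacy
+
+# A floret-based pipeline, e.g. the Finnish md package
+nlp = spacy.load("fi_core_news_md")
+# Subword hashes produce a vector even for unseen words
+print(nlp.vocab["esimerkkisana"].vector.shape)
+```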
+
 ### Transformer pipeline design {#design-trf}
 
 In the transformer (`trf`) models, the `tagger`, `parser` and `ner` (if present)
@@ -133,10 +162,14 @@ nlp = spacy.load("en_core_web_trf", disable=["tagger", "attribute_ruler", "lemma
 <Infobox variant="warning" title="Rule-based and POS-lookup lemmatizers require
 Token.pos">
 
-The lemmatizer depends on `tagger`+`attribute_ruler` or `morphologizer` for
-Catalan, Dutch, English, French, Greek, Italian, Macedonian, Norwegian, Polish
-and Spanish. If you disable any of these components, you'll see lemmatizer
-warnings unless the lemmatizer is also disabled.
+The lemmatizer depends on `tagger`+`attribute_ruler` or `morphologizer` for a
+number of languages. If you disable any of these components, you'll see
+lemmatizer warnings unless the lemmatizer is also disabled.
+
+**v3.3**: Catalan, English, French, Russian and Spanish
+
+**v3.0-v3.2**: Catalan, Dutch, English, French, Greek, Italian, Macedonian,
+Norwegian, Polish, Russian and Spanish
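+
+As a sketch (using the CNN/CPU package names from above), you can avoid the
+warnings by disabling the lemmatizer together with the components it depends
+on:
+
+```python
+import spacy
+
+# Disable the lemmatizer along with its dependencies
+nlp = spacy.load("en_core_web_sm", disable=["tagger", "attribute_ruler", "lemmatizer"])
+```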
 
 </Infobox>
 
@@ -154,10 +187,34 @@ nlp.enable_pipe("senter")
 The `senter` component is ~10× faster than the parser and more accurate
 than the rule-based `sentencizer`.
 
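+For instance (package name is illustrative), you can exclude the parser and
+rely on the `senter` for sentence boundaries:
+
+```python
+import spacy
+
+# Load without the parser and enable the senter, which is disabled by default
+nlp = spacy.load("en_core_web_sm", exclude=["parser"])
+nlp.enable_pipe("senter")
+doc = nlp("This is a sentence. This is another one.")
+print([sent.text for sent in doc.sents])
+```
+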
+#### Switch from trainable lemmatizer to default lemmatizer
+
+Since v3.3, a number of pipelines use a trainable lemmatizer. You can check
+whether the lemmatizer is trainable:
+
+```python
+nlp = spacy.load("de_core_news_sm")
+assert nlp.get_pipe("lemmatizer").is_trainable
+```
+
+If you'd like to switch to a non-trainable lemmatizer that's similar to v3.2 or
+earlier, you can replace the trainable lemmatizer with the default non-trainable
+lemmatizer:
+
+```python
+# Requirements: pip install spacy-lookups-data
+nlp = spacy.load("de_core_news_sm")
+# Remove the existing trainable lemmatizer
+nlp.remove_pipe("lemmatizer")
+# Add the non-trainable lemmatizer from the language defaults
+# and load its lemmatizer tables from spacy-lookups-data
+nlp.add_pipe("lemmatizer").initialize()
+```
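+
+After the swap, the pipeline uses the language's default non-trainable
+lemmatizer, which you can verify with the same check as above:
+
+```python
+assert not nlp.get_pipe("lemmatizer").is_trainable
+```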
+
 #### Switch from rule-based to lookup lemmatization
 
 For the Dutch, English, French, Greek, Macedonian, Norwegian and Spanish
-pipelines, you can switch from the default rule-based lemmatizer to a lookup
+pipelines, you can swap out a trainable or rule-based lemmatizer for a lookup
 lemmatizer:
 
 ```python