Basic Text Categorization fails in 3.2 #9732

vrunhofen · 2021-11-22T16:25:40Z

A very basic test for Text Categorization fails.

import spacy
nlp = spacy.load("en_core_web_md")
doc = nlp("This is a sentence.")
textcat = nlp.add_pipe("textcat")
processed = textcat(doc)

I get the following error:

Traceback (most recent call last):                                                                                                                                                                      
  File "<stdin>", line 1, in <module>                                                                                                                                                                   
  File "spacy/pipeline/trainable_pipe.pyx", line 56, in spacy.pipeline.trainable_pipe.TrainablePipe.__call__                                                                                            
  File "/usr/local/lib/python3.9/site-packages/spacy/util.py", line 1618, in raise_error                                                                                                                
    raise e                                                                                                                                                                                             
  File "spacy/pipeline/trainable_pipe.pyx", line 52, in spacy.pipeline.trainable_pipe.TrainablePipe.__call__                                                                                            
  File "/usr/local/lib/python3.9/site-packages/spacy/pipeline/textcat.py", line 191, in predict                                                                                                         
    scores = self.model.predict(docs)                                                                                                                                                                   
  File "/usr/local/lib/python3.9/site-packages/thinc/model.py", line 315, in predict                                                                                                                    
    return self._func(self, X, is_train=False)[0]                                                                                                                                                       
  File "/usr/local/lib/python3.9/site-packages/thinc/layers/chain.py", line 54, in forward                                                                                                              
    Y, inc_layer_grad = layer(X, is_train=is_train)                                                                                                                                                     
  File "/usr/local/lib/python3.9/site-packages/thinc/model.py", line 291, in __call__                                                                                                                   
    return self._func(self, X, is_train=is_train)                                                                                                                                                       
  File "/usr/local/lib/python3.9/site-packages/thinc/layers/concatenate.py", line 44, in forward                                                                                                        
    Ys, callbacks = zip(*[layer(X, is_train=is_train) for layer in model.layers])                                                                                                                       
  File "/usr/local/lib/python3.9/site-packages/thinc/layers/concatenate.py", line 44, in <listcomp>                                                                                                     
    Ys, callbacks = zip(*[layer(X, is_train=is_train) for layer in model.layers])                                                                                                                       
  File "/usr/local/lib/python3.9/site-packages/thinc/model.py", line 291, in __call__                                                                                                                   
    return self._func(self, X, is_train=is_train)                                                                                                                                                       
  File "/usr/local/lib/python3.9/site-packages/thinc/layers/chain.py", line 54, in forward                                                                                                              
    Y, inc_layer_grad = layer(X, is_train=is_train)                                                                                                                                                     
  File "/usr/local/lib/python3.9/site-packages/thinc/model.py", line 291, in __call__                                                                                                                   
    return self._func(self, X, is_train=is_train)                                                                                                                                                       
  File "/usr/local/lib/python3.9/site-packages/thinc/layers/with_cpu.py", line 24, in forward                                                                                                           
    cpu_outputs, backprop = model.layers[0].begin_update(_to_cpu(X))                                                                                                                                    
  File "/usr/local/lib/python3.9/site-packages/thinc/model.py", line 309, in begin_update                                                                                                               
    return self._func(self, X, is_train=True)                                                                                                                                                           
  File "/usr/local/lib/python3.9/site-packages/thinc/layers/chain.py", line 54, in forward                                                                                                              
    Y, inc_layer_grad = layer(X, is_train=is_train)                                                                                                                                                     
  File "/usr/local/lib/python3.9/site-packages/thinc/model.py", line 291, in __call__                                                                                                                   
    return self._func(self, X, is_train=is_train)                                                                                                                                                       
  File "/usr/local/lib/python3.9/site-packages/thinc/layers/resizable.py", line 27, in forward                                                                                                          
    Y, callback = layer(X, is_train=is_train)                                                                                                                                                           
  File "/usr/local/lib/python3.9/site-packages/thinc/model.py", line 291, in __call__                                                                                                                   
    return self._func(self, X, is_train=is_train)                                                                                                                                                       
  File "thinc/layers/sparselinear.pyx", line 44, in thinc.layers.sparselinear.forward                                                                                                                   
  File "thinc/layers/sparselinear.pyx", line 69, in thinc.layers.sparselinear._begin_cpu_update                                                                                                         
  File "/usr/local/lib/python3.9/site-packages/thinc/model.py", line 175, in get_dim                                                                                                                    
    raise ValueError(err)                                                                                                                                                                               
ValueError: Cannot get dimension 'nO' for model 'sparse_linear': value unset

I also tried en_core_web_sm with the following code, derived from the API docs, and I still get the same error.

nlp = spacy.load("en_core_web_sm")
from spacy.pipeline.textcat import DEFAULT_SINGLE_TEXTCAT_MODEL
config = {
   "threshold": 0.5,
   "model": DEFAULT_SINGLE_TEXTCAT_MODEL,
}
nlp.add_pipe("textcat", config=config)
nlp("spacy.io usage spacy-101")

Info about spaCy

spaCy version: 3.2.0
Platform: macOS-10.15.7-x86_64-i386-64bit
Python version: 3.9.5
Pipelines: en_core_web_md (3.2.0), en_core_web_sm (3.2.0)

Answered by adrianeboyd

adrianeboyd · 2021-11-23T07:21:27Z

The textcat component that you've just added this way hasn't been initialized (it doesn't even know which labels it's supposed to predict) and it hasn't been trained. That is why the model's output dimension 'nO' is still unset in the error above. I'd suggest having a look at the training docs (https://spacy.io/usage/spacy-101#training, https://spacy.io/usage/training) and trying out a textcat demo project (spacy project clone pipelines/textcat_demo, https://spacy.io/usage/projects).
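
To make the point about initialization concrete, here is a minimal sketch of adding labels, initializing, and running a few updates in code. It assumes a blank English pipeline rather than en_core_web_md (calling nlp.initialize() on a loaded pipeline would reset its trained components), and the two label names and four sentences are made up purely for illustration. It is not a substitute for the config-based training workflow linked above.

```python
import spacy
from spacy.training import Example

# Assumption: a blank pipeline, so initialize() has nothing trained to overwrite.
nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")

# The component must know its labels before it can predict anything.
# These label names are illustrative only.
textcat.add_label("DOCUMENTATION")
textcat.add_label("OTHER")

# Hypothetical toy data; real training needs far more examples.
train_data = [
    ("Where are the API docs for the tokenizer?", {"cats": {"DOCUMENTATION": 1.0, "OTHER": 0.0}}),
    ("The usage guide has a broken link.", {"cats": {"DOCUMENTATION": 1.0, "OTHER": 0.0}}),
    ("Training crashes with a memory error.", {"cats": {"DOCUMENTATION": 0.0, "OTHER": 1.0}}),
    ("The parser is slow on long texts.", {"cats": {"DOCUMENTATION": 0.0, "OTHER": 1.0}}),
]
examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in train_data]

# initialize() sets the model dimensions (including nO) and returns an optimizer.
optimizer = nlp.initialize(lambda: examples)

losses = {}
for _ in range(10):
    nlp.update(examples, sgd=optimizer, losses=losses)

doc = nlp("This is a sentence.")
print(doc.cats)  # one score per label
```

With only four sentences the scores are not meaningful; the point is that the call no longer raises once the labels are set and the component has been initialized.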

3 replies

vrunhofen Nov 28, 2021
Author

Thank you Adriane.
I successfully trained a textcat_demo project from here.

Here's some of the output:

=========================== Initializing pipeline ===========================
[2021-11-28 01:37:23,672] [INFO] Set up nlp object from config
[2021-11-28 01:37:23,709] [INFO] Pipeline: ['textcat']
[2021-11-28 01:37:23,725] [INFO] Created vocabulary
[2021-11-28 01:37:23,729] [INFO] Finished initializing nlp object
[2021-11-28 01:37:46,935] [INFO] Initialized pipeline components: ['textcat']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['textcat']
ℹ Initial learn rate: 0.001
E    #       LOSS TEXTCAT  CATS_SCORE  SCORE 
---  ------  ------------  ----------  ------
  0       0          0.50       41.18    0.41
  0     100         20.82       60.32    0.60
  0     200         13.18       84.78    0.85
  1     300         12.70       85.61    0.86
  1     400          6.05       85.29    0.85
  2     500          5.23       86.72    0.87
  2     600          2.69       85.05    0.85
  3     700          4.10       87.00    0.87
  3     800          0.08       84.45    0.84
  4     900          0.02       87.79    0.88
  4    1000          0.00       86.93    0.87
✔ Saved pipeline to output directory
training/model-last

================================== evaluate ==================================
Running command: /usr/local/opt/python@3.9/bin/python3.9 -m spacy evaluate training/model-best corpus/dev.spacy --output training/metrics.json
ℹ Using CPU

================================== Results ==================================

TOK                 100.00
TEXTCAT (macro F)   87.79 
SPEED               13478 


=========================== Textcat F (per label) ===========================

                    P       R       F
DOCUMENTATION   83.64   83.64   83.64
OTHER           91.94   91.94   91.94


======================== Textcat ROC AUC (per label) ========================

                ROC AUC
DOCUMENTATION      0.94
OTHER              0.94

✔ Saved results to training/metrics.json

Now I'm trying to load the model and test it with some example sentences, but I can't seem to get anywhere. Is this the right way to go about it?

nlp = spacy.load("/Users/vrun/Desktop/textcat_demo/training/model-best")
nlp("what does this eval to?")

But that just echoes back the input sentence.
How do I check the textcat results?

polm Nov 29, 2021

> But that just echoes back the input sentence.
> How do I check the textcat results?

The categories are saved in doc.cats. You can use code like this to see them:

doc = nlp("this is some text")
print(doc.cats)

When you pass text to the nlp object, it always returns a Doc, regardless of which components are in your pipeline. What changes is which attributes the components assign on that Doc; textcat fills in doc.cats.
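
As a small illustration of that last point, the sketch below uses a blank pipeline and a hand-written component in place of a trained textcat. The component name "toy_cats", its labels, and its question-mark rule are all hypothetical and exist only to show that the returned object is a Doc either way, and that doc.cats is a plain dict you can inspect.

```python
import spacy
from spacy.language import Language


# Hypothetical stand-in for a trained textcat: it only checks for a question mark.
@Language.component("toy_cats")
def toy_cats(doc):
    is_question = doc.text.endswith("?")
    doc.cats = {
        "QUESTION": 1.0 if is_question else 0.0,
        "STATEMENT": 0.0 if is_question else 1.0,
    }
    return doc


nlp = spacy.blank("en")

# Without a categorizer, nlp() still returns a Doc, but doc.cats is empty.
doc_before = nlp("what does this eval to?")
print(type(doc_before).__name__, doc_before.cats)  # Doc {}

# With the component added, the same call returns a Doc with doc.cats filled in.
nlp.add_pipe("toy_cats")
doc_after = nlp("what does this eval to?")
print(doc_after.cats)  # {'QUESTION': 1.0, 'STATEMENT': 0.0}

# doc.cats maps label -> score, so the top label is just the max by value.
best = max(doc_after.cats, key=doc_after.cats.get)
print(best)  # QUESTION
```

The same max(...) lookup works on the doc.cats of a pipeline loaded from training/model-best.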

vrunhofen Nov 29, 2021
Author

Got it. Thanks!
