Caused by: org.apache.spark.api.python.PythonException: 'ValueError: [E109] Component 'tok2vec' could not be run. Did you forget to call `initialize()`?'. #11847

rafaeloliveirabd · 2022-11-21T17:53:33Z

How to reproduce the behaviour

I'm trying to lemmatize text in several languages in a dataframe on Azure Databricks. However, when I run the lemmatization like this:

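from pyspark.sql import SparkSession
import pyspark.pandas as ps  # assumption: the post does not show what `ps` refers to

spark = SparkSession.builder.appName('text-cleaning').getOrCreate()
sql_selection = spark.read.table("hive_metastore.default.table")
df_final = ps.DataFrame(sql_selection)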

... (I call a function that cleans the text here and the next line runs) ...

out['clean_text'] = out['clean_text'].apply(lambda row: " ".join([w.lemma_.lower() for w in nlp(row)]))

I keep getting this error (in an Azure Databricks notebook):
Caused by: org.apache.spark.api.python.PythonException: 'ValueError: [E109] Component 'tok2vec' could not be run. Did you forget to call initialize()?'.

What does this mean? I can't even access the lemmatized text with dataframe.iloc[2,:], for example (it throws the same error).

Also, I'm downloading the language model in the same notebook like this:

%sh python -m spacy download sv_core_news_lg

Is this a spaCy issue, or am I failing to load something on Azure Databricks?

Your Environment

Operating System: Windows 11
Python Version Used: 3.8.10
spaCy Version Used: latest version
Environment Information: Azure Databricks, 10.4 runtime

Answered by rafaeloliveirabd

Nov 22, 2022

It worked after I installed the spaCy language models from their .whl files on Azure Databricks.

polm · 2022-11-22T06:01:44Z

Normally I would expect to see that error if you were using an uninitialized tok2vec, like one that hadn't been trained. How are you creating your nlp object?

Also, I'm not familiar with databricks. Is your code running in a Windows environment there, or is that your local dev environment? Probably not relevant either way, just wanted to check.

1 reply

rafaeloliveirabd Nov 22, 2022
Author

> Normally I would expect to see that error if you were using an uninitialized tok2vec, like one that hadn't been trained. How are you creating your nlp object?
>
> Also, I'm not familiar with databricks. Is your code running in a Windows environment there, or is that your local dev environment? Probably not relevant either way, just wanted to check.

Hey, the code runs in the cloud, not locally. I wonder whether spaCy itself, or the language model being used (Swedish, for example), is not being installed properly on Azure Databricks.

I don't have this problem locally, only in the cloud. Locally I also never had to initialize tok2vec, for example, to get lemmatization working.

I'm initializing it like this:

import spacy as sp  # assumption: the alias `sp` is not shown in the original post

nlp = sp.load(f'{language}_core_web_lg')
# Save the default spaCy stop words for the loaded language
spacy_stopwords = nlp.Defaults.stop_words
# Remove all stop words, which resets spaCy's stop word set
nlp.Defaults.stop_words -= set(spacy_stopwords)
# Add our own English stop words (`en_stop_words` is defined elsewhere)
nlp.Defaults.stop_words |= en_stop_words
stopwords = nlp.Defaults.stop_words
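
Because the error shows up only on the cluster, one thing worth ruling out is whether the model package is importable by the Python process that actually runs the code. spaCy pipelines are installed as ordinary Python packages, so a standard-library check is enough. The following is a minimal sketch, not something posted in the thread; the helper name `model_is_installed` is hypothetical, and `sv_core_news_lg` is the model name from the question.

```python
import importlib.util
import sys


def model_is_installed(package_name: str) -> bool:
    """Return True if `package_name` can be imported by this Python process."""
    # find_spec returns None when no top-level package with that name is found.
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # Model name taken from the question; swap in the model you expect to load.
    name = "sv_core_news_lg"
    print(sys.executable, name, model_is_installed(name))
```

Running the same check both in a notebook cell and inside the function passed to `apply` would show whether the two see the same set of installed packages.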

rafaeloliveirabd · 2022-11-22T13:39:58Z

It worked after I installed the spaCy language models from their .whl files on Azure Databricks.
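
The thread does not say where the .whl came from or how it was attached to the cluster. As an illustration only: spaCy publishes its pipeline wheels as GitHub release assets, and the sketch below builds such a URL. The helper name `model_wheel_url` and the version `3.4.0` are assumptions; pick the version that matches the installed spaCy release.

```python
def model_wheel_url(name: str, version: str) -> str:
    """Build the release URL of a spaCy pipeline wheel (assumed URL layout)."""
    tag = f"{name}-{version}"
    return (
        "https://github.com/explosion/spacy-models/releases/download/"
        f"{tag}/{tag}-py3-none-any.whl"
    )


if __name__ == "__main__":
    # Version is an assumption for illustration; it is not given in the thread.
    print(model_wheel_url("sv_core_news_lg", "3.4.0"))
```

The resulting URL, or a downloaded copy of the file, can be passed to `pip install`, or added as a library on the cluster so that the package is available on every node rather than only where the notebook shell command ran.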
