spaCy NER model as PySpark UDF #11287

dave-espinosa · 2022-08-09T21:01:48Z

dave-espinosa
Aug 9, 2022

Hello everyone,

I was doing a quick implementation, where I was using a NER model trained in spaCy, to implement a PySpark udf, in which a listing of the extracted entities from a text is attempted. I am trying that approach, as I discovered a while ago that spaCy, even with the best-practices for speed-optimization applied, can become a considerable bottleneck, for handling "Big Data - like" requests. Back then, we kinda solved the issue by setting a lot of VMs to mine chunks of data, but we are looking for different ways to tackle the same problem, but faster this time.

I am thinking something about this, but I don't know:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.appName('SpacyOverPySpark').getOrCreate()

# Reading data
# HINT: This dataframe contains 1M rows of text, under a single column: "posting"
df = spark.read.csv("gs://hidden_bucket/1M_samples.csv", header=True)

# Example spaCy NER UDF
def get_skills(text:str) -> list:
    doc = nlp(txt)
    return [ent.text for ent in doc.ents]

# Building a User Defined Function (UDF)
getskillsUDF = udf(lambda z: get_skills(z), ArrayType(StringType()))

# "Applying" a UDF on a Spark Dataframe
df.withColumn("entities", getskillsUDF(col("posting")))

When running the previous code (from a Dataproc Workbench Jupyter Notebook), it fails with the following message: The kernel from GCS/ml_experiments/spacy_over_pyspark.ipynb appears to have died. It will restart automatically.; intuitively I know that I am getting this, because spaCy itself could be becoming a bottleneck inside PySpark. Needless to say, I know the errors and the possible explanations, but still, the implementation does not work.

Has anyone tried something similar? Any recommendation?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

spaCy NER model as PySpark UDF #11287

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 0 comments

Select a reply

Uh oh!

Uh oh!

spaCy NER model as PySpark UDF #11287

Uh oh!

Uh oh!

dave-espinosa Aug 9, 2022

Replies: 0 comments

dave-espinosa
Aug 9, 2022