spaCy NER model as PySpark UDF #11287
dave-espinosa started this conversation in Help: Best practices
Replies: 0 comments
Hello everyone,
I am working on a quick implementation that wraps a NER model trained in spaCy as a PySpark UDF, which should return the list of entities extracted from a text. I am trying this approach because I discovered a while ago that spaCy, even with the best practices for speed optimization applied, can become a considerable bottleneck when handling Big-Data-scale requests. Back then we worked around the issue by spinning up a lot of VMs to process chunks of data in parallel, but now we are looking for a different, faster way to tackle the same problem.
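For context, the speed best practices I mean are along the lines of streaming texts through `nlp.pipe` in batches instead of calling the pipeline once per row. A minimal sketch (using `spacy.blank("en")` as a stand-in for the trained NER model, which would normally be loaded with `spacy.load`):

```python
import spacy

# A blank English pipeline stands in for the trained NER model here;
# in practice you would do: nlp = spacy.load("path/to/your_ner_model")
nlp = spacy.blank("en")

texts = ["Alice works at Acme Corp."] * 1000

# Best practice: stream texts through nlp.pipe in batches rather than
# calling nlp(text) once per row -- batching is significantly faster.
docs = list(nlp.pipe(texts, batch_size=256))

# Collect (text, label) pairs per document; empty here because the
# blank pipeline has no NER component.
entities = [[(ent.text, ent.label_) for ent in doc.ents] for doc in docs]
```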
What I have in mind is roughly the following, though I am not sure it is the right approach:
When running the previous code (from a Dataproc Workbench Jupyter Notebook), it fails with the following message:
The kernel from GCS/ml_experiments/spacy_over_pyspark.ipynb appears to have died. It will restart automatically.
Intuitively, I suspect I am getting this because spaCy itself could be becoming a bottleneck inside PySpark. Needless to say, I know the usual explanations for this error, but the implementation still does not work. Has anyone tried something similar? Any recommendation?