How to pickle a ".spacy" dataset #10691
-
Hello everyone. I am currently implementing a Vertex AI Pipeline to train a spaCy model. I need to pass the "train.spacy" and "dev.spacy" datasets from a "Preprocess" component to a "Train" component (this is only general info; it does not matter much if you're not familiar with Vertex AI Pipelines). However, it seems the only way this can be achieved is by converting those ".spacy" files into pickle. I have seen a few similar approaches discussed, but nothing that solves this exactly. Do you have any idea how to do that? Thank you.
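For what it's worth, if a pickle really were the only artifact type available, one hedged workaround (not confirmed in this thread) would be to pickle the DocBin's byte payload rather than the file itself, since `DocBin.to_bytes()` / `DocBin.from_bytes()` round-trip cleanly. A minimal sketch; the filenames are placeholders:

```python
import pickle

import spacy
from spacy.tokens import DocBin

# A DocBin serializes to bytes, and bytes pickle trivially.
db = DocBin().from_disk("train.spacy")  # load the existing dataset
with open("train.pkl", "wb") as f:
    pickle.dump(db.to_bytes(), f)       # pickle the byte payload

# Later, in the consuming component:
with open("train.pkl", "rb") as f:
    db = DocBin().from_bytes(pickle.load(f))
docs = list(db.get_docs(spacy.blank("en").vocab))
```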
-
It surprises me that you could only use pickle files as artefacts in a tool like Google Vertex. Are you sure you're not simply able to store the `.spacy` file directly?
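A minimal sketch of that idea, not taken from this thread: assuming the `google-cloud-storage` client library and placeholder bucket/object names, the raw `.spacy` file can be shipped between components as-is, with no pickling involved:

```python
from google.cloud import storage

# Upload the .spacy file produced by the Preprocess component unchanged;
# bucket and object names below are placeholders.
client = storage.Client()
bucket = client.bucket("your_bucket_name")
bucket.blob("your_path/train.spacy").upload_from_filename("train.spacy")

# The Train component downloads it back, again with no pickling.
bucket.blob("your_path/train.spacy").download_to_filename("train.spacy")
```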
-
Hello everyone,

I have found a workaround for my use case, and as forecasted in the discussion thread (by both @koaning and me), there is no need to pickle a `.spacy` file in order to use it as the input/output of a component inside a Vertex AI Pipeline. If the project you are currently using is the same one used to access both Vertex AI and Cloud Storage (as in my case), you can access ANY GCS bucket from Vertex AI by adding `gcs/` first in your search path, as explained here.

I will include a very simple example of how to do it, borrowing the code from this post on Stack Overflow:

```python
import spacy
from spacy.training import Example
from spacy.tokens import DocBin

# Toy training data: (text, annotations) pairs with character-offset entity spans
td = [
    ["Who is Shaka Khan?", {"entities": [(7, 17, "FRIENDS")]}],
    ["I like London.", {"entities": [(7, 13, "LOC")]}],
]

nlp = spacy.blank("en")
db = DocBin()
for text, annotations in td:
    example = Example.from_dict(nlp.make_doc(text), annotations)
    db.add(example.reference)

db.to_disk("gcs/your_bucket_name/your_path/td.spacy")  # <== THIS DOES THE TRICK
```

UPDATE: It is important to notice that, even when the previous implementation would allow you to store a `.spacy` file […]

Thank you.
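For completeness, a sketch of the read side in the Train component, assuming the same placeholder bucket path and the `gcs/` mount described above:

```python
import spacy
from spacy.tokens import DocBin

# Reading through the "gcs/" mount works symmetrically to the write above
nlp = spacy.blank("en")
db = DocBin().from_disk("gcs/your_bucket_name/your_path/td.spacy")
docs = list(db.get_docs(nlp.vocab))  # Doc objects ready for training
```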