Serialization of doc #9529
-
Serialization loss

After serializing a spaCy doc with jsonpickle, the matcher has stopped working. How to reproduce the behaviour:
At this point `doc_bytes` and `vocab_bytes` will be sent in a JSON payload to another service.
There is no match:
UPDATE: `is_left_punct`/`is_right_punct` are not included in the attributes, which is causing the error. Is there any way to make sure certain attributes are saved? `lemma` and `lower` attributes in a pattern do not work either. Is this an issue with the vocab?

Your Environment
Replies: 2 comments
-
Hi @nicno90!
Can you try saving your dataset as a …
-
This line is the problem:

```python
vocab = Vocab().from_bytes(jsonpickle.decode(vocab_bytes))
```

This is not creating the same English vocab as from the "en" pipeline, but a vocab for an unspecified language with no defaults for things like `is_left_punct`. To restore the same vocab, you want:

```python
vocab = spacy.blank("en").vocab.from_bytes(...)  # or English().vocab ...
```

If you're using `en_core_web_trf` and not customizing any lexical features (cluster, norm, prob, and sentiment are the main ones that would be saved with the vocab), you don't really need to save the vocab at all. `spacy.blank("en").vocab` will have the same lexical attributes as `en_core_web_trf.vocab`. Do be aware that the lexical attributes can …
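The suggested fix can be sketched end to end as follows (jsonpickle is omitted here since it is orthogonal to the vocab problem; the `IS_LEFT_PUNCT` pattern is an illustrative assumption):

```python
# Sketch of the fix: start from an English vocab so language defaults
# (is_left_punct, lemma/lower getters, ...) are present on the receiver.
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = nlp("Hello (world)")
doc_bytes, vocab_bytes = doc.to_bytes(), nlp.vocab.to_bytes()

# Receiving side: build the vocab from the English language defaults
# before loading the serialized data, instead of using a bare Vocab().
vocab = spacy.blank("en").vocab.from_bytes(vocab_bytes)
doc2 = Doc(vocab).from_bytes(doc_bytes)

matcher = Matcher(vocab)
matcher.add("LEFT_PUNCT", [[{"IS_LEFT_PUNCT": True}]])
print(matcher(doc2))  # now matches the "(" token
```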