Is vocab/strings.json essential for named entity recognition? #7794
Hi, I currently work on NER projects using spaCy in a healthcare context (so with sensitive data). However, we found that, after training, the model contains the file vocab/strings.json, which holds words seen during training and therefore sensitive data: model/ We were wondering whether this file is essential and whether it would be possible not to save it.
Replies: 1 comment
Sorry for the delayed response!
There are many general concerns about ML models leaking sensitive information, so in general we'd suggest trying to find an alternative to training directly on sensitive data.
That said, a technical answer:
For the built-in pipeline components, I think everything should keep working even if you remove all the strings from the string store.
The general setup between a pipeline component and the string store is that a pipeline component can expect strings that it added to the string store to be there in the future, so some components might break if their labels are removed from the string store. As far as I can tell all the built-in components do re-add their own labels if needed, but there's no requirement that a pipeline component do this.
If you use any vector similarity calculations with provided vectors, you may also run into problems if the strings for the vectors are removed from the string store.
You'd want to test this for your task, but in general I think it should typically be fine to remove everything except:
I ran some quick tests on
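For anyone who wants to try this, here is a minimal sketch of pruning the file on disk. It assumes that `vocab/strings.json` is a plain JSON list of strings, which is worth checking against your own saved model before relying on it. `prune_strings_file` and `keep` are hypothetical names for illustration, not part of spaCy, and which strings belong in `keep` (for example the labels your components rely on, or the keys of any vectors you use) is something you would have to decide and test yourself.

```python
import json
from pathlib import Path


def prune_strings_file(strings_path, keep):
    """Rewrite a strings.json file so it only contains strings in `keep`.

    Assumption: the file is a JSON list of strings.
    Returns the number of strings that were removed.
    """
    path = Path(strings_path)
    strings = json.loads(path.read_text(encoding="utf8"))
    if not isinstance(strings, list):
        raise ValueError("expected a JSON list of strings")

    # Keep the original order, drop everything not explicitly allowed.
    kept = [s for s in strings if s in keep]
    path.write_text(
        json.dumps(kept, ensure_ascii=False, indent=2), encoding="utf8"
    )
    return len(strings) - len(kept)
```

It is probably safest to run this on a copy of the model directory, then load the copy with `spacy.load` and rerun your evaluation to confirm that the predictions are unchanged.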