Is vocab/strings.json essential for named entity recognition? #7794

a-t-richard · 2021-04-15T09:37:38Z

Hi, I'm currently working on NER projects using spaCy in a healthcare context (so with sensitive data).

However, we found that, after training, the model directory contains the file vocab/strings.json, which holds the words seen during training and therefore sensitive data.

We were wondering whether this file is essential and whether it would be possible not to save it.

Answered by adrianeboyd

adrianeboyd · 2021-05-04T08:01:19Z

Sorry for the delayed response!

There are many general concerns about ML models leaking sensitive information, so in general we'd suggest trying to find an alternative to training directly on sensitive data.

That said, a technical answer:

For the built-in pipeline components, I think everything should keep working even if you remove all the strings from the string store.

The general setup between a pipeline component and the string store is that a pipeline component can expect strings that it added to the string store to be there in the future, so some components might break if their labels are removed from the string store. As far as I can tell all the built-in components do re-add their own labels if needed, but there's no requirement that a pipeline component do this.

If you use any vector similarity calculations with provided vectors, you may also run into problems if the strings for the vectors are removed from the string store.

You'd want to test this for your task, but I think it should typically be fine to remove everything except:

- pipeline component labels
- strings for vectors, mainly relevant if vectors are used directly (not just as features for statistical models)

I ran some quick tests on en_core_web_sm after removing all the strings and the basics seemed fine, but sometimes errors about missing strings only crop up once you start working with the annotations in detail.
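
For illustration, here is a minimal sketch of pruning the saved file while keeping only a chosen set of strings (for example, the pipeline component labels). It assumes that vocab/strings.json is a plain JSON list of strings, which you should verify for your spaCy version. The function name `prune_strings_file` and the path in the usage comment are hypothetical, not part of spaCy, and the pruned pipeline still needs to be tested on your task.

```python
import json
from pathlib import Path
from typing import Iterable


def prune_strings_file(strings_path: Path, keep: Iterable[str]) -> int:
    """Rewrite a strings.json file so it only contains strings listed in `keep`.

    Assumption: the file is a JSON list of strings.
    Returns the number of strings that were removed.
    """
    keep_set = set(keep)
    original = json.loads(strings_path.read_text(encoding="utf8"))
    # Preserve the original order; strings in `keep` that were never in the
    # file are not added here.
    kept = [s for s in original if s in keep_set]
    strings_path.write_text(
        json.dumps(kept, ensure_ascii=False, indent=2), encoding="utf8"
    )
    return len(original) - len(kept)


# Hypothetical usage on a saved pipeline directory:
# removed = prune_strings_file(
#     Path("my_model/vocab/strings.json"), keep=["PERSON", "ORG"]
# )
```

Working on a copy of the saved pipeline directory makes it easy to compare predictions before and after pruning.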
