Evaluation

Euan Soutter edited this page Dec 5, 2025 · 1 revision

Evaluation of idscrub

1. Evaluation using dummy dataset

Rationale

  • We want to quantify how well idscrub removes personally identifiable information (PII).
  • Our aim is to measure scrubbing accuracy, both overall and disaggregated by PII type.

Data

  • idscrub has been evaluated using a dummy dataset.
  • The dummy dataset contains 50 rows of text, each containing different types of personal data that idscrub should scrub.

Methodology

  • We compare the scrubbed PII with the PII that we know should be removed to generate an accuracy score (the percentage of known PII removed by the package).
  • This is disaggregated by PII-type. For example, the % of email addresses removed.
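The accuracy calculation above can be sketched as follows. The data here is illustrative only, not the real 50-row dummy dataset:

```python
# Sketch of the per-type accuracy score: the % of known PII that the
# scrubber removed, disaggregated by PII type. Illustrative data only.
def accuracy_by_type(expected, removed):
    """Percentage of expected PII items removed, per PII type."""
    return {
        pii_type: 100 * len(items & removed.get(pii_type, set())) / len(items)
        for pii_type, items in expected.items()
    }

expected = {
    "email": {"jane@example.com", "bob@example.org"},
    "postcode": {"SW1A 1AA"},
}
removed = {
    "email": {"jane@example.com", "bob@example.org"},
    "postcode": set(),
}
scores = accuracy_by_type(expected, removed)
# scores == {"email": 100.0, "postcode": 0.0}
```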

Results

  • This table shows the proportion of each PII type successfully identified and removed from the dataset.

| Data Type | Proportion Found |
| --- | --- |
| 👤 Names | 96% |
| 🏷️ Titles | 100% |
| 📧 Email Addresses | 100% |
| 💬 Handles | 100% |
| 📞 Phone Numbers | 100% |
| 📮 Postcodes | 100% |
| 🌐 IP Addresses | 100% |

Key points

  • Regex methods are generally excellent at scrubbing the PII they were designed to remove, with 100% accuracy observed.
  • Our default name-removal NER model (en_core_web_trf) has reduced accuracy compared to regex, but it is still extremely high by name-removal standards (96%), exceeding spaCy's own benchmark value of 89.8% for NER with this model.
  • Our evaluation does reveal that non-English names tend to be missed, along with names containing initials (e.g. John D.).
  • This should be taken into account when using idscrub, and mitigations put in place to address this if the dataset is highly sensitive.
  • We should also note that the input data here was 'clean', i.e. the sentences were not broken or technical. Results may vary if data is in a more complex and fragmented format.
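To illustrate why the regex-based scrubbers reach 100% on well-formed input, here is a simplified email pattern of the kind such a scrubber might use. This is a sketch for illustration only; idscrub's actual patterns may differ:

```python
import re

# Simplified email pattern for illustration; idscrub's actual
# patterns are not reproduced here and may differ.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

text = "Contact jane.doe@example.com for details."
scrubbed = EMAIL_RE.sub("[EMAIL]", text)
# scrubbed == "Contact [EMAIL] for details."
```

Because a pattern like this matches deterministically, it removes every instance of the structure it was designed for, which is why regex scrubbing scores 100% on clean input but cannot generalise beyond its pattern the way an NER model can.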

2. Evaluation of name scrubbing

Rationale

  • We want to quantify how well idscrub removes names of different types (first names, surnames, names in different languages).
  • Our aim is to measure name-removal accuracy disaggregated by name type (first name, surname, full name) and by the language or country of origin of the name.

Data

  • idscrub name removal has been evaluated using fake names generated with Faker.
  • We sampled unique names from different languages and countries (such as English-language British and Indian names, and Hungarian and Polish names).

Methodology

  • We run the names through idscrub.remove_spacy_persons() and measure the percent of names removed.
  • We pass the names with no context and in lowercase, as this represents the most difficult input for transformer-based NER models.
  • This is disaggregated by first name, second name, full name and language/location origin of the name.
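The measurement loop above can be sketched as follows. Here `scrub` is a stub standing in for `idscrub.remove_spacy_persons()` so that the example is self-contained; the stub pretends the NER model catches only one of the two names:

```python
# Sketch of the name-removal measurement. `scrub` is a stub standing
# in for idscrub.remove_spacy_persons(); it pretends the NER model
# catches only one of the two lowercase, context-free names.
def scrub(text):
    return text.replace("smith", "[NAME]")

names = ["smith", "kovacs"]  # passed lowercase, with no surrounding context
removed = [name for name in names if name not in scrub(name)]
scrub_rate = 100 * len(removed) / len(names)
# scrub_rate == 50.0
```

A name counts as scrubbed only if it no longer appears in the output, which is the same pass/fail criterion applied per first name, surname, full name and name origin in the results below.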

Results

  • Across the whole dataset:
    • full names are scrubbed ~94% of the time
    • first names are scrubbed ~78% of the time
    • second names are scrubbed ~70% of the time.
  • This varies considerably by the linguistic source or language of the name, with Hungarian, British and Spanish surnames having the lowest scrub rate (< 60%).
  • In terms of full names, all languages have scrub rates of > 90%, except French and Spanish names.

Key points

  • These results show that name scrubbing struggles without context and when only a first name or surname is present; however, a complete lack of context is unlikely in 'real' text.
  • This reiterates the importance of maintaining semantic meaning (e.g. "I am [NAME]") and case (case-typical names, e.g. John Smith) in the text you pass to idscrub, as such names are identified much more readily by transformer-based NER models.
  • These results also show that name origin matters, but not in an obviously systematic way, with Nigerian and Croatian full names identified more readily than Spanish and French full names.
  • British surnames are surprisingly poorly identified owing to their overlap with common non-name words, such as 'hill', 'wood' and 'west'. A similar phenomenon occurs with Nigerian (English) names such as 'faith', 'peace' and 'blessing'.