# Evaluation

Euan Soutter edited this page Dec 5, 2025
- We want to quantify how well `idscrub` removes personally identifiable information (PII).
- Our aims are to:
- supplement our existing unit tests
- verify and supplement existing benchmarking of SpaCy models.
- `idscrub` has been evaluated using a dummy dataset.
- The dummy dataset contains 50 rows of text, each containing different types of personal data that should be scrubbed by `idscrub`.
- We compare the scrubbed PII with the PII that we know should be removed to generate an accuracy score (the % of PII removed by the package).
- This is disaggregated by PII-type. For example, the % of email addresses removed.
- This table shows the proportion of different PII successfully extracted from the dataset.
| Data Type | Proportion Found |
|---|---|
| 👤 Names | 96% |
| 🏷️ Titles | 100% |
| 📧 Email Addresses | 100% |
| 💬 Handles | 100% |
| 📞 Phone Numbers | 100% |
| 📮 Postcodes | 100% |
| 🌐 IP Addresses | 100% |
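The per-type scoring described above can be sketched as follows. This is a minimal illustration, not `idscrub`'s actual API: the dataset format, the `accuracy_by_type` helper and the toy scrubber are all hypothetical.

```python
# Sketch: score scrubbing accuracy per PII type.
# `dataset` pairs each input row with the PII values we know it contains;
# `scrub` is any callable returning scrubbed text. Both are stand-ins,
# not part of idscrub's real interface.
from collections import defaultdict

def accuracy_by_type(dataset, scrub):
    found = defaultdict(int)
    total = defaultdict(int)
    for text, known_pii in dataset:  # known_pii: list of (pii_type, value)
        scrubbed = scrub(text)
        for pii_type, value in known_pii:
            total[pii_type] += 1
            if value not in scrubbed:  # value gone -> counted as removed
                found[pii_type] += 1
    return {t: found[t] / total[t] for t in total}

# Toy example with a trivial scrubber that only removes one email address.
dataset = [
    ("Contact jo@example.com", [("email", "jo@example.com")]),
    ("Call 07700 900123", [("phone", "07700 900123")]),
]
scores = accuracy_by_type(dataset, lambda t: t.replace("jo@example.com", "[EMAIL]"))
```

The substring check is deliberately simple; the real evaluation only needs to know whether each known PII value survives scrubbing, disaggregated by type.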
- Regex methods are generally excellent at scrubbing the PII they were designed to remove, with 100% accuracy observed.
- Our default name-removal NER model (en_core_web_trf) has reduced accuracy compared to regex, but it is still extremely high by name-removal standards (96%), exceeding SpaCy's own benchmark value of 89.8% for NER with this model.
- Our evaluation does reveal that non-English-language names tend to be missed, along with names containing initials, e.g. John D.
- This should be taken into account when using `idscrub`, and mitigations put in place to address this if the dataset is highly sensitive.
- We should also note that the input data here was 'clean', i.e. the sentences were not broken or technical. Results may vary if data is in a more complex and fragmented format.
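One possible mitigation for the initials case is a supplementary regex pass run after the NER step. This is purely illustrative and not an `idscrub` feature; the pattern and helper name are assumptions.

```python
import re

# Illustrative fallback: catch "John D." style names that the NER model
# tends to miss. The pattern is deliberately narrow (capitalised word,
# space, single capital letter, optional dot) to limit false positives.
INITIAL_NAME = re.compile(r"\b[A-Z][a-z]+ [A-Z]\.?(?=\s|$)")

def scrub_initial_names(text, replacement="[NAME]"):
    return INITIAL_NAME.sub(replacement, text)

result = scrub_initial_names("Reviewed by John D. yesterday")
```

A narrow pattern like this will still occasionally hit non-name text (e.g. sentence-initial words before an initialism), so it is best applied only when the dataset is sensitive enough to justify the extra false positives.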
- We want to quantify how well `idscrub` removes names of different types (first names, surnames, names in different languages).
- Our aims are to:
- supplement our existing unit tests
- verify and supplement existing benchmarking of SpaCy models.
- `idscrub` name removal has been evaluated using fake names from Faker.
- We sampled unique names in different languages and from different countries (such as English-language British and Indian names, and Hungarian and Polish names).
- We run the names through `idscrub.remove_spacy_persons()` and measure the percentage of names removed.
- We pass the names with no context and in lowercase, as this represents the most difficult input for transformer-based NER models.
- This is disaggregated by first name, second name, full name and language/location origin of the name.
- Across the whole dataset:
- full names are scrubbed ~94% of the time
- first names are scrubbed ~78% of the time
- second names are scrubbed ~70% of the time.
- This varies considerably by the linguistic source or language of the name, with Hungarian, British and Spanish surnames having the lowest scrub rate (< 60%).
- In terms of full names, all languages have scrub rates of > 90%, except French and Spanish names.
- These results show that name scrubbing struggles without context and when only the first name or second name is present; however, a complete lack of context is unlikely in 'real' text.
- This reiterates the importance of maintaining semantic meaning (e.g. 'I am [NAME]') and case (case-typical names, e.g. John Smith) in the text you pass to `idscrub`, as such names are identified much more readily by transformer-based NER models.
- These results also show that name origin matters, but not in an obviously systematic way, with Nigerian and Croatian full names identified more readily than Spanish and French full names.
- British surnames are surprisingly poorly identified owing to their overlap with non-name words, such as 'hill', 'wood' and 'west'. A similar phenomenon occurs with Nigerian (English) names, such as 'faith', 'peace' and 'blessing'.
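The disaggregation by name part and locale described above can be sketched as follows. This is a minimal sketch: the record format and locale codes are assumptions, and the real evaluation feeds Faker-generated names through `idscrub.remove_spacy_persons()` rather than supplying pre-scored records.

```python
# Sketch: aggregate scrub results by (locale, name part).
# Each record says whether one lowercased, context-free name was removed.
from collections import defaultdict

def scrub_rates(results):
    """results: iterable of (locale, name_part, was_scrubbed) records,
    e.g. ("hu_HU", "surname", False). Returns scrub rate per group."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for locale, part, scrubbed in results:
        key = (locale, part)
        totals[key] += 1
        hits[key] += int(scrubbed)
    return {k: hits[k] / totals[k] for k in totals}

# Toy records mimicking the evaluation's shape (values are illustrative).
records = [
    ("en_GB", "surname", False),  # e.g. 'hill' read as a common noun
    ("en_GB", "surname", True),
    ("hr_HR", "full_name", True),
]
rates = scrub_rates(records)
```

Grouping on a `(locale, name_part)` key is what allows the per-language and first-name/second-name/full-name breakdowns reported above to come out of a single pass over the results.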