Evaluation

Euan Soutter edited this page Dec 5, 2025 · 1 revision

Evaluation of idscrub

1. Evaluation using dummy dataset

Rationale

  • We want to quantify how well idscrub removes personally identifiable information (PII).
  • Our aim is to measure scrubbing accuracy, both overall and disaggregated by PII type.

Data

  • idscrub has been evaluated using a dummy dataset.
  • The dummy dataset contains 50 rows of text, each containing different types of personal data that idscrub should scrub.

Methodology

  • We compare the scrubbed PII with the PII that we know should be removed to generate an accuracy score (the percentage of known PII removed by the package).
  • This is disaggregated by PII-type. For example, the % of email addresses removed.
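The accuracy calculation above can be sketched as follows. The data here is illustrative only, not the real 50-row dummy dataset:

```python
# Sketch of the per-type accuracy score: the % of known PII that the
# scrubber removed, disaggregated by PII type. Illustrative data only.
def accuracy_by_type(expected, removed):
    """Percentage of expected PII items removed, per PII type."""
    return {
        pii_type: 100 * len(items & removed.get(pii_type, set())) / len(items)
        for pii_type, items in expected.items()
    }

expected = {
    "email": {"jane@example.com", "bob@example.org"},
    "postcode": {"SW1A 1AA"},
}
removed = {
    "email": {"jane@example.com", "bob@example.org"},
    "postcode": set(),
}
scores = accuracy_by_type(expected, removed)
# scores == {"email": 100.0, "postcode": 0.0}
```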

Results

  • This table shows the proportion of each PII type successfully identified and removed from the dataset.

| Data Type | Proportion Found |
| --- | --- |
| 👤 Names | 96% |
| 🏷️ Titles | 100% |
| 📧 Email Addresses | 100% |
| 💬 Handles | 100% |
| 📞 Phone Numbers | 100% |
| 📮 Postcodes | 100% |
| 🌐 IP Addresses | 100% |

Key points

  • Regex methods are generally excellent at scrubbing the PII they were designed to remove, with 100% accuracy observed.
  • Our default name-removal NER model (en_core_web_trf) has reduced accuracy compared to regex, but it is still extremely high by name-removal standards (96%), exceeding spaCy's own benchmark value of 89.8% for NER with this model.
  • Our evaluation does reveal that non-English names tend to be missed, along with names containing initials (e.g. John D.).
  • This should be taken into account when using idscrub, and mitigations put in place to address this if the dataset is highly sensitive.
  • We should also note that the input data here was 'clean', i.e. the sentences were not broken or technical. Results may vary if data is in a more complex and fragmented format.
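To illustrate why the regex-based scrubbers reach 100% on well-formed input, here is a simplified email pattern of the kind such a scrubber might use. This is a sketch for illustration only; idscrub's actual patterns may differ:

```python
import re

# Simplified email pattern for illustration; idscrub's actual
# patterns are not reproduced here and may differ.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

text = "Contact jane.doe@example.com for details."
scrubbed = EMAIL_RE.sub("[EMAIL]", text)
# scrubbed == "Contact [EMAIL] for details."
```

Because a pattern like this matches deterministically, it removes every instance of the structure it was designed for, which is why regex scrubbing scores 100% on clean input but cannot generalise beyond its pattern the way an NER model can.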

2. Evaluation of name scrubbing

Rationale

  • We want to quantify how well idscrub removes names of different types (first names, surnames, names in different languages).
  • Our aim is to measure name-removal accuracy disaggregated by name type (first name, surname, full name) and by the language or country of origin of the name.

Data

  • idscrub name removal has been evaluated using fake names generated with Faker.
  • We sampled unique names from different languages and countries (such as English-language British and Indian names, and Hungarian and Polish names).

Methodology

  • We run the names through idscrub.remove_spacy_persons() and measure the percent of names removed.
  • We pass the names with no context and in lowercase, as this represents the most difficult input for transformer-based NER models.
  • This is disaggregated by first name, second name, full name and language/location origin of the name.
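The measurement loop above can be sketched as follows. Here `scrub` is a stub standing in for `idscrub.remove_spacy_persons()` so that the example is self-contained; the stub pretends the NER model catches only one of the two names:

```python
# Sketch of the name-removal measurement. `scrub` is a stub standing
# in for idscrub.remove_spacy_persons(); it pretends the NER model
# catches only one of the two lowercase, context-free names.
def scrub(text):
    return text.replace("smith", "[NAME]")

names = ["smith", "kovacs"]  # passed lowercase, with no surrounding context
removed = [name for name in names if name not in scrub(name)]
scrub_rate = 100 * len(removed) / len(names)
# scrub_rate == 50.0
```

A name counts as scrubbed only if it no longer appears in the output, which is the same pass/fail criterion applied per first name, surname, full name and name origin in the results below.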

Results

  • Across the whole dataset:
    • full names are scrubbed ~94% of the time
    • first names are scrubbed ~78% of the time
    • second names are scrubbed ~70% of the time.
  • This varies considerably by the linguistic source or language of the name, with Hungarian, British and Spanish surnames having the lowest scrub rate (< 60%).
  • In terms of full names, all languages have scrub rates of > 90%, except French and Spanish names.

Key points

  • These results show that name scrubbing struggles without context and when only a first name or surname is present; however, a complete lack of context is unlikely in 'real' text.
  • This reiterates the importance of maintaining semantic meaning (e.g. "I am [NAME]") and case (case-typical names, e.g. John Smith) in the text you pass to idscrub, as such names are identified much more readily by transformer-based NER models.
  • These results also show that name origin matters, but not in an obviously systematic way, with Nigerian and Croatian full names identified more readily than Spanish and French full names.
  • British surnames are surprisingly poorly identified owing to their overlap with common non-name words, such as 'hill', 'wood' and 'west'. A similar phenomenon occurs with Nigerian (English) names such as 'faith', 'peace' and 'blessing'.