-
Notifications
You must be signed in to change notification settings - Fork 9
fix: Remove overly broad NRP and PERSON entities from default PII detection (#47) #51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hi @gabor-openai , please do review this PR if possible, I've been using OpenAI tools alot and I really appreciate the kind of work you all do in the Agentic Workspace. Do consider this a very humble contribution if possible :) |
|
@codex pls review |
|
Codex Review: Didn't find any major issues. 👍 ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR addresses a critical usability issue where overly broad regex patterns for NRP and PERSON entities caused massive false positives in PII detection, particularly breaking non-English language support and making pre-flight masking mode unusable in production environments.
Key Changes:
- Removed
NRPandPERSONentities from default PII detection while keeping them available for explicit opt-in - Added deprecation warnings when these entities are explicitly configured to educate users about their limitations
- Implemented comprehensive test coverage (8 new tests) to verify the fix and backward compatibility
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
src/checks/pii.ts |
Core implementation: Created DEFAULT_PII_ENTITIES constant that filters out NRP and PERSON, added deprecation warning system with deduplication, integrated warnings into main pii function |
src/__tests__/unit/checks/pii.test.ts |
Added comprehensive test suite covering: default exclusion, Spanish text false positive prevention, capitalized phrase handling, explicit opt-in behavior, deprecation warnings, and warning deduplication |
docs/ref/checks/pii.md |
Added detailed deprecation notice explaining the issue, impact on production usage, migration paths to region-specific patterns or NER services, and reference to GitHub issue #47 |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
src/checks/pii.ts
Outdated
| ` - ${entity === PIIEntity.NRP ? 'NRP matches any two consecutive words (e.g., "nuevo cliente", "crea un")' : 'PERSON matches any two capitalized words (e.g., "New York", "The User")'}\n` + | ||
| ` - This pattern causes false positives in normal conversation, especially in non-English languages.\n` + | ||
| ` - Consider using more specific region-based patterns like SG_NRIC_FIN, UK_NINO, etc.\n` + | ||
| ` - To suppress this warning, remove PIIEntity.${entity} from your entities configuration.\n` + |
Copilot
AI
Dec 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The deprecation warning message says "To suppress this warning, remove PIIEntity.${entity} from your entities configuration." This is misleading because removing the entity is exactly what the user should do to fix the issue, not just to suppress the warning.
Consider rephrasing to something like: "This warning will appear when ${entity} is explicitly included in your entities configuration." or simply remove this line since the warning naturally disappears when the entity is removed.
| ` - To suppress this warning, remove PIIEntity.${entity} from your entities configuration.\n` + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@steven10a - hey Steven! Thanks for taking a look, is this significant or does the original "suppress warning" message make more sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@srivatsan0611 Thank you for the PR, we'll be merging it later today. I would say this Copilot comment is not significant and doesn't need to be addressed. I did just push a commit to short the message as I felt it was verbose but that was also [nitpick].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, saw that, what you pushed made more sense lol.
Thanks for the merge @steven10a :)
gabor-openai
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM TY!
Purpose of this PR
The
NRPandPERSONPII entities use regex patterns that are way too broad and cause massive false positives in production. This especially breaks non-English language support and makes the pre-flight masking mode basically unusable.Current broken behavior
Why these patterns are problematic
NRP pattern:
/\b[A-Za-z]+ [A-Za-z]+\b/gPERSON pattern:
/\b[A-Z][a-z]+ [A-Z][a-z]+\b/gImpact
Solution
1. Remove from default entity list
Keep the patterns available but exclude them from defaults:
This makes the default config actually usable while maintaining backward compatibility.
2. Add deprecation warnings
When users explicitly include these entities, show a clear warning:
The warning only shows once per entity per session to avoid log spam.
3. Update documentation
Added clear documentation explaining:
Why this works long-term
Backward compatibility: No breaking changes. Users with explicit entity configurations continue to work exactly as before.
Better defaults: The default configuration now works for international applications without masking normal text.
Clear migration path: Users who actually need person name detection or national registration numbers have better alternatives:
Prevents future issues: Documentation and warnings educate users upfront about the limitations.
Testing
All tests pass (27/27 including 8 new tests):
Manual verification with examples from the issue
Files changed
src/checks/pii.ts- Core implementation with new defaults and deprecation warningssrc/__tests__/unit/checks/pii.test.ts- Added 8 comprehensive testsdocs/ref/checks/pii.md- Updated with migration guideCloses #47