feat: Add HuggingFaceNerRecognizer for direct NER model inference #1834
Fixes #1833
Description
This PR adds HuggingFaceNerRecognizer, a recognizer that uses the HuggingFace Transformers pipeline directly for NER, bypassing spaCy tokenizer alignment issues.

Why is this needed?
The standard approach, which uses the spaCy tokenizer with TransformersNlpEngine, has alignment issues for agglutinative languages (Korean, Japanese, Turkish, etc.):
spaCy token: "김태웅이고" (name + particle)
NER model entity: "김태웅" (name only)
char_span() alignment fails, causing entities to be skipped or incorrectly bounded

Solution
HuggingFaceNerRecognizer bypasses spaCy alignment by using the HuggingFace pipeline directly.

Key Features
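To make the direct-pipeline approach concrete, here is a minimal sketch of how pipeline output could be turned into entity results without any tokenizer alignment step. It assumes the HuggingFace token-classification pipeline is run with an aggregation strategy, in which case each prediction is a dict with `entity_group`, `score`, `word`, `start` and `end`. The function name `to_results`, the sample predictions and the `XX` label are hypothetical and only illustrate the idea; this is not the code in the PR.

```python
from typing import Dict, List, Mapping

def to_results(
    predictions: List[Dict],
    label_mapping: Mapping[str, str],
    threshold: float = 0.5,
) -> List[Dict]:
    """Map raw pipeline predictions to entity results.

    Character offsets are taken straight from the model output, so no
    token-to-character alignment is needed.
    """
    results = []
    for pred in predictions:
        entity_type = label_mapping.get(pred["entity_group"])
        # Skip labels we do not map and predictions below the threshold.
        if entity_type is None or pred["score"] < threshold:
            continue
        results.append(
            {
                "entity_type": entity_type,
                "start": pred["start"],
                "end": pred["end"],
                "score": pred["score"],
            }
        )
    return results

# Hand-written stand-in for pipeline output (not real model output).
# PS and LC are the original labels mentioned in this PR; XX is made up.
sample_predictions = [
    {"entity_group": "PS", "score": 0.98, "word": "김태웅", "start": 6, "end": 9},
    {"entity_group": "LC", "score": 0.96, "word": "서울", "start": 12, "end": 14},
    {"entity_group": "PS", "score": 0.30, "word": "?", "start": 20, "end": 22},
    {"entity_group": "XX", "score": 0.99, "word": "?", "start": 30, "end": 32},
]

LABEL_MAPPING = {"PS": "PERSON", "LC": "LOCATION"}
results = to_results(sample_predictions, LABEL_MAPPING)
```

Because the offsets come from the model rather than from spaCy tokens, a name followed by a particle can be bounded at the name itself.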
An inspect-based loading mechanism allows any recognizer to receive dynamic parameters (e.g., model_name, device, label_mapping) directly from YAML, while ensuring full backward compatibility for legacy recognizers.

Changes
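The inspect-based loading idea can be sketched roughly as follows: look at the recognizer's constructor signature and pass only the configuration keys it accepts, unless it takes `**kwargs`. The helper `filter_kwargs_for` and both recognizer classes are hypothetical stand-ins, not the actual code in recognizers_loader_utils.py.

```python
import inspect
from typing import Any, Dict

def filter_kwargs_for(cls: type, config: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the config keys that cls's constructor can accept."""
    params = inspect.signature(cls).parameters
    accepts_var_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if accepts_var_kwargs:
        # The constructor takes **kwargs, so everything can be passed through.
        return dict(config)
    return {key: value for key, value in config.items() if key in params}

class LegacyRecognizer:
    """Hypothetical older recognizer that knows nothing about ML fields."""

    def __init__(self, supported_language: str = "en"):
        self.supported_language = supported_language

class ModelRecognizer:
    """Hypothetical newer recognizer that accepts model parameters."""

    def __init__(self, model_name: str, device: str = "cpu",
                 supported_language: str = "en"):
        self.model_name = model_name
        self.device = device
        self.supported_language = supported_language

# Same YAML-derived config for both; the model name is a placeholder.
config = {"model_name": "some/model", "device": "cpu", "supported_language": "ko"}

legacy = LegacyRecognizer(**filter_kwargs_for(LegacyRecognizer, config))
modern = ModelRecognizer(**filter_kwargs_for(ModelRecognizer, config))
```

With this kind of filtering, the legacy class never sees `model_name` or `device`, so it does not raise a `TypeError`, while the newer class receives every field.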
presidio_analyzer/predefined_recognizers/ner/huggingface_ner_recognizer.py
tests/test_huggingface_ner_recognizer.py
presidio_analyzer/input_validation/yaml_recognizer_models.py (Allow ML fields in config)
presidio_analyzer/recognizer_registry/recognizers_loader_utils.py (Pass custom args to recognizers)
  Uses inspect for backward compatibility, ensuring legacy recognizers are unaffected while enabling full kwargs support for new ones.
presidio_analyzer/predefined_recognizers/ner/__init__.py (export)
presidio_analyzer/predefined_recognizers/__init__.py (export)
presidio_analyzer/conf/default_recognizers.yaml (config, disabled by default)
docs/analyzer/recognizer_registry_provider.md (Added configuration example)
CHANGELOG.md

Verification Example: Side-by-Side Comparison
To demonstrate the need for this feature, I configured a test environment where both the new HuggingFaceNerRecognizer and the default SpacyRecognizer are enabled for Korean. This allows their output to be compared directly on the same text.

Test Configuration:
Request:
Result Analysis:
The response contains three entities. The first two are correct detections enabled by this PR, while the third is the incorrect "noise" produced by the default system.
HuggingFaceNerRecognizer correctly splits and identifies "김태웅" (Kim Taewoong) as PERSON and "서울" (Seoul) as LOCATION.
SpacyRecognizer fails to handle Korean agglutination, capturing the phrase "이름은 김태웅이고..." (My name is Kim Taewoong and...) as a single PERSON entity.

[
  {
    "analysis_explanation": {
      "original_score": 0.9791115522384644,
      "pattern": null,
      "pattern_name": null,
      "recognizer": "HuggingFace NER KR",
      "regex_flags": null,
      "score": 0.9791115522384644,
      "score_context_improvement": 0,
      "supportive_context_word": "",
      "textual_explanation": "Identified as PERSON by Leo97/KoELECTRA-small-v3-modu-ner (original label: PS)",
      "validation_result": null
    },
    "end": 9,
    "entity_type": "PERSON",
    "score": 0.9791115522384644,
    "start": 6
  },
  {
    "analysis_explanation": {
      "original_score": 0.9564878344535828,
      "pattern": null,
      "pattern_name": null,
      "recognizer": "HuggingFace NER KR",
      "regex_flags": null,
      "score": 0.9564878344535828,
      "score_context_improvement": 0,
      "supportive_context_word": "",
      "textual_explanation": "Identified as LOCATION by Leo97/KoELECTRA-small-v3-modu-ner (original label: LC)",
      "validation_result": null
    },
    "end": 14,
    "entity_type": "LOCATION",
    "score": 0.9564878344535828,
    "start": 12
  },
  {
    "analysis_explanation": {
      "original_score": 0.85,
      "pattern": null,
      "pattern_name": null,
      "recognizer": "SpacyRecognizer",
      "regex_flags": null,
      "score": 0.85,
      "score_context_improvement": 0,
      "supportive_context_word": "",
      "textual_explanation": "Identified as PERSON by Spacy's Named Entity Recognition",
      "validation_result": null
    },
    "end": 23,
    "entity_type": "PERSON",
    "score": 0.85,
    "start": 2
  }
]

Production Configuration Tip
Although the HuggingFace recognizer functions independently, the Presidio Analyzer Platform requires a default NLP engine declaration for startup. For production environments where spaCy is not needed, I recommend enabling SpacyRecognizer but setting supported_entities: [] to cleanly bypass it without generating noise.

Testing
Added unit tests covering:
English person/location/organization detection
Korean text with particles (agglutinative language demo)
Multiple entity types
Low confidence filtering
Custom label mapping
Empty text handling
Model name validation
Long text chunking & truncation safety
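For the long-text case in the list above, one possible way to chunk safely is sketched below. The PR description does not spell out its chunking strategy, so everything here is an assumption: the helper names `chunk_spans` and `predict_long_text`, the character-based window (real models are limited by tokens, not characters), and the `fake_predict` stand-in that replaces the model. The point is only to show overlapping windows, shifting offsets back to the full text, and dropping duplicates found in the overlap.

```python
from typing import Callable, Dict, List, Tuple

def chunk_spans(length: int, chunk_size: int, overlap: int) -> List[Tuple[int, int]]:
    """Return (start, end) windows that cover [0, length) with some overlap."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    spans = []
    start = 0
    while start < length:
        end = min(start + chunk_size, length)
        spans.append((start, end))
        if end == length:
            break
        start = end - overlap
    return spans

def predict_long_text(
    text: str,
    predict: Callable[[str], List[Dict]],
    chunk_size: int = 400,
    overlap: int = 40,
) -> List[Dict]:
    """Run predict on each window and map offsets back to the full text."""
    seen = set()
    merged = []
    for chunk_start, chunk_end in chunk_spans(len(text), chunk_size, overlap):
        for pred in predict(text[chunk_start:chunk_end]):
            start = pred["start"] + chunk_start
            end = pred["end"] + chunk_start
            key = (start, end, pred["entity_group"])
            if key in seen:
                # Same entity found again inside the overlapping region.
                continue
            seen.add(key)
            merged.append({**pred, "start": start, "end": end})
    return merged

def fake_predict(chunk: str) -> List[Dict]:
    """Stand-in for a model: tags every literal 'Seoul' as LC."""
    preds = []
    pos = chunk.find("Seoul")
    while pos != -1:
        preds.append(
            {"entity_group": "LC", "score": 0.9, "start": pos, "end": pos + 5}
        )
        pos = chunk.find("Seoul", pos + 1)
    return preds

long_text = "a" * 30 + " Seoul " + "b" * 30 + " Seoul"
entities = predict_long_text(long_text, fake_predict, chunk_size=40, overlap=10)
```

The overlap only helps when it is at least as long as the longest entity expected to straddle a window boundary; shorter overlaps can still split an entity.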
I have reviewed the contribution guidelines
I have signed the CLA (if required)
My code includes unit tests
All unit tests and lint checks pass locally
My PR contains documentation updates / additions if required