@ultramancode commented Jan 17, 2026

Fixes #1833

Description

This PR adds HuggingFaceNerRecognizer, a recognizer that runs NER through the HuggingFace Transformers pipeline directly, avoiding spaCy tokenizer alignment issues.

Why is this needed?

The standard approach, which pairs the spaCy tokenizer with TransformersNlpEngine, has alignment issues in agglutinative languages (Korean, Japanese, Turkish, etc.):

  • Particles/postpositions attach directly to nouns
  • The spaCy tokenizer keeps the particle in the token: "김태웅이고" (name + particle)
  • The NER model returns only the entity: "김태웅" (name only)
  • char_span() alignment fails because the entity boundary falls inside a token, so entities are skipped or given incorrect boundaries
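
For reviewers, here is a minimal sketch of the alignment failure described above. It is not spaCy code: a whitespace tokenizer stands in for the spaCy tokenizer (an assumption), and `strict_char_span` is a hypothetical helper that mimics the default strict behavior of spaCy's `Doc.char_span`, which returns None when the offsets do not fall on token boundaries.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def whitespace_token_spans(text: str) -> List[Span]:
    """Return character spans of whitespace-delimited tokens (stand-in tokenizer)."""
    spans: List[Span] = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        spans.append((start, end))
        pos = end
    return spans


def strict_char_span(token_spans: List[Span], start: int, end: int) -> Optional[Span]:
    """Return (start, end) only if both offsets sit on token boundaries, else None."""
    starts = {s for s, _ in token_spans}
    ends = {e for _, e in token_spans}
    return (start, end) if start in starts and end in ends else None


text = "제 이름은 김태웅이고 서울에 살고 있습니다."
tokens = whitespace_token_spans(text)

# The NER model reports the name only (characters 6-9), but the token that
# contains it also carries the particle (characters 6-11).
entity = (6, 9)
aligned = strict_char_span(tokens, *entity)
print(text[6:9], text[6:11], aligned)
```

The entity ends at offset 9, inside the token that spans 6 to 11, so strict alignment has nothing to return.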

Solution

HuggingFaceNerRecognizer sidesteps spaCy alignment by calling the HuggingFace pipeline directly and taking entity boundaries from the model output rather than from spaCy tokens.
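
As a rough illustration of the direct approach (not the code in this PR), the sketch below converts pipeline-style predictions into entity results. The input dictionaries follow the shape that the Transformers token-classification pipeline returns when an aggregation strategy is set (`entity_group`, `score`, `word`, `start`, `end`). The helper `to_results`, the default threshold, the sample scores, and the use of aggregation are assumptions for illustration; the PS and LC labels come from the results shown in this PR.

```python
from typing import Any, Dict, List

# Assumed mapping; PS and LC are the original model labels seen in this PR's output.
LABEL_MAPPING = {"PS": "PERSON", "LC": "LOCATION"}


def to_results(
    predictions: List[Dict[str, Any]],
    label_mapping: Dict[str, str],
    threshold: float = 0.5,  # hypothetical default
) -> List[Dict[str, Any]]:
    """Map model labels to entity types and drop low-confidence or unmapped spans."""
    results = []
    for pred in predictions:
        entity_type = label_mapping.get(pred["entity_group"])
        if entity_type is None or pred["score"] < threshold:
            continue
        results.append(
            {
                "entity_type": entity_type,
                "start": pred["start"],  # character offsets come straight from the model
                "end": pred["end"],
                "score": float(pred["score"]),
            }
        )
    return results


# Hand-written stand-in for the output of a call such as
# transformers.pipeline("token-classification", model=..., aggregation_strategy="simple")(text).
# Scores here are illustrative, not measured.
fake_predictions = [
    {"entity_group": "PS", "score": 0.98, "word": "김태웅", "start": 6, "end": 9},
    {"entity_group": "LC", "score": 0.96, "word": "서울", "start": 12, "end": 14},
    {"entity_group": "LC", "score": 0.20, "word": "살고", "start": 16, "end": 18},
]
results = to_results(fake_predictions, LABEL_MAPPING)
print(results)
```

No token alignment step appears anywhere in this path, which is the point of bypassing the spaCy tokenizer.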

Key Features

  • Language-agnostic: Works with any HuggingFace NER model
  • Direct inference: No spaCy tokenizer dependency for entity boundaries
  • Extensible YAML Configuration: The loader uses inspect to match YAML fields against each recognizer's constructor signature, so any recognizer can receive dynamic parameters (e.g., model_name, device, label_mapping) directly from YAML while legacy recognizers stay fully backward compatible.
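
The signature-matching idea behind the YAML loading can be sketched as follows. This is a simplified illustration, not the loader code in this PR: `filter_kwargs_for`, `LegacyRecognizer`, and `ModelRecognizer` are hypothetical names, and the real loader may handle more cases.

```python
import inspect
from typing import Any, Dict


def filter_kwargs_for(cls: type, config: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the config keys that cls's constructor accepts."""
    params = inspect.signature(cls).parameters
    accepts_var_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if accepts_var_kwargs:
        # The constructor takes **kwargs, so every field can be forwarded.
        return dict(config)
    return {key: value for key, value in config.items() if key in params}


class LegacyRecognizer:
    """Hypothetical recognizer that predates the ML-specific fields."""

    def __init__(self, supported_language: str = "en"):
        self.supported_language = supported_language


class ModelRecognizer:
    """Hypothetical recognizer that accepts ML-specific fields."""

    def __init__(self, model_name: str, device: str = "cpu", supported_language: str = "en"):
        self.model_name = model_name
        self.device = device
        self.supported_language = supported_language


yaml_fields = {"model_name": "some/model", "device": "cpu", "supported_language": "ko"}

legacy = LegacyRecognizer(**filter_kwargs_for(LegacyRecognizer, yaml_fields))
modern = ModelRecognizer(**filter_kwargs_for(ModelRecognizer, yaml_fields))
print(legacy.supported_language, modern.model_name)
```

Filtering before instantiation means a legacy constructor never sees an unexpected keyword, which is what keeps existing recognizers working unchanged.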

Changes

  • NEW: presidio_analyzer/predefined_recognizers/ner/huggingface_ner_recognizer.py
  • NEW: tests/test_huggingface_ner_recognizer.py
  • MODIFY: presidio_analyzer/input_validation/yaml_recognizer_models.py (Allow ML fields in config)
  • MODIFY: presidio_analyzer/recognizer_registry/recognizers_loader_utils.py (Pass custom args to recognizers)
    • Argument passing uses inspect to match each recognizer's constructor signature, so legacy recognizers are unaffected while new ones receive full kwargs support.
  • MODIFY: presidio_analyzer/predefined_recognizers/ner/__init__.py (export)
  • MODIFY: presidio_analyzer/predefined_recognizers/__init__.py (export)
  • MODIFY: presidio_analyzer/conf/default_recognizers.yaml (config, disabled by default)
  • MODIFY: docs/analyzer/recognizer_registry_provider.md (Added configuration example)
  • MODIFY: CHANGELOG.md

Verification Example: Side-by-Side Comparison

To demonstrate the need for this feature, I configured a test environment where both the new HuggingFaceNerRecognizer and the default SpacyRecognizer are enabled for Korean. This allows their output to be compared directly on the same text.

Test Configuration:

recognizers:
  - name: "HuggingFace NER KR"
    class_name: "HuggingFaceNerRecognizer"
    model_name: "Leo97/KoELECTRA-small-v3-modu-ner"
    supported_languages: ["ko"]
  
  - name: "SpacyRecognizer" # Intentionally enabled for comparison
    class_name: "SpacyRecognizer"
    supported_languages: ["ko"]
    enabled: true

Request:

curl --location 'http://localhost:3000/analyze' \
--header 'Content-Type: application/json' \
--data '{
    "text": "제 이름은 김태웅이고 서울에 살고 있습니다.",
    "language": "ko",
    "return_decision_process": true
}'

Result Analysis:
The response contains three entities. The first two are correct detections enabled by this PR; the third is incorrect "noise" produced by the default SpacyRecognizer.

  • Results 1 & 2 (Success): The HuggingFaceNerRecognizer correctly isolates "김태웅" (Kim Taewoong, characters 6-9) as PERSON and "서울" (Seoul, characters 12-14) as LOCATION, leaving the attached particles out of the span.
  • Result 3 (Failure): The SpacyRecognizer fails to handle Korean agglutination and tags characters 2-23, "이름은 김태웅이고 서울에 살고 있습니다" (nearly the entire sentence), as a single PERSON entity.
[
    {
        "analysis_explanation": {
            "original_score": 0.9791115522384644,
            "pattern": null,
            "pattern_name": null,
            "recognizer": "HuggingFace NER KR",
            "regex_flags": null,
            "score": 0.9791115522384644,
            "score_context_improvement": 0,
            "supportive_context_word": "",
            "textual_explanation": "Identified as PERSON by Leo97/KoELECTRA-small-v3-modu-ner (original label: PS)",
            "validation_result": null
        },
        "end": 9,
        "entity_type": "PERSON",
        "score": 0.9791115522384644,
        "start": 6
    },
    {
        "analysis_explanation": {
            "original_score": 0.9564878344535828,
            "pattern": null,
            "pattern_name": null,
            "recognizer": "HuggingFace NER KR",
            "regex_flags": null,
            "score": 0.9564878344535828,
            "score_context_improvement": 0,
            "supportive_context_word": "",
            "textual_explanation": "Identified as LOCATION by Leo97/KoELECTRA-small-v3-modu-ner (original label: LC)",
            "validation_result": null
        },
        "end": 14,
        "entity_type": "LOCATION",
        "score": 0.9564878344535828,
        "start": 12
    },
    {
        "analysis_explanation": {
            "original_score": 0.85,
            "pattern": null,
            "pattern_name": null,
            "recognizer": "SpacyRecognizer",
            "regex_flags": null,
            "score": 0.85,
            "score_context_improvement": 0,
            "supportive_context_word": "",
            "textual_explanation": "Identified as PERSON by Spacy's Named Entity Recognition",
            "validation_result": null
        },
        "end": 23,
        "entity_type": "PERSON",
        "score": 0.85,
        "start": 2
    }
]
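
To double-check the offsets in the response above, the following snippet slices the request text with each reported start/end pair. The `results` list is a trimmed copy of the JSON, keeping only the fields needed for the check.

```python
text = "제 이름은 김태웅이고 서울에 살고 있습니다."

# Trimmed copy of the response above: recognizer, entity type, and offsets only.
results = [
    {"recognizer": "HuggingFace NER KR", "entity_type": "PERSON", "start": 6, "end": 9},
    {"recognizer": "HuggingFace NER KR", "entity_type": "LOCATION", "start": 12, "end": 14},
    {"recognizer": "SpacyRecognizer", "entity_type": "PERSON", "start": 2, "end": 23},
]

# Python slices use exclusive end offsets, matching the response format.
matched = [text[r["start"]:r["end"]] for r in results]

for result, span in zip(results, matched):
    print(f'{result["recognizer"]:<20} {result["entity_type"]:<9} {span!r}')
```

The first two slices contain only the name and the city; the third covers everything from "이름은" through "있습니다".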

Production Configuration Tip

Although the HuggingFace recognizer runs independently of spaCy, the Presidio Analyzer still requires a default NLP engine to be declared at startup. For production environments that do not need spaCy's NER output, I recommend keeping SpacyRecognizer enabled but setting supported_entities: [] so that it reports no entities and adds no noise.

- name: "SpacyRecognizer"
  class_name: "SpacyRecognizer"
  supported_languages: ["ko"]
  enabled: true # Ensure at least one NLP engine is active for this language
  supported_entities: [] # Silence the default engine

Testing

Added unit tests covering:

  • English person/location/organization detection
  • Korean text with particles (agglutinative language demo)
  • Multiple entity types
  • Low confidence filtering
  • Custom label mapping
  • Empty text handling
  • Model name validation
  • Long text chunking & truncation safety

Checklist

  • I have reviewed the contribution guidelines
  • I have signed the CLA (if required)
  • My code includes unit tests
  • All unit tests and lint checks pass locally
  • My PR contains documentation updates / additions if required

- Bypass spaCy tokenizer alignment issues with agglutinative languages
- Support any HuggingFace token-classification model
@ultramancode (Author) commented

@microsoft-github-policy-service agree

- Text chunking with configurable overlap
- Batch inference with fallback for compatibility
- Deduplication keeping highest confidence scores
- Allow ML-specific fields (model_name, device, etc.) in PredefinedRecognizerConfig.
- Implement dynamic argument filtering in loader to match recognizer signatures.
- Enable YAML support for HuggingFaceNerRecognizer while preserving backward compatibility.
- Document HuggingFace NER YAML configuration standard
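
The chunking and deduplication items in the notes above can be sketched roughly as below. This is an illustration only, not the implementation in this PR: it assumes character-based chunks, and `chunk_text` and `dedupe_keep_best` are hypothetical names.

```python
from typing import Any, Dict, List, Tuple


def chunk_text(text: str, chunk_size: int, overlap: int) -> List[Tuple[int, str]]:
    """Split text into (offset, chunk) pairs where consecutive chunks share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    chunks: List[Tuple[int, str]] = []
    step = chunk_size - overlap
    start = 0
    while start < len(text):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


def dedupe_keep_best(entities: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """For entities with the same span and type, keep the one with the highest score."""
    best: Dict[Tuple[int, int, str], Dict[str, Any]] = {}
    for entity in entities:
        key = (entity["start"], entity["end"], entity["entity_type"])
        if key not in best or entity["score"] > best[key]["score"]:
            best[key] = entity
    return sorted(best.values(), key=lambda e: e["start"])


chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)

# An entity inside an overlap region is detected once per chunk; offsets are
# assumed to be already shifted back to document coordinates.
detections = [
    {"entity_type": "PERSON", "start": 6, "end": 9, "score": 0.91},
    {"entity_type": "PERSON", "start": 6, "end": 9, "score": 0.97},
    {"entity_type": "LOCATION", "start": 12, "end": 14, "score": 0.95},
]
unique = dedupe_keep_best(detections)
print(chunks, unique)
```

The overlap gives an entity that straddles a chunk boundary a chance to appear whole in at least one chunk, and the deduplication step removes the duplicates that the overlap introduces.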
@ultramancode (Author) commented

I’ll reuse PR #1805’s chunking logic once it’s merged to keep things consistent. Any other feedback is welcome.
