feat: Add HuggingFaceNerRecognizer for direct NER model inference #1834

ultramancode · 2026-01-17T16:20:30Z

Description

This PR adds HuggingFaceNerRecognizer, a recognizer that runs NER through the HuggingFace Transformers pipeline directly, bypassing spaCy tokenizer alignment issues.

Why is this needed?

The standard approach, a spaCy tokenizer combined with TransformersNlpEngine, has alignment issues for agglutinative languages (Korean, Japanese, Turkish, etc.):

- Particles/postpositions attach to nouns.
- The spaCy tokenizer keeps the particle in the token: "김태웅이고" (name + particle).
- The NER model returns only the entity: "김태웅" (name only).
- char_span() alignment then fails, so entities are skipped or given incorrect boundaries.
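
The failure mode can be reproduced without spaCy installed. The sketch below is a simplified stand-in for strict character-to-token alignment, where a span is rejected if either offset falls inside a token. The whitespace tokenizer and the `strict_char_span` helper are illustrative assumptions, not spaCy or Presidio code.

```python
from typing import List, Optional, Tuple

Token = Tuple[int, int]  # (start, end) character offsets, end exclusive


def whitespace_tokens(text: str) -> List[Token]:
    """Split on whitespace and record each token's character offsets."""
    tokens, pos = [], 0
    for piece in text.split():
        start = text.index(piece, pos)
        tokens.append((start, start + len(piece)))
        pos = start + len(piece)
    return tokens


def strict_char_span(
    tokens: List[Token], start: int, end: int
) -> Optional[Tuple[int, int]]:
    """Return (first_token, last_token_exclusive), or None if an offset falls inside a token."""
    starts = {s: i for i, (s, _) in enumerate(tokens)}
    ends = {e: i for i, (_, e) in enumerate(tokens)}
    if start in starts and end in ends and starts[start] <= ends[end]:
        return starts[start], ends[end] + 1
    return None


text = "제 이름은 김태웅이고 서울에 살고 있습니다."
tokens = whitespace_tokens(text)

# The NER model reports the name only: characters 6..9 ("김태웅").
name_only = strict_char_span(tokens, 6, 9)
# The full token, particle included: characters 6..11 ("김태웅이고").
name_with_particle = strict_char_span(tokens, 6, 11)

print(name_only)           # None: offset 9 falls inside the token "김태웅이고"
print(name_with_particle)  # (2, 3)
```

The name-only span is rejected because its end offset does not coincide with a token boundary, which is the situation described in the list above.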

Solution

HuggingFaceNerRecognizer bypasses spaCy alignment by running the HuggingFace pipeline directly, so entity boundaries come from the model's output rather than from spaCy tokens.
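
As a rough illustration of direct inference, the sketch below converts the output of a Transformers token-classification pipeline into entity spans. It is not the code in this PR: the `Detection` type, `to_detections`, `load_ner_pipeline`, and the 0.5 threshold are hypothetical names and values, and the sample predictions are hand-written in the shape the pipeline returns with `aggregation_strategy="simple"`, using offsets and (rounded) scores from the example response in this description.

```python
from typing import Any, Dict, Iterable, List, NamedTuple


class Detection(NamedTuple):
    entity_type: str
    start: int
    end: int
    score: float


# Mapping from model labels to Presidio entity types, as seen in the
# example response of this PR (PS -> PERSON, LC -> LOCATION).
LABEL_MAPPING = {"PS": "PERSON", "LC": "LOCATION"}


def load_ner_pipeline(model_name: str):
    """Build a token-classification pipeline (needs the transformers package)."""
    from transformers import pipeline  # lazy import: the rest runs without it

    return pipeline(
        "token-classification", model=model_name, aggregation_strategy="simple"
    )


def to_detections(
    predictions: Iterable[Dict[str, Any]],
    label_mapping: Dict[str, str],
    threshold: float = 0.5,  # assumed default, not taken from the PR
) -> List[Detection]:
    """Map pipeline predictions to detections, dropping unmapped or low-score ones."""
    results = []
    for pred in predictions:
        entity_type = label_mapping.get(pred["entity_group"])
        if entity_type is None or pred["score"] < threshold:
            continue
        results.append(
            Detection(entity_type, pred["start"], pred["end"], float(pred["score"]))
        )
    return results


# Hand-written sample in the pipeline's output shape (not a real model run).
sample_predictions = [
    {"entity_group": "PS", "score": 0.9791, "word": "김태웅", "start": 6, "end": 9},
    {"entity_group": "LC", "score": 0.9565, "word": "서울", "start": 12, "end": 14},
]
detections = to_detections(sample_predictions, LABEL_MAPPING)
print(detections)
```

With a real model, `load_ner_pipeline("Leo97/KoELECTRA-small-v3-modu-ner")(text)` would supply the predictions; the character offsets are used as-is, with no token alignment step.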

Key Features

- Language-agnostic: works with any HuggingFace NER model.
- Direct inference: no spaCy tokenizer dependency for entity boundaries.
- Extensible YAML configuration: the loader uses inspect to match YAML parameters (e.g., model_name, device, label_mapping) against each recognizer's signature, so any recognizer can receive dynamic parameters directly from YAML while legacy recognizers remain fully backward compatible.
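
The signature-matching idea can be sketched with `inspect.signature`. This is a minimal illustration under assumptions, not the loader code from this PR: `filter_kwargs_for`, `LegacyRecognizer`, and `MlRecognizer` are hypothetical names.

```python
import inspect
from typing import Any, Dict


def filter_kwargs_for(cls: type, config: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the config keys that cls.__init__ can accept."""
    params = dict(inspect.signature(cls.__init__).parameters)
    params.pop("self", None)
    # A constructor that declares **kwargs accepts every key.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(config)
    return {key: value for key, value in config.items() if key in params}


class LegacyRecognizer:  # hypothetical: fixed signature, no ML options
    def __init__(self, supported_language: str = "en"):
        self.supported_language = supported_language


class MlRecognizer:  # hypothetical: accepts model options
    def __init__(
        self, model_name: str, device: str = "cpu", supported_language: str = "en"
    ):
        self.model_name = model_name
        self.device = device
        self.supported_language = supported_language


config = {
    "model_name": "Leo97/KoELECTRA-small-v3-modu-ner",
    "device": "cpu",
    "supported_language": "ko",
}

legacy = LegacyRecognizer(**filter_kwargs_for(LegacyRecognizer, config))
ml = MlRecognizer(**filter_kwargs_for(MlRecognizer, config))

print(filter_kwargs_for(LegacyRecognizer, config))  # only supported_language survives
print(ml.model_name)
```

Filtering by signature means a recognizer with a fixed constructor never sees keys it does not declare, which is one way to keep older recognizers working when new YAML fields are introduced.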

Changes

- NEW: presidio_analyzer/predefined_recognizers/ner/huggingface_ner_recognizer.py
- NEW: tests/test_huggingface_ner_recognizer.py
- MODIFY: presidio_analyzer/input_validation/yaml_recognizer_models.py (Allow ML fields in config)
- MODIFY: presidio_analyzer/recognizer_registry/recognizers_loader_utils.py (Pass custom args to recognizers)
  - Arguments are passed according to each recognizer's signature (via inspect), so legacy recognizers are unaffected while new ones get full kwargs support.
- MODIFY: presidio_analyzer/predefined_recognizers/ner/__init__.py (export)
- MODIFY: presidio_analyzer/predefined_recognizers/__init__.py (export)
- MODIFY: presidio_analyzer/conf/default_recognizers.yaml (config, disabled by default)
- MODIFY: docs/analyzer/recognizer_registry_provider.md (Added configuration example)
- MODIFY: CHANGELOG.md

Verification Example: Side-by-Side Comparison

To demonstrate the need for this feature, I configured a test environment in which both the new HuggingFaceNerRecognizer and the default SpacyRecognizer are enabled for Korean, so their output can be compared directly on the same text.

Test Configuration:

recognizers:
  - name: "HuggingFace NER KR"
    class_name: "HuggingFaceNerRecognizer"
    model_name: "Leo97/KoELECTRA-small-v3-modu-ner"
    supported_languages: ["ko"]
  
  - name: "SpacyRecognizer" # Intentionally enabled for comparison
    class_name: "SpacyRecognizer"
    supported_languages: ["ko"]
    enabled: true

Request:

curl --location 'http://localhost:3000/analyze' \
--header 'Content-Type: application/json' \
--data '{
    "text": "제 이름은 김태웅이고 서울에 살고 있습니다.",
    "language": "ko",
    "return_decision_process": true
}'

Result Analysis:
The response contains three entities. The first two are correct detections from the recognizer added in this PR; the third is incorrect noise produced by the default SpacyRecognizer.

- Results 1 and 2 (success): HuggingFaceNerRecognizer correctly isolates "김태웅" (Kim Taewoong) as PERSON and "서울" (Seoul) as LOCATION.
- Result 3 (failure): SpacyRecognizer fails to handle Korean agglutination and captures the phrase "이름은 김태웅이고..." ("name is Kim Taewoong and...") as a single PERSON entity. Its offsets (2 to 23) cover nearly the entire sentence.

[
    {
        "analysis_explanation": {
            "original_score": 0.9791115522384644,
            "pattern": null,
            "pattern_name": null,
            "recognizer": "HuggingFace NER KR",
            "regex_flags": null,
            "score": 0.9791115522384644,
            "score_context_improvement": 0,
            "supportive_context_word": "",
            "textual_explanation": "Identified as PERSON by Leo97/KoELECTRA-small-v3-modu-ner (original label: PS)",
            "validation_result": null
        },
        "end": 9,
        "entity_type": "PERSON",
        "score": 0.9791115522384644,
        "start": 6
    },
    {
        "analysis_explanation": {
            "original_score": 0.9564878344535828,
            "pattern": null,
            "pattern_name": null,
            "recognizer": "HuggingFace NER KR",
            "regex_flags": null,
            "score": 0.9564878344535828,
            "score_context_improvement": 0,
            "supportive_context_word": "",
            "textual_explanation": "Identified as LOCATION by Leo97/KoELECTRA-small-v3-modu-ner (original label: LC)",
            "validation_result": null
        },
        "end": 14,
        "entity_type": "LOCATION",
        "score": 0.9564878344535828,
        "start": 12
    },
    {
        "analysis_explanation": {
            "original_score": 0.85,
            "pattern": null,
            "pattern_name": null,
            "recognizer": "SpacyRecognizer",
            "regex_flags": null,
            "score": 0.85,
            "score_context_improvement": 0,
            "supportive_context_word": "",
            "textual_explanation": "Identified as PERSON by Spacy's Named Entity Recognition",
            "validation_result": null
        },
        "end": 23,
        "entity_type": "PERSON",
        "score": 0.85,
        "start": 2
    }
]
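
The offsets in the response can be checked by slicing the request text. The snippet below is a small verification sketch; the tuples are copied by hand from the response above rather than parsed from it.

```python
text = "제 이름은 김태웅이고 서울에 살고 있습니다."

# (recognizer, entity_type, start, end) copied from the response above.
results = [
    ("HuggingFace NER KR", "PERSON", 6, 9),
    ("HuggingFace NER KR", "LOCATION", 12, 14),
    ("SpacyRecognizer", "PERSON", 2, 23),
]

# Slice the text with each result's offsets to see what was actually tagged.
spans = {
    (recognizer, entity_type): text[start:end]
    for recognizer, entity_type, start, end in results
}

for key, span in spans.items():
    print(key, repr(span))
```

The two HuggingFace results slice out exactly the name and the city, while the SpacyRecognizer result covers everything from "이름은" up to the final period.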

Production Configuration Tip

Although the HuggingFace recognizer works independently, the Presidio Analyzer requires a default NLP engine declaration at startup. For production environments where spaCy's entities are not needed, I recommend keeping SpacyRecognizer enabled but setting supported_entities: [] so that it is bypassed cleanly and generates no noise.

- name: "SpacyRecognizer"
  class_name: "SpacyRecognizer"
  supported_languages: ["ko"]
  enabled: true # Ensure at least one NLP engine is active for this language
  supported_entities: [] # Silence the default engine

Testing

Added unit tests covering:

- English person/location/organization detection
- Korean text with particles (agglutinative language demo)
- Multiple entity types
- Low confidence filtering
- Custom label mapping
- Empty text handling
- Model name validation
- Long text chunking & truncation safety

Checklist:

- I have reviewed the contribution guidelines
- I have signed the CLA (if required)
- My code includes unit tests
- All unit tests and lint checks pass locally
- My PR contains documentation updates / additions if required

ultramancode · 2026-01-17T16:25:09Z

@microsoft-github-policy-service agree

- Allow ML-specific fields (model_name, device, etc.) in PredefinedRecognizerConfig.
- Implement dynamic argument filtering in loader to match recognizer signatures.
- Enable YAML support for HuggingFaceNerRecognizer while preserving backward compatibility.

ultramancode · 2026-01-24T20:00:47Z

I’ll reuse PR #1805’s chunking logic once it’s merged to keep things consistent. Any other feedback is welcome.

feat: Add HuggingFaceNerRecognizer for direct NER model inference

bde8242

- Bypass spaCy tokenizer alignment issues with agglutinative languages
- Support any HuggingFace token-classification model

github-actions bot added the external label Jan 17, 2026

ultramancode added 4 commits January 18, 2026 10:39

feat: add chunking and batch processing to HuggingFaceNerRecognizer

c2cd23e

- Text chunking with configurable overlap
- Batch inference with fallback for compatibility
- Deduplication keeping highest confidence scores
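
For readers unfamiliar with the pattern, overlap chunking and score-based deduplication can be sketched as follows. This is an assumption-laden illustration, not the implementation in this commit: it chunks by characters, and `chunk_text`, `deduplicate`, and the `Entity` tuple are hypothetical names.

```python
from typing import Dict, List, Tuple

Entity = Tuple[str, int, int, float]  # (entity_type, start, end, score)


def chunk_text(text: str, chunk_size: int, overlap: int) -> List[Tuple[int, str]]:
    """Return (offset, chunk) pairs; consecutive chunks share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for offset in range(0, len(text), step):
        chunks.append((offset, text[offset:offset + chunk_size]))
        if offset + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


def deduplicate(entities: List[Entity]) -> List[Entity]:
    """Keep the highest-scoring entity for each (type, start, end) key."""
    best: Dict[Tuple[str, int, int], Entity] = {}
    for entity in entities:
        key = entity[:3]
        if key not in best or entity[3] > best[key][3]:
            best[key] = entity
    return sorted(best.values(), key=lambda e: (e[1], e[2]))


chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)
print(chunks)  # [(0, 'abcd'), (2, 'cdef'), (4, 'efgh'), (6, 'ghij')]

# The same span found in two overlapping chunks, with different scores.
found = [
    ("PERSON", 6, 9, 0.91),
    ("PERSON", 6, 9, 0.98),
    ("LOCATION", 12, 14, 0.95),
]
print(deduplicate(found))  # [('PERSON', 6, 9, 0.98), ('LOCATION', 12, 14, 0.95)]
```

In a real recognizer, offsets reported for a chunk would be shifted by that chunk's offset before deduplication, so that entities found in the overlap region map to the same key.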

docs: add ML recognizer YAML configuration guide

b47e2df

- Document HuggingFace NER YAML configuration standard

docs: add configuration guide for HuggingFaceNerRecognizer

7dd8130
