Conversation

@jedheaj314 (Contributor) commented Dec 2, 2025

Change Description

Enables the GLiNER recognizer to be constructed with a chunker and to apply character-based text chunking when `analyze` is called on long text.

  • Updated the GLiNER recognizer to call the predict function after chunking long text
  • New base and character-based chunkers
  • New chunking util for deduplication and offset calculations
  • Unit tests

Issue reference

Fixes #1569

Checklist

  • I have reviewed the contribution guidelines
  • I have signed the CLA (if required)
  • My code includes unit tests
  • All unit tests and lint checks pass locally
  • My PR contains documentation updates / additions if required

Architectural/Technical Decisions:

🎯 Problem Statement

Issue #1569: the GLiNER checkpoint has a maximum sequence length of 384 tokens. Documents exceeding this limit are truncated, the only signal being a warning: "UserWarning: Sentence of length 20415 has been truncated to 384".

Impact:

  • Security: PII entities beyond 384 tokens are not detected
  • Compliance: Incomplete data protection scanning (GDPR, HIPAA violations)
  • Reliability: Near-silent failure mode; the only signal is an easily missed warning

Root Cause: GLiNER truncates input text to its fixed 384-token context window before detecting PII entities.


🔍 Approach Evaluation

Option 1: Increase Model Context Window

Description: Retrain or use larger GLiNER model with higher token limit
Decision: Rejected - Not feasible; the GLiNER architecture is fixed


Option 2: Token-Based Chunking

Description: Use GLiNER's tokenizer to chunk at exact token boundaries

| Pros | Cons |
| --- | --- |
| Precise token count control | Requires tokenizer loading |
| Guarantees < 384 tokens | Model-specific implementation |
| Optimal model utilization | Complex offset calculation |
| | Tokenizer adds latency |
| | Ties implementation to model architecture |

Decision: Rejected - Complexity outweighs benefits


Option 3: Character-Based Chunking ✅

Description: Split text by character count with configurable overlap

| Pros | Cons |
| --- | --- |
| Simple implementation | Approximate token count |
| Model-agnostic | Needs safety margin |
| Fast (no tokenization) | May underutilize context window |
| Easy to configure | Requires deduplication |
| Works with any NER model | |

Decision: SELECTED - Best balance of simplicity and effectiveness

Rationale:

  • GLiNER averages ~1.5 chars/token
  • 250 chars = ~166 tokens (safe margin under 384)
  • Simplicity enables maintainability
  • Works with future models without changes

Option 4: Sentence-Based Chunking

Description: Split on sentence boundaries using NLP

| Pros | Cons |
| --- | --- |
| Natural language boundaries | Requires sentence tokenizer |
| Preserves context | Variable chunk sizes |
| Better semantic coherence | Long sentences problematic |
| | Adds dependency |

Decision: Rejected - Unnecessary complexity for current needs

Future Consideration: Could implement as alternative strategy later


🎛️ Key Design Decisions

Decision 1: Chunk Size = 250 characters

Options Considered:

  • 100 chars: Too conservative, many chunks, slow
  • 250 chars: ✅ Selected - ~166 tokens, safe margin
  • 500 chars: Too risky, may exceed 384 token limit

Analysis:

Token estimation: tokens ≈ chars / 1.5
250 / 1.5 ≈ 166 tokens (< 384 ✓)
Safety margin: 384 - 166 = 218 tokens = 57% buffer

Why 250?

  • Proven ratio from gliner-spacy reference implementation
  • Sufficient context for entity recognition
  • Safe margin for token variation (special chars, unicode)
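To make the arithmetic concrete, a minimal sketch (the helper name and the ~1.5 chars/token ratio are assumptions taken from the analysis above, not part of the merged code):

```python
def estimate_tokens(text: str, chars_per_token: float = 1.5) -> int:
    """Conservative token estimate using the ~1.5 chars/token ratio
    observed for the GLiNER checkpoint."""
    return int(len(text) / chars_per_token)

# A 250-char chunk stays well under the 384-token window
assert estimate_tokens("x" * 250) <= 384  # ~166 tokens
```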

Decision 2: Overlap = 50 characters (20%)

Problem: Entities at chunk boundaries might be split

Options Considered:

  • 0% overlap: Fast but misses boundary entities ❌
  • 10% (25 chars): ❌ Too small, may still split entities
  • 20% (50 chars): ✅ Selected - catches boundary entities
  • 50% (125 chars): ❌ Excessive duplication, slow

Why 20%?

  • Average entity length: 10-30 characters
  • 50 char overlap covers typical entity spans
  • Balances coverage vs redundant processing
  • Standard practice in NLP chunking
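A quick sanity check of the guarantee this buys: any entity shorter than the overlap that straddles a chunk boundary is fully contained in the following chunk (illustrative numbers only):

```python
chunk_size, overlap = 250, 50
boundary = chunk_size                    # first chunk ends here (pre-extension)
entity = (boundary - 10, boundary + 10)  # a 20-char entity straddling the boundary

# The next chunk restarts `overlap` chars before the boundary
next_chunk = (boundary - overlap, boundary - overlap + chunk_size)
assert next_chunk[0] <= entity[0] and entity[1] <= next_chunk[1]
```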

Decision 3: Word Boundary Preservation

Problem: Mid-word breaks confuse NER models

# Extends chunk to nearest word boundary
while end < len(text) and text[end] not in [" ", "\n"]:
    end += 1

Options Considered:

| Approach | Pros | Cons | Decision |
| --- | --- | --- | --- |
| Hard cutoff | Simple | Breaks words | ❌ |
| Word boundary | Preserves context | Variable chunk size | ✅ Selected |
| Sentence boundary | Best context | Too complex | ❌ |

Trade-off Accepted:

  • Chunks may exceed 250 chars slightly
  • Still well under token limit (tested: max ~280 chars)
  • Better entity detection outweighs strict size limit
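Putting Decisions 1-3 together, a minimal sketch of the character-based splitter (the function name and return shape are illustrative, not the merged API; assumes chunk_size > chunk_overlap):

```python
def chunk_text(text: str, chunk_size: int = 250, chunk_overlap: int = 50):
    """Split text into overlapping chunks, extending each chunk to the
    nearest word boundary. Yields (chunk, start_offset) pairs."""
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        # Extend to the nearest word boundary; terminates because `end`
        # is bounded by len(text)
        while end < len(text) and text[end] not in (" ", "\n"):
            end += 1
        yield text[start:end], start
        if end >= len(text):
            break
        start = end - chunk_overlap  # step back to create the overlap
```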

Decision 4: Deduplication Strategy

Problem: Overlapping chunks produce duplicate entities

Approach: Score-based deduplication with overlap threshold

overlap_ratio = overlap_length / min(entity1_length, entity2_length)
if overlap_ratio > 0.5:  # 50% threshold
    keep_highest_score_entity()

Why 50% threshold?

  • Catches true duplicates: [10:20] and [10:20] → 100% overlap
  • Allows partial overlaps: [10:20] and [15:25] → 50% overlap
  • Avoids false positives: [10:20] and [30:40] → 0% overlap

Alternative Considered:

  • Exact match only (start==start, end==end)
    • ❌ Misses duplicates with slight offset differences
    • Models may return [10:20] in one chunk, [11:20] in another
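As a runnable sketch of the pseudocode above (entity dicts with start/end/score keys are an assumption; the merged util's signature may differ):

```python
def deduplicate(entities: list, threshold: float = 0.5) -> list:
    """O(n^2) score-based deduplication: among spans overlapping by more
    than `threshold` of the shorter span, keep the higher-scoring one."""
    # Visit highest scores first so kept entities suppress weaker duplicates
    ordered = sorted(entities, key=lambda e: e["score"], reverse=True)
    kept = []
    for cand in ordered:
        for existing in kept:
            overlap = min(cand["end"], existing["end"]) - max(cand["start"], existing["start"])
            shorter = min(cand["end"] - cand["start"], existing["end"] - existing["start"])
            # Guard shorter > 0: zero-length entities never count as duplicates
            if shorter > 0 and overlap / shorter > threshold:
                break  # duplicate of a higher-scoring entity
        else:
            kept.append(cand)
    return kept
```

Checking it against the examples above: identical spans give ratio 1.0 (deduplicated), [10:20] vs [15:25] gives exactly 0.5 (kept as distinct, since the threshold is strict), and disjoint spans give a negative overlap (kept).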

Decision 5: Architecture Pattern = Strategy Pattern

Why Strategy Pattern?

BaseTextChunker (interface)
    ├── LocalTextChunker (current implementation)
    └── Future: RemoteChunker, LangChainChunker, SemanticChunker

Benefits:

  • Open/Closed Principle: New strategies without modifying existing code
  • Testability: Easy to mock chunker in tests
  • Flexibility: Users can plug in custom chunkers
  • Future-proof: Supports evolution of chunking approaches

Trade-off:

  • More code upfront (abstract base class)
  • ✅ Worth it: Enables extensibility without breaking changes
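The shape of the hierarchy, roughly (method and attribute names are assumptions for illustration, reusing the chunk_text sketch above; see the merged code for the actual interface):

```python
from abc import ABC, abstractmethod

class BaseTextChunker(ABC):
    """Strategy interface: any chunking approach plugs in here."""

    def __init__(self, chunk_size: int = 250, chunk_overlap: int = 50):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    @abstractmethod
    def chunk(self, text: str) -> list:
        """Split `text` into chunks."""

class LocalTextChunker(BaseTextChunker):
    """Character-based chunker with word-boundary extension."""

    def chunk(self, text: str) -> list:
        return [c for c, _ in chunk_text(text, self.chunk_size, self.chunk_overlap)]
```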

Decision 6: Utility Functions - Simple Over Generic

Approach: Utilities designed for character-based chunking with clear assumptions

predict_with_chunking(
    text=text,
    predict_func=any_prediction_function,  # Model-agnostic
    chunker=any_chunker_implementation      # Character-based expected
)

Key Assumptions:

  • Chunker has chunk_size (characters)
  • Chunker has chunk_overlap (characters)
  • Short-circuit check: len(text) <= chunker.chunk_size

Why Simple Design?

  • YAGNI Principle: Token-based chunking was rejected, no need to optimize for it
  • Clarity: Code is immediately understandable
  • Maintainability: Fewer abstractions = easier debugging
  • Strategy Pattern Already Provides Extensibility: New chunker types can be added

Alternative Considered: More Generic Interface

# Rejected approach
chunker.should_chunk(text)  # Instead of len() check
chunker.get_overlap_size()  # Dynamic overlap

Why Rejected:

  • ❌ Premature abstraction for unused feature
  • ❌ More complex interface
  • ❌ Harder to understand
  • ✅ Can refactor later if token-based is actually needed (unlikely)

Trade-off Accepted:

  • Future chunker types must work with character-based assumptions OR
  • Refactor utils if genuinely needed (straightforward change)

🏗️ Implementation Decisions

Offset Calculation

Critical Decision: How to map chunk entity positions to original text?

Approach Selected:

offset += len(chunk) - chunk_overlap

Why len(chunk) not chunk_size?

  • Word boundary extension creates variable-length chunks
  • Using actual length ensures accurate position mapping
  • Example: chunk_size=250, actual=273 (word boundary)

Validation:

  • Tested with word boundaries: ✅ Positions correct
  • Tested with CJK text (no spaces): ✅ Works
  • Tested with special characters: ✅ Accurate
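Tying offsets, chunking, and deduplication together, a hedged sketch of the whole flow (reusing the chunk_text and deduplicate sketches above; the merged util may differ in detail):

```python
def predict_with_chunking(text, predict_func, chunker):
    """Run predict_func per chunk and map entity positions back to the
    original text before deduplicating."""
    if len(text) <= chunker.chunk_size:
        return predict_func(text)  # short-circuit: no chunking needed

    all_entities = []
    offset = 0
    for chunk in chunker.chunk(text):
        for pred in predict_func(chunk):
            # Shift chunk-local positions into original-text coordinates
            all_entities.append({**pred,
                                 "start": pred["start"] + offset,
                                 "end": pred["end"] + offset})
        # Actual chunk length, not chunk_size: word-boundary extension
        # makes chunks variable-length
        offset += len(chunk) - chunker.chunk_overlap
    return deduplicate(all_entities)
```

Since each chunk starts exactly `chunk_overlap` characters before the previous chunk's actual end, accumulating `len(chunk) - chunk_overlap` reproduces every chunk's true start position.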

Error Handling Philosophy

Decision: Trust upstream components (GLiNER), minimal defensive coding

Rationale from Code Review:

  • 23 potential issues identified
  • 22 were false alarms (defensive programming paranoia)
  • 1 real bug (parameter redundancy - fixed)

Approach:

  • ✅ Trust GLiNER to return valid predictions
  • ✅ Trust Python to handle edge cases (empty strings, etc.)
  • ❌ Avoid unnecessary validation code
  • ❌ Don't add error handling "just in case"

Example - Rejected Defensive Code:

# Considered but rejected
if pred["end"] > len(text):
    logger.warning("Entity beyond text")
    continue

Why rejected? GLiNER never returns invalid positions. Adding checks adds complexity for zero benefit.


📊 Trade-offs Summary

Performance vs Accuracy

Decision: Prioritize accuracy over raw speed

| Aspect | Choice | Rationale |
| --- | --- | --- |
| Overlap | 20% duplication | Catches boundary entities |
| Word boundaries | Variable chunk size | Better entity detection |
| Deduplication | O(n²) algorithm | Simple and correct |

Performance Result:

  • 1,000 entities: 12,824/sec (fast enough)
  • Typical doc: 10-200 entities, < 0.1s overhead
  • ✅ No optimization needed yet

Simplicity vs Flexibility

Decision: Strategy pattern for future extensibility

Trade-off:

  • More code upfront (base class + concrete)
  • Benefit: Easy to add new chunking strategies
  • Verdict: ✅ Worth it - prevents future breaking changes

Character-based vs Token-based

Decision: Character-based for simplicity

Trade-off:

  • Less precise token control
  • Benefit: Model-agnostic, no tokenizer overhead
  • Mitigation: Large safety margin (57% buffer)
  • Verdict: ✅ Simplicity wins

🧪 Validation Approach

Testing Strategy

Test Coverage:

  • 27 unit tests
  • Edge cases: CJK text, newlines, empty strings, zero-length entities
  • Real-world scenarios: Long documents, overlapping entities

Key Test Insights:

  1. Word boundaries don't cause infinite loops (verified with CJK)
  2. Offset calculation handles variable chunks (verified with entity positions)
  3. Deduplication handles zero-length entities (edge case discovered)
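A flavor of these tests, as hedged sketches built on the illustrative helpers above (the real suite's names and fixtures differ):

```python
def fake_predict(chunk: str):
    # Hypothetical stand-in for GLiNER: flag every occurrence of "Alice"
    return [{"start": i, "end": i + 5, "score": 0.9, "label": "PERSON"}
            for i in range(len(chunk)) if chunk.startswith("Alice", i)]

def test_offsets_map_back_to_original_text():
    text = "Alice met Bob in Paris. " * 40   # ~960 chars, several chunks
    chunker = LocalTextChunker(chunk_size=250, chunk_overlap=50)
    for ent in predict_with_chunking(text, fake_predict, chunker):
        # Mapped spans must slice the original text to the detected entity
        assert text[ent["start"]:ent["end"]] == "Alice"

def test_word_boundary_loop_terminates_on_cjk():
    # CJK text has no spaces; the boundary-extension loop must still finish
    chunks = list(chunk_text("安全隐私" * 200, chunk_size=250, chunk_overlap=50))
    assert chunks
```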

Chunking Approaches Comparison

| Aspect | Current (Overlap) | LangChain Pattern | LangChain Dependency |
| --- | --- | --- | --- |
| Dependencies | None | None | langchain-text-splitters (~2MB+) |
| Overlap | ✅ Yes | ❌ No | ✅ Yes |
| Deduplication | ✅ Required | ❌ Not needed | If overlap enabled |
| Separator logic | Word boundary | `\n\n` → `\n` → `" "` → `""` | `\n\n` → `\n` → `" "` → `""` |
| Boundary accuracy | Higher | Lower | Configurable |
| Maintenance risk | Low | Low | Medium (external) |

Why Overlap + Deduplication?

Without overlap: Entity "Dr. John Smith" split across chunks → missed

With overlap: Same entity detected in multiple chunks → needs deduplication

🚀 Future Considerations

Immediate Monitoring Needs

  1. Chunk count distribution: How many chunks per document?
  2. Deduplication rate: How many duplicates removed?
  3. Latency impact: Overhead from chunking?

Potential Enhancements

Enhancement 1: Parallel Chunk Processing

When: If latency becomes an issue (>1s per document)
Approach: Process chunks concurrently
Expected gain: 2-3x speedup for large documents
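If this were pursued, one minimal shape could be the following sketch (assumes predict_func releases the GIL during model inference, as PyTorch ops typically do; otherwise a process pool would be the fallback):

```python
from concurrent.futures import ThreadPoolExecutor

def predict_chunks_parallel(chunks, predict_func, max_workers=4):
    """Run predictions for all chunks concurrently. `map` preserves input
    order, so downstream offset mapping is unaffected."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(predict_func, chunks))
```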

Enhancement 2: Adaptive Chunk Size

When: If we see frequent boundary misses
Approach: Adjust chunk size based on entity density
Trade-off: Added complexity vs marginal gain

Enhancement 3: Alternative Chunking Strategies

When: User needs semantic chunking
Approach: Implement via Strategy pattern

SemanticChunker()  # Preserves paragraphs
SentenceChunker()  # Natural sentence boundaries  
LangChainChunker() # Integration with LangChain

Enhancement 4: Deduplication Optimization

When: Typical document has >1000 entities
Approach: Spatial indexing (O(n log n))
Current: Not needed - O(n²) is fast enough


@RonShakutai RonShakutai self-requested a review December 3, 2025 10:29
@RonShakutai (Collaborator) left a comment

This PR would be a great addition to Presidio's capabilities, and could probably be used in other use cases as well. Left a few comments.

@RonShakutai RonShakutai requested a review from omri374 December 3, 2025 11:53
@omri374 (Collaborator) left a comment

Thanks! This is great! Left a few comments, but I feel this is really close to being ready.

@RonShakutai (Collaborator) commented

Hi :) @jedheaj314
Are you still planning to complete this PR?

@jedheaj314 (Contributor, Author) replied

> Hi :) @jedheaj314 Are you still planning to complete this PR?

Yes, working through the comments now

@jedheaj314 jedheaj314 closed this Jan 6, 2026
@jedheaj314 jedheaj314 reopened this Jan 6, 2026
@omri374 (Collaborator) commented Jan 15, 2026

Thanks @jedheaj314! This is shaping up quite nicely. I added a few comments on potential side effects of this on Presidio in general to think about, but overall the logic and architecture are solid.

@tamirkamara tamirkamara changed the title from "1569 Fix gliner truncates text" to "Fix gliner truncates text" Jan 19, 2026
@microsoft microsoft deleted a comment from SharonHart Jan 19, 2026
@microsoft microsoft deleted a comment from SharonHart Jan 19, 2026
omri374 previously approved these changes Jan 22, 2026

@omri374 (Collaborator) left a comment

Thanks @jedheaj314! This is a great addition, and thanks for your patience through the numerous rounds of feedback. I left a small comment on the e2e tests, but overall this is approved :)

@jedheaj314 jedheaj314 force-pushed the jedheaj314/1569-fix-gliner-truncates-text branch from ce806c2 to 60816f3 on January 22, 2026 17:32
@RonShakutai RonShakutai removed their request for review January 26, 2026 16:44
@SharonHart SharonHart merged commit 5d92cf8 into microsoft:main Jan 27, 2026
86 of 88 checks passed