Conversation

@jedheaj314 (Contributor) commented Dec 2, 2025

Change Description

Enables the GLiNER recognizer to be constructed with a chunker and to apply character-based text chunking when `analyze` is called on long text.

  • Updated the GLiNER recognizer to call the predict function after chunking long text
  • New base and character-based chunkers
  • New chunking util for deduplication and offset calculations
  • Unit tests

Issue reference

Fixes #1569

Checklist

  • I have reviewed the contribution guidelines
  • I have signed the CLA (if required)
  • My code includes unit tests
  • All unit tests and lint checks pass locally
  • My PR contains documentation updates / additions if required

Architectural/Technical Decisions:

🎯 Problem Statement

Issue #1569: the GLiNER checkpoint has a maximum sequence length of 384 tokens. Documents exceeding this limit are truncated, the only signal being a warning: "UserWarning: Sentence of length 20415 has been truncated to 384".

Impact:

  • Security: PII entities beyond 384 tokens are not detected
  • Compliance: Incomplete data protection scanning (GDPR, HIPAA violations)
  • Reliability: Near-silent failure mode; the only signal is an easily missed warning

Root Cause: GLiNER truncates input text to its fixed 384-token context window before detecting PII entities.


🔍 Approach Evaluation

Option 1: Increase Model Context Window

Description: Retrain or use larger GLiNER model with higher token limit
Decision: Rejected - Not feasible; the GLiNER architecture is fixed


Option 2: Token-Based Chunking

Description: Use GLiNER's tokenizer to chunk at exact token boundaries

| Pros | Cons |
| --- | --- |
| Precise token count control | Requires tokenizer loading |
| Guarantees < 384 tokens | Model-specific implementation |
| Optimal model utilization | Complex offset calculation |
| | Tokenizer adds latency |
| | Ties implementation to model architecture |

Decision: Rejected - Complexity outweighs benefits


Option 3: Character-Based Chunking ✅

Description: Split text by character count with configurable overlap

| Pros | Cons |
| --- | --- |
| Simple implementation | Approximate token count |
| Model-agnostic | Needs safety margin |
| Fast (no tokenization) | May underutilize context window |
| Easy to configure | Requires deduplication |
| Works with any NER model | |

Decision: SELECTED - Best balance of simplicity and effectiveness

Rationale:

  • GLiNER averages ~1.5 chars/token
  • 250 chars = ~166 tokens (safe margin under 384)
  • Simplicity enables maintainability
  • Works with future models without changes

Option 4: Sentence-Based Chunking

Description: Split on sentence boundaries using NLP

| Pros | Cons |
| --- | --- |
| Natural language boundaries | Requires sentence tokenizer |
| Preserves context | Variable chunk sizes |
| Better semantic coherence | Long sentences problematic |
| | Adds dependency |

Decision: Rejected - Unnecessary complexity for current needs

Future Consideration: Could implement as alternative strategy later


🎛️ Key Design Decisions

Decision 1: Chunk Size = 250 characters

Options Considered:

  • 100 chars: Too conservative, many chunks, slow
  • 250 chars: ✅ Selected - ~166 tokens, safe margin
  • 500 chars: Too risky, may exceed 384 token limit

Analysis:

Token estimation: tokens ≈ chars / 1.5
250 / 1.5 ≈ 166 tokens (< 384 ✓)
Safety margin: 384 - 166 = 218 tokens = 57% buffer

Why 250?

  • Proven ratio from gliner-spacy reference implementation
  • Sufficient context for entity recognition
  • Safe margin for token variation (special chars, unicode)
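To make the arithmetic concrete, a minimal sketch (the helper name and the ~1.5 chars/token ratio are assumptions taken from the analysis above, not part of the merged code):

```python
def estimate_tokens(text: str, chars_per_token: float = 1.5) -> int:
    """Conservative token estimate using the ~1.5 chars/token ratio
    observed for the GLiNER checkpoint."""
    return int(len(text) / chars_per_token)

# A 250-char chunk stays well under the 384-token window
assert estimate_tokens("x" * 250) <= 384  # ~166 tokens
```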

Decision 2: Overlap = 50 characters (20%)

Problem: Entities at chunk boundaries might be split

Options Considered:

  • 0% overlap: Fast but misses boundary entities ❌
  • 10% (25 chars): ❌ Too small, may still split entities
  • 20% (50 chars): ✅ Selected - catches boundary entities
  • 50% (125 chars): ❌ Excessive duplication, slow

Why 20%?

  • Average entity length: 10-30 characters
  • 50 char overlap covers typical entity spans
  • Balances coverage vs redundant processing
  • Standard practice in NLP chunking
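A quick sanity check of the guarantee this buys: any entity shorter than the overlap that straddles a chunk boundary is fully contained in the following chunk (illustrative numbers only):

```python
chunk_size, overlap = 250, 50
boundary = chunk_size                    # first chunk ends here (pre-extension)
entity = (boundary - 10, boundary + 10)  # a 20-char entity straddling the boundary

# The next chunk restarts `overlap` chars before the boundary
next_chunk = (boundary - overlap, boundary - overlap + chunk_size)
assert next_chunk[0] <= entity[0] and entity[1] <= next_chunk[1]
```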

Decision 3: Word Boundary Preservation

Problem: Mid-word breaks confuse NER models

# Extends chunk to nearest word boundary
while end < len(text) and text[end] not in [" ", "\n"]:
    end += 1

Options Considered:

| Approach | Pros | Cons | Decision |
| --- | --- | --- | --- |
| Hard cutoff | Simple | Breaks words | ❌ |
| Word boundary | Preserves context | Variable chunk size | ✅ Selected |
| Sentence boundary | Best context | Too complex | ❌ |

Trade-off Accepted:

  • Chunks may exceed 250 chars slightly
  • Still well under token limit (tested: max ~280 chars)
  • Better entity detection outweighs strict size limit
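Putting Decisions 1-3 together, a minimal sketch of the character-based splitter (the function name and return shape are illustrative, not the merged API; assumes chunk_size > chunk_overlap):

```python
def chunk_text(text: str, chunk_size: int = 250, chunk_overlap: int = 50):
    """Split text into overlapping chunks, extending each chunk to the
    nearest word boundary. Yields (chunk, start_offset) pairs."""
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        # Extend to the nearest word boundary; terminates because `end`
        # is bounded by len(text)
        while end < len(text) and text[end] not in (" ", "\n"):
            end += 1
        yield text[start:end], start
        if end >= len(text):
            break
        start = end - chunk_overlap  # step back to create the overlap
```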

Decision 4: Deduplication Strategy

Problem: Overlapping chunks produce duplicate entities

Approach: Score-based deduplication with overlap threshold

overlap_ratio = overlap_length / min(entity1_length, entity2_length)
if overlap_ratio > 0.5:  # 50% threshold
    keep_highest_score_entity()

Why 50% threshold?

  • Catches true duplicates: [10:20] and [10:20] → 100% overlap
  • Allows partial overlaps: [10:20] and [15:25] → 50% overlap
  • Avoids false positives: [10:20] and [30:40] → 0% overlap

Alternative Considered:

  • Exact match only (start==start, end==end)
    • ❌ Misses duplicates with slight offset differences
    • Models may return [10:20] in one chunk, [11:20] in another
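As a runnable sketch of the pseudocode above (entity dicts with start/end/score keys are an assumption; the merged util's signature may differ):

```python
def deduplicate(entities: list, threshold: float = 0.5) -> list:
    """O(n^2) score-based deduplication: among spans overlapping by more
    than `threshold` of the shorter span, keep the higher-scoring one."""
    # Visit highest scores first so kept entities suppress weaker duplicates
    ordered = sorted(entities, key=lambda e: e["score"], reverse=True)
    kept = []
    for cand in ordered:
        for existing in kept:
            overlap = min(cand["end"], existing["end"]) - max(cand["start"], existing["start"])
            shorter = min(cand["end"] - cand["start"], existing["end"] - existing["start"])
            # Guard shorter > 0: zero-length entities never count as duplicates
            if shorter > 0 and overlap / shorter > threshold:
                break  # duplicate of a higher-scoring entity
        else:
            kept.append(cand)
    return kept
```

Checking it against the examples above: identical spans give ratio 1.0 (deduplicated), [10:20] vs [15:25] gives exactly 0.5 (kept as distinct, since the threshold is strict), and disjoint spans give a negative overlap (kept).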

Decision 5: Architecture Pattern = Strategy Pattern

Why Strategy Pattern?

BaseTextChunker (interface)
    ├── LocalTextChunker (current implementation)
    └── Future: RemoteChunker, LangChainChunker, SemanticChunker

Benefits:

  • Open/Closed Principle: New strategies without modifying existing code
  • Testability: Easy to mock chunker in tests
  • Flexibility: Users can plug in custom chunkers
  • Future-proof: Supports evolution of chunking approaches

Trade-off:

  • More code upfront (abstract base class)
  • ✅ Worth it: Enables extensibility without breaking changes
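The shape of the hierarchy, roughly (method and attribute names are assumptions for illustration, reusing the chunk_text sketch above; see the merged code for the actual interface):

```python
from abc import ABC, abstractmethod

class BaseTextChunker(ABC):
    """Strategy interface: any chunking approach plugs in here."""

    def __init__(self, chunk_size: int = 250, chunk_overlap: int = 50):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    @abstractmethod
    def chunk(self, text: str) -> list:
        """Split `text` into chunks."""

class LocalTextChunker(BaseTextChunker):
    """Character-based chunker with word-boundary extension."""

    def chunk(self, text: str) -> list:
        return [c for c, _ in chunk_text(text, self.chunk_size, self.chunk_overlap)]
```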

Decision 6: Utility Functions - Simple Over Generic

Approach: Utilities designed for character-based chunking with clear assumptions

predict_with_chunking(
    text=text,
    predict_func=any_prediction_function,  # Model-agnostic
    chunker=any_chunker_implementation      # Character-based expected
)

Key Assumptions:

  • Chunker has chunk_size (characters)
  • Chunker has chunk_overlap (characters)
  • Short-circuit check: len(text) <= chunker.chunk_size

Why Simple Design?

  • YAGNI Principle: Token-based chunking was rejected, no need to optimize for it
  • Clarity: Code is immediately understandable
  • Maintainability: Fewer abstractions = easier debugging
  • Strategy Pattern Already Provides Extensibility: New chunker types can be added

Alternative Considered: More Generic Interface

# Rejected approach
chunker.should_chunk(text)  # Instead of len() check
chunker.get_overlap_size()  # Dynamic overlap

Why Rejected:

  • ❌ Premature abstraction for unused feature
  • ❌ More complex interface
  • ❌ Harder to understand
  • ✅ Can refactor later if token-based is actually needed (unlikely)

Trade-off Accepted:

  • Future chunker types must work with character-based assumptions OR
  • Refactor utils if genuinely needed (straightforward change)

🏗️ Implementation Decisions

Offset Calculation

Critical Decision: How to map chunk entity positions to original text?

Approach Selected:

offset += len(chunk) - chunk_overlap

Why len(chunk) not chunk_size?

  • Word boundary extension creates variable-length chunks
  • Using actual length ensures accurate position mapping
  • Example: chunk_size=250, actual=273 (word boundary)

Validation:

  • Tested with word boundaries: ✅ Positions correct
  • Tested with CJK text (no spaces): ✅ Works
  • Tested with special characters: ✅ Accurate
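Tying offsets, chunking, and deduplication together, a hedged sketch of the whole flow (reusing the chunk_text and deduplicate sketches above; the merged util may differ in detail):

```python
def predict_with_chunking(text, predict_func, chunker):
    """Run predict_func per chunk and map entity positions back to the
    original text before deduplicating."""
    if len(text) <= chunker.chunk_size:
        return predict_func(text)  # short-circuit: no chunking needed

    all_entities = []
    offset = 0
    for chunk in chunker.chunk(text):
        for pred in predict_func(chunk):
            # Shift chunk-local positions into original-text coordinates
            all_entities.append({**pred,
                                 "start": pred["start"] + offset,
                                 "end": pred["end"] + offset})
        # Actual chunk length, not chunk_size: word-boundary extension
        # makes chunks variable-length
        offset += len(chunk) - chunker.chunk_overlap
    return deduplicate(all_entities)
```

Since each chunk starts exactly `chunk_overlap` characters before the previous chunk's actual end, accumulating `len(chunk) - chunk_overlap` reproduces every chunk's true start position.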

Error Handling Philosophy

Decision: Trust upstream components (GLiNER), minimal defensive coding

Rationale from Code Review:

  • 23 potential issues identified
  • 22 were false alarms (defensive programming paranoia)
  • 1 real bug (parameter redundancy - fixed)

Approach:

  • ✅ Trust GLiNER to return valid predictions
  • ✅ Trust Python to handle edge cases (empty strings, etc.)
  • ❌ Avoid unnecessary validation code
  • ❌ Don't add error handling "just in case"

Example - Rejected Defensive Code:

# Considered but rejected
if pred["end"] > len(text):
    logger.warning("Entity beyond text")
    continue

Why rejected? GLiNER never returns invalid positions. Adding checks adds complexity for zero benefit.


📊 Trade-offs Summary

Performance vs Accuracy

Decision: Prioritize accuracy over raw speed

| Aspect | Choice | Rationale |
| --- | --- | --- |
| Overlap | 20% duplication | Catches boundary entities |
| Word boundaries | Variable chunk size | Better entity detection |
| Deduplication | O(n²) algorithm | Simple and correct |

Performance Result:

  • 1,000 entities: 12,824/sec (fast enough)
  • Typical doc: 10-200 entities, < 0.1s overhead
  • ✅ No optimization needed yet

Simplicity vs Flexibility

Decision: Strategy pattern for future extensibility

Trade-off:

  • More code upfront (base class + concrete)
  • Benefit: Easy to add new chunking strategies
  • Verdict: ✅ Worth it - prevents future breaking changes

Character-based vs Token-based

Decision: Character-based for simplicity

Trade-off:

  • Less precise token control
  • Benefit: Model-agnostic, no tokenizer overhead
  • Mitigation: Large safety margin (57% buffer)
  • Verdict: ✅ Simplicity wins

🧪 Validation Approach

Testing Strategy

Test Coverage:

  • 27 unit tests
  • Edge cases: CJK text, newlines, empty strings, zero-length entities
  • Real-world scenarios: Long documents, overlapping entities

Key Test Insights:

  1. Word boundaries don't cause infinite loops (verified with CJK)
  2. Offset calculation handles variable chunks (verified with entity positions)
  3. Deduplication handles zero-length entities (edge case discovered)
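A flavor of these tests, as hedged sketches built on the illustrative helpers above (the real suite's names and fixtures differ):

```python
def fake_predict(chunk: str):
    # Hypothetical stand-in for GLiNER: flag every occurrence of "Alice"
    return [{"start": i, "end": i + 5, "score": 0.9, "label": "PERSON"}
            for i in range(len(chunk)) if chunk.startswith("Alice", i)]

def test_offsets_map_back_to_original_text():
    text = "Alice met Bob in Paris. " * 40   # ~960 chars, several chunks
    chunker = LocalTextChunker(chunk_size=250, chunk_overlap=50)
    for ent in predict_with_chunking(text, fake_predict, chunker):
        # Mapped spans must slice the original text to the detected entity
        assert text[ent["start"]:ent["end"]] == "Alice"

def test_word_boundary_loop_terminates_on_cjk():
    # CJK text has no spaces; the boundary-extension loop must still finish
    chunks = list(chunk_text("安全隐私" * 200, chunk_size=250, chunk_overlap=50))
    assert chunks
```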

Chunking Approaches Comparison

| Aspect | Current (Overlap) | LangChain Pattern | LangChain Dependency |
| --- | --- | --- | --- |
| Dependencies | None | None | langchain-text-splitters (~2MB+) |
| Overlap | ✅ Yes | ❌ No | ✅ Yes |
| Deduplication | ✅ Required | ❌ Not needed | If overlap enabled |
| Separator logic | Word boundary | `\n\n` → `\n` → `" "` → `""` | `\n\n` → `\n` → `" "` → `""` |
| Boundary accuracy | Higher | Lower | Configurable |
| Maintenance risk | Low | Low | Medium (external) |

Why Overlap + Deduplication?

Without overlap: Entity "Dr. John Smith" split across chunks → missed

With overlap: Same entity detected in multiple chunks → needs deduplication

🚀 Future Considerations

Immediate Monitoring Needs

  1. Chunk count distribution: How many chunks per document?
  2. Deduplication rate: How many duplicates removed?
  3. Latency impact: Overhead from chunking?

Potential Enhancements

Enhancement 1: Parallel Chunk Processing

When: If latency becomes an issue (>1s per document)
Approach: Process chunks concurrently
Expected gain: 2-3x speedup for large documents
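If this were pursued, one minimal shape could be the following sketch (assumes predict_func releases the GIL during model inference, as PyTorch ops typically do; otherwise a process pool would be the fallback):

```python
from concurrent.futures import ThreadPoolExecutor

def predict_chunks_parallel(chunks, predict_func, max_workers=4):
    """Run predictions for all chunks concurrently. `map` preserves input
    order, so downstream offset mapping is unaffected."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(predict_func, chunks))
```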

Enhancement 2: Adaptive Chunk Size

When: If we see frequent boundary misses
Approach: Adjust chunk size based on entity density
Trade-off: Added complexity vs marginal gain

Enhancement 3: Alternative Chunking Strategies

When: User needs semantic chunking
Approach: Implement via Strategy pattern

SemanticChunker()  # Preserves paragraphs
SentenceChunker()  # Natural sentence boundaries  
LangChainChunker() # Integration with LangChain

Enhancement 4: Deduplication Optimization

When: Typical document has >1000 entities
Approach: Spatial indexing (O(n log n))
Current: Not needed - O(n²) is fast enough


@RonShakutai RonShakutai self-requested a review December 3, 2025 10:29
@RonShakutai (Collaborator) left a comment

This PR would be a great addition to Presidio's capabilities, and could probably be used in other use cases as well. Left a few comments.

@RonShakutai RonShakutai requested a review from omri374 December 3, 2025 11:53
@omri374 (Collaborator) left a comment

Thanks! This is great! Left a few comments, but I feel this is really close to being ready.

@RonShakutai (Collaborator) commented

Hi :) @jedheaj314
Are you still planning to complete this PR?

@jedheaj314 (Contributor, Author) replied

> Hi :) @jedheaj314 Are you still planning to complete this PR?

Yes, working through the comments now

@jedheaj314 jedheaj314 closed this Jan 6, 2026
@jedheaj314 jedheaj314 reopened this Jan 6, 2026
@omri374 (Collaborator) commented Jan 15, 2026

Thanks @jedheaj314! This is shaping up quite nicely. I added a few comments on potential side effects of this on Presidio in general to think about, but overall the logic and architecture are solid.

@tamirkamara tamirkamara changed the title from "1569 Fix gliner truncates text" to "Fix gliner truncates text" Jan 19, 2026
@microsoft microsoft deleted a comment from SharonHart Jan 19, 2026
@microsoft microsoft deleted a comment from SharonHart Jan 19, 2026
omri374 previously approved these changes Jan 22, 2026

@omri374 (Collaborator) left a comment

Thanks @jedheaj314! This is a great addition, and thanks for your patience through the numerous rounds of feedback. I left a small comment on the e2e tests, but overall this is approved :)

@jedheaj314 jedheaj314 force-pushed the jedheaj314/1569-fix-gliner-truncates-text branch from ce806c2 to 60816f3 on January 22, 2026 17:32
@RonShakutai RonShakutai removed their request for review January 26, 2026 16:44
@SharonHart SharonHart merged commit 5d92cf8 into microsoft:main Jan 27, 2026
86 of 88 checks passed