Fix gliner truncates text #1805
Conversation
RonShakutai
left a comment
This PR would be a great addition to Presidio's capabilities, and would probably be useful in other use cases as well. Left a few comments.
presidio-analyzer/presidio_analyzer/chunkers/character_based_text_chunker.py
omri374
left a comment
Thanks! This is great! Left a few comments, but I feel this is really close to being ready.
presidio-analyzer/presidio_analyzer/predefined_recognizers/ner/gliner_recognizer.py
Hi :) @jedheaj314
Yes, working through the comments now
Thanks @jedheaj314! This is shaping up quite nicely. I added a few comments on potential side effects of this on Presidio in general to think about, but overall the logic and architecture are solid.
omri374
left a comment
Thanks @jedheaj314! This is a great addition, and thanks for your patience through the numerous rounds of feedback. I left a small comment on the e2e tests, but overall this is approved :)
Change Description
Adds the ability to construct the GLiNER recogniser with a chunker, so that character-based text chunking is applied when analyse is called.
Issue reference
Fixes #1569
Checklist
Architectural/Technical Decisions:
🎯 Problem Statement
Issue #1569: The GLiNER checkpoint has a maximum sequence length of 384 tokens. Documents exceeding this limit are silently truncated, with only a warning: "UserWarning: Sentence of length 20415 has been truncated to 384".
Impact:
Root Cause: GLiNER truncates text to a fixed context window of 384 tokens before detecting PII.
🔍 Approach Evaluation
Option 1: Increase Model Context Window
Description: Retrain or use larger GLiNER model with higher token limit
Decision: ❌ Rejected - Not feasible, GLiNER architecture is fixed
Option 2: Token-Based Chunking
Description: Use GLiNER's tokenizer to chunk at exact token boundaries
Decision: ❌ Rejected - Complexity outweighs benefits
Option 3: Character-Based Chunking ✅
Description: Split text by character count with configurable overlap
Decision: ✅ SELECTED - Best balance of simplicity and effectiveness
Rationale:
Option 4: Sentence-Based Chunking
Description: Split on sentence boundaries using NLP
Decision: ❌ Rejected - Unnecessary complexity for current needs
Future Consideration: Could implement as alternative strategy later
🎛️ Key Design Decisions
Decision 1: Chunk Size = 250 characters
Options Considered:
Analysis:
Why 250?
Decision 2: Overlap = 50 characters (20%)
Problem: Entities at chunk boundaries might be split
Options Considered:
Why 20%?
Decision 3: Word Boundary Preservation
Problem: Mid-word breaks confuse NER models
Options Considered:
Trade-off Accepted:
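A minimal sketch of how Decisions 1-3 (chunk size 250, overlap 50, word-boundary preservation) could combine. The function name and the exact boundary heuristic are illustrative, not necessarily the PR's implementation:

```python
def chunk_text(text: str, chunk_size: int = 250, chunk_overlap: int = 50) -> list[tuple[int, str]]:
    """Split text into overlapping chunks, preferring to break at whitespace.

    Returns (start_offset, chunk) pairs so that entity positions detected
    in a chunk can later be mapped back to the original text.
    """
    if len(text) <= chunk_size:
        return [(0, text)]
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Walk back to the last whitespace to avoid a mid-word break.
            boundary = text.rfind(" ", start, end)
            if boundary > start:
                end = boundary
        chunks.append((start, text[start:end]))
        if end == len(text):
            break
        # Step forward, retaining `chunk_overlap` characters of context.
        start = max(end - chunk_overlap, start + 1)
    return chunks
```

The trade-off mentioned above shows up here: walking back to whitespace means chunks can be slightly shorter than `chunk_size`, which the offset calculation must account for.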
Decision 4: Deduplication Strategy
Problem: Overlapping chunks produce duplicate entities
Approach: Score-based deduplication with overlap threshold
Why 50% threshold?
Alternative Considered:
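The score-based deduplication could be sketched as follows; the `Entity` dataclass stands in for Presidio's result objects, and the pairwise O(n²) shape matches the trade-off accepted in Decision 6 (this is an assumption about the PR's exact logic):

```python
from dataclasses import dataclass

@dataclass
class Entity:
    start: int
    end: int
    label: str
    score: float

def deduplicate(entities: list[Entity], overlap_threshold: float = 0.5) -> list[Entity]:
    """Keep the highest-scoring entity among groups whose spans overlap by
    more than `overlap_threshold` of the shorter span (O(n^2) pairwise check)."""
    kept: list[Entity] = []
    for cand in sorted(entities, key=lambda e: e.score, reverse=True):
        duplicate = False
        for existing in kept:
            overlap = min(cand.end, existing.end) - max(cand.start, existing.start)
            shorter = min(cand.end - cand.start, existing.end - existing.start)
            if shorter > 0 and overlap / shorter > overlap_threshold:
                duplicate = True
                break
        if not duplicate:
            kept.append(cand)
    return kept
```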
Decision 5: Architecture Pattern = Strategy Pattern
Why Strategy Pattern?
Benefits:
Trade-off:
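The Strategy pattern could look like the sketch below. `CharacterBasedTextChunker` matches the file name in this PR, but the interface details are assumptions; the chunking body here is deliberately naive (no word-boundary handling) to keep the pattern itself in focus:

```python
from abc import ABC, abstractmethod

class TextChunker(ABC):
    """Strategy interface: the recognizer depends only on this contract,
    so sentence- or token-based chunkers can be swapped in later."""

    @abstractmethod
    def chunk(self, text: str) -> list[tuple[int, str]]:
        """Return (start_offset, chunk_text) pairs covering `text`."""

class CharacterBasedTextChunker(TextChunker):
    def __init__(self, chunk_size: int = 250, chunk_overlap: int = 50):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def chunk(self, text: str) -> list[tuple[int, str]]:
        if len(text) <= self.chunk_size:
            return [(0, text)]
        step = self.chunk_size - self.chunk_overlap
        return [(i, text[i : i + self.chunk_size]) for i in range(0, len(text), step)]
```

The recognizer only ever calls `chunker.chunk(text)`, so a future sentence-based strategy slots in without touching recognizer code.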
Decision 6: Utility Functions - Simple Over Generic
Approach: Utilities designed for character-based chunking with clear assumptions
Key Assumptions:
- `chunk_size` (characters)
- `chunk_overlap` (characters)
- `len(text) <= chunker.chunk_size`
Why Simple Design?
Alternative Considered: More Generic Interface
Why Rejected:
Trade-off Accepted:
🏗️ Implementation Decisions
Offset Calculation
Critical Decision: How to map chunk entity positions to original text?
Approach Selected:
Why `len(chunk)` not `chunk_size`?
Validation:
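The offset mapping itself is a single addition; a hedged sketch (helper name and sample offsets are illustrative). Using `len(chunk)` rather than `chunk_size` matters because a chunk shortened at a word boundary, or the final chunk, is smaller than `chunk_size`:

```python
def to_global_span(chunk_start: int, local_start: int, local_end: int) -> tuple[int, int]:
    """Map an entity span reported relative to a chunk back to offsets in
    the original text by adding the chunk's start offset."""
    return chunk_start + local_start, chunk_start + local_end

# Toy validation: the same substring is recovered from the original text.
text = "Call me at 555-0100. " * 20
g_start, g_end = to_global_span(21, 11, 19)  # local span (11, 19) in the chunk at offset 21
assert text[g_start:g_end] == "555-0100"
```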
Error Handling Philosophy
Decision: Trust upstream components (GLiNER), minimal defensive coding
Rationale from Code Review:
Approach:
Example - Rejected Defensive Code:
Why rejected? GLiNER never returns invalid positions. Adding checks adds complexity for zero benefit.
📊 Trade-offs Summary
Performance vs Accuracy
Decision: Prioritize accuracy over raw speed
Performance Result:
Simplicity vs Flexibility
Decision: Strategy pattern for future extensibility
Trade-off:
Character-based vs Token-based
Decision: Character-based for simplicity
Trade-off:
🧪 Validation Approach
Testing Strategy
Test Coverage:
Key Test Insights:
Chunking Approaches Comparison
- `langchain-text-splitters`: off-the-shelf, but adds a ~2MB+ dependency
- Custom character-based chunker: no new dependencies
Why Overlap + Deduplication?
Without overlap: Entity "Dr. John Smith" split across chunks → missed
With overlap: Same entity detected in multiple chunks → needs deduplication
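A toy demonstration of this boundary effect, under the PR's 250/50 parameters (the filler text and entity placement are contrived to straddle the first boundary):

```python
text = ("x " * 120) + "Dr. John Smith" + (" y" * 120)
chunk_size, overlap = 250, 50
entity = "Dr. John Smith"  # occupies offsets 240-254, straddling the 250 boundary

# Naive, non-overlapping chunks split the entity across two chunks:
naive = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
assert not any(entity in c for c in naive)

# Overlapping chunks (step = chunk_size - overlap) keep it intact in one
# chunk, at the cost of duplicates that deduplication must then resolve:
overlapped = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size - overlap)]
assert any(entity in c for c in overlapped)
```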
🚀 Future Considerations
Immediate Monitoring Needs
Potential Enhancements
Enhancement 1: Parallel Chunk Processing
When: If latency becomes an issue (>1s per document)
Approach: Process chunks concurrently
Expected gain: 2-3x speedup for large documents
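Since chunks are independent, this enhancement could be a thin wrapper; a sketch using the standard library, where `analyze_chunk` is a hypothetical stand-in for the per-chunk GLiNER call:

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_chunk(chunk: tuple[int, str]) -> list[str]:
    # Placeholder for the per-chunk GLiNER inference call;
    # here it just "detects" a label when a marker substring is present.
    start, text = chunk
    return ["PHONE"] if "555" in text else []

def analyze_parallel(chunks, max_workers: int = 4):
    # pool.map returns results in input order, so the downstream
    # offset bookkeeping and deduplication remain unchanged.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyze_chunk, chunks))
```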
Enhancement 2: Adaptive Chunk Size
When: If we see frequent boundary misses
Approach: Adjust chunk size based on entity density
Trade-off: Added complexity vs marginal gain
Enhancement 3: Alternative Chunking Strategies
When: User needs semantic chunking
Approach: Implement via Strategy pattern
Enhancement 4: Deduplication Optimization
When: Typical document has >1000 entities
Approach: Spatial indexing (O(n log n))
Current: Not needed - O(n²) is fast enough
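If this optimization were ever needed, a sort-then-sweep pass is one O(n log n) shape it could take; this is a simplified single-sweep variant (only comparing each span with the previously kept one), not a full spatial index:

```python
def deduplicate_sorted(spans: list[tuple[int, int, float]]) -> list[tuple[int, int, float]]:
    """O(n log n) alternative to the pairwise check: sort (start, end, score)
    spans by start, then a single sweep resolves each overlap as it appears."""
    out: list[tuple[int, int, float]] = []
    for start, end, score in sorted(spans):
        if out and start < out[-1][1]:   # overlaps the previously kept span
            if score > out[-1][2]:       # keep the higher-scoring of the two
                out[-1] = (start, end, score)
        else:
            out.append((start, end, score))
    return out
```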