Commit a0068d4
Feature/enhanced label cleaning (#47)
* Enhance span annotation handling and logging for zero-span sequences
* Add tests for processing sequences with zero spans and ensure they are skipped
* Implement XBarDictionary for managing hierarchical spans and enhance span annotator with dictionary integration
* Add hierarchical level classification for XBar labels in span annotator
* Refactor XBarDictionary and SpanAnnotatorPipeline for improved dictionary management and statistics tracking
* Enhance SpanAnnotatorPipeline to generate annotations from dictionary spans and add AnnotationValidator for validating annotations.jsonl files
* Refactor SpanAnnotatorPipeline and XBarDictionary by removing unused parameters and consolidating annotation generation logic
* Rename dictionary.jsonl to spans.jsonl in save and load methods for clarity
* Enhance SpanAnnotatorPipeline to build X-bar dictionaries and generate annotations.jsonl from working files
* Rename dictionary.jsonl to spans.jsonl in test assertions and update annotation analysis to reflect changes
* Add tests for XBar label cleaning functionality and enhance label validation in the pipeline
* Remove validate_annotations.py as it is no longer needed
* Enhance logging in SpanAnnotatorPipeline and XBarDictionary for better annotation tracking and validation. Add word span validation in XBarLabelMap to filter invalid spans.
* Enhance README and span_annotator documentation with advanced label cleaning system details, including comprehensive word span validation and intelligent logging improvements.
* Update multi-label test cases for improved label validation and structure preservation
1 parent 3a9c6ec commit a0068d4

File tree: 6 files changed (+244 / -22 lines)

README.md

Lines changed: 17 additions & 2 deletions
```diff
@@ -196,14 +196,19 @@ uv run -m x_spanformer.pipelines.span_annotator \
 This implements **production-grade agentic X-bar span annotation** with enhanced robustness, featuring:
 
 - **Enhanced JSON Parsing Robustness**: Handles truncated LLM responses, malformed JSON, and case-insensitive matching
+- **Advanced Label Cleaning System**: Comprehensive word span validation with pattern-based filtering
 - **Independent Boundary Prediction**: Generates training targets for start/end position classification using factorized linear heads
 - **X-bar Hierarchical Structure**: Domain-specific classifier extraction based on linguistic phrase structure theory
+- **Intelligent Logging**: Aggregated counts replace repetitive debug messages for cleaner output
+- **Word Span Validation**: Supports percentages, abbreviations, expressions, and complex patterns
 - **Position-wise Binary Classification**: Creates sigmoid-normalized boundary probabilities for BCE loss training
 - **Multi-label Span Support**: Handles overlapping spans at different hierarchical levels (word → phrase → clause)
 - **Production Validation**: Zero position errors across 1,703 spans in 56 sequences (August 2025)
 
 **Production Results (August 2025):**
-- **1,703 total spans** generated with 128.2% overlap ratio
+- **60,558 clean annotations** from 61,053 original spans (99.2% retention rate)
+- **495 invalid word spans** automatically filtered using pattern-based validation
+- **352 labels mapped** from invalid to valid categories with aggregated logging
 - **Zero validation errors** in position encoding and text extraction
 - **Perfect alignment** with factorized pointer network requirements
 - **Enhanced reliability** with automatic recovery from LLM response issues
@@ -222,13 +227,18 @@ data/annotations/
 
 **Key Features:**
 - **Enhanced JSON Parsing**: Robust handling of truncated LLM responses and malformed JSON
+- **Advanced Label Cleaning**: Comprehensive word span validation with pattern-based filtering
 - **Bidirectional Context**: Built on X-Spanformer's position-wise embedding architecture where each H[t] contains bidirectional contextual information
 - **Boundary Detection Training**: Generates binary targets for start/end position prediction (not span-level embeddings)
 - **Multi-label Support**: BCE loss handles overlapping spans at different hierarchical levels
 - **Production Validation**: Zero position or text extraction errors across all generated spans
-- **Improved Logging**: Concise sequence selection summaries replace verbose lists for better performance and readability
+- **Intelligent Logging**: Aggregated counts replace repetitive debug messages for cleaner output
 
 **Recent Enhancements (August 2025):**
+- **Advanced Label Cleaning**: Pattern-based word span validation supporting percentages, abbreviations, and expressions
+- **Intelligent Logging System**: Aggregated counts replace thousands of repetitive debug messages
+- **Enhanced Word Span Patterns**: Support for decimals ("3.14"), percentages ("2.7%"), abbreviations ("Dr."), expressions ("[83]", "(t)", "|s|")
+- **Production Cleaning Results**: 99.2% retention rate with 495 spans filtered and 352 labels mapped
 - **Logging Optimization**: `Selected 1000 sequences (1 to 1000) out of 1000 requested` instead of massive sequence lists
 - **Performance**: Reduced I/O overhead and log file size while maintaining essential debugging information
 - **Scalability**: Handles large sequence ranges without log bloat or memory issues
@@ -245,6 +255,7 @@ X-Spanformer includes comprehensive test coverage organized into focused categor
 - **`tests/pipelines/`** - Data processing pipeline tests
   - `test_pipelines_pdf2jsonl.py` - PDF→JSONL conversion with AI judging
   - `test_pipelines_jsonl2vocab.py` - Vocabulary induction (Section 3.1)
+  - `test_pipeline_span_annotator.py` - Span annotation pipeline tests with label cleaning validation
   - `test_pipelines_vocab2embedding.py` - Seed embeddings & span generation (Section 3.2)
   - `test_integration_vocab2embedding.py` - End-to-end integration validation
 
@@ -263,6 +274,10 @@ X-Spanformer includes comprehensive test coverage organized into focused categor
   - `test_span_annotator.py` - Span annotation pipeline tests
   - `test_e2e_ollama_client.py` - Ollama client integration
 
+- **`tests/xbar/`** - X-bar theory and label cleaning tests
+  - `test_xbar_map.py` - Label cleaning and word span validation tests
+  - `test_xbar_annotator.py` - X-bar annotation logic tests
+
 - **`tests/config/`** - Configuration system tests
   - `test_span_annotator_config.py` - Configuration loading with logging support
 
```

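The README hunks above describe position-wise binary boundary targets with sigmoid probabilities for BCE loss, shared by overlapping multi-label spans. A minimal sketch of how such targets could be built (`boundary_targets` is a hypothetical helper for illustration, not code from this commit), assuming inclusive span ends:

```python
from typing import List, Tuple

def boundary_targets(spans: List[Tuple[int, int]], seq_len: int) -> Tuple[List[float], List[float]]:
    """Build position-wise binary start/end targets for BCE loss.

    Each position t gets 1.0 if any span starts (or ends) there, else 0.0.
    Overlapping spans at different hierarchical levels simply share targets,
    which is what allows multi-label BCE training over the same sequence.
    """
    starts = [0.0] * seq_len
    ends = [0.0] * seq_len
    for start, end in spans:  # end index is inclusive
        starts[start] = 1.0
        ends[end] = 1.0
    return starts, ends

# Two overlapping spans: a word [2, 2] nested inside a phrase [2, 5]
starts, ends = boundary_targets([(2, 2), (2, 5)], seq_len=8)
```

Both spans contribute to position 2's start target, so no information is lost when word-level and phrase-level annotations coincide.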
tests/xbar/test_xbar_label_cleaning.py

Lines changed: 4 additions & 4 deletions
```diff
@@ -69,8 +69,8 @@ def test_clean_and_validate_labels_basic(self):
     def test_clean_and_validate_labels_multi_label(self):
         """Test handling of multi-label cases."""
         annotations = [
-            {"xbar_label": "noun, punctuation", "text": "word,"},
-            {"xbar_label": "verb, adverb", "text": "run quickly"},
+            {"xbar_label": "noun, punctuation", "text": "word"},  # Just the word without punctuation
+            {"xbar_label": "adjective, noun", "text": "quick-brown"},  # Valid identifier-style word
         ]
 
         cleaned, stats = XBarLabelMap.clean_and_validate_labels(annotations)
@@ -83,8 +83,8 @@ def test_clean_and_validate_labels_multi_label(self):
 
         # Check that they map to first valid component
         labels = [ann["xbar_label"] for ann in cleaned]
-        assert "noun" in labels  # From "noun, punctuation"
-        assert "verb" in labels  # From "verb, adverb"
+        assert "noun" in labels  # From "noun, punctuation"
+        assert "adjective" in labels  # From "adjective, noun"
 
     def test_clean_and_validate_labels_preserves_structure(self):
         """Test that cleaning preserves annotation structure."""
```

x_spanformer/pipelines/span_annotator.md

Lines changed: 61 additions & 7 deletions
````diff
@@ -8,8 +8,10 @@ The Unified Span Annotator Pipeline is a **production-ready, battle-tested** imp
 
 - **Three-turn conversation strategy**: Progressive analysis from word-level → phrase-level → clause-level
 - **Enhanced JSON parsing robustness**: Handles truncated LLM responses, malformed JSON, and case-insensitive text matching
+- **Advanced label cleaning system**: Comprehensive word span validation with pattern-based filtering
 - **Resume capability**: Automatically resumes from previous progress with gap detection
 - **Production-grade error handling**: Comprehensive span validation and position verification
+- **Intelligent logging system**: Aggregated counts instead of verbose repetitive messages
 - **Real-time telemetry**: Progress tracking with detailed span type statistics
 - **Multiple output formats**: Working files, consolidated results, and analysis reports
 - **Factorized pointer network ready**: Generates training data perfectly aligned with Section 3.3 architecture
@@ -21,25 +23,29 @@ The Unified Span Annotator Pipeline is a **production-ready, battle-tested** imp
 1. **SpanAnnotatorPipeline**: Main pipeline orchestrator with resume and gap detection
 2. **SpanAnnotatorSession**: Async session management with timeout controls
 3. **XBarAnnotator**: X-bar theory-based span extraction with enhanced JSON parsing
-4. **Output Management**: Working files, consolidation, metadata, and analysis reports
-5. **JSON Parsing Robustness**: Truncation detection, malformed JSON recovery, case-insensitive matching
+4. **XBarLabelMap**: Advanced label cleaning and word span validation system
+5. **Output Management**: Working files, consolidation, metadata, and analysis reports
+6. **JSON Parsing Robustness**: Truncation detection, malformed JSON recovery, case-insensitive matching
 
 ### Processing Flow
 
 ```
 Input (corpus.jsonl) → Load Sequences → Filter by Range →
 Process in Batches → Annotate with XBar → Enhanced JSON Parsing →
-Save Working Files → Position Validation → Consolidate Results
-Generate Metadata → Analysis Reports → Output
+Save Working Files → Position Validation → Label Cleaning & Word Span Validation
+Consolidate Results → Generate Metadata → Analysis Reports → Output
 ```
 
 ### Production Status (August 2025)
 
-**✅ PRODUCTION READY**: Successfully processing sequences with zero position errors
-- **1,703 total spans** generated across 56 sequences
-- **128.2% overlap ratio** supporting multi-label boundary prediction
+**✅ PRODUCTION READY**: Successfully processing sequences with zero position errors and advanced label cleaning
+- **60,558 clean annotations** from 61,053 original spans (99.2% retention rate)
+- **495 invalid word spans** automatically filtered using pattern-based validation
+- **352 labels mapped** from invalid to valid categories with aggregated logging
 - **Zero validation errors** in position encoding and text extraction
 - **Enhanced JSON robustness** handling truncated LLM responses
+- **Intelligent logging system** with count aggregation instead of repetitive debug messages
+- **Comprehensive word span validation** supporting percentages, abbreviations, and expressions
 - **Perfect alignment** with factorized pointer network requirements (Section 3.3)
 
 ## Usage
@@ -399,6 +405,54 @@ def _extract_text_boundaries(self, text: str, target_text: str) -> Optional[Tupl
 - **Automatic recovery** from truncated responses at sequence 40
 - **Enhanced reliability** for large-scale annotation tasks
 
+## Enhanced Label Cleaning System
+
+### Advanced Word Span Validation
+
+The pipeline includes a comprehensive label cleaning and word span validation system that ensures high-quality training data:
+
+#### Pattern-Based Word Span Filtering
+- **Spaces detection**: Automatically removes spans containing spaces (not word-level)
+- **Mixed character validation**: Filters invalid combinations of letters, numbers, and special characters
+- **Identifier patterns**: Allows valid programming identifiers (letters + underscores/hyphens)
+- **Number formats**: Supports integers, decimals, negative numbers, and percentages
+- **Abbreviations**: Allows words with periods (e.g., "Dr.", "U.S.", "etc.")
+- **Expressions**: Supports bracketed `[83]`, parenthetical `(t)`, and pipe `|s|` expressions
+- **Trailing punctuation**: Allows words ending with colons ("words:")
+
+#### Label Mapping System
+- **Intelligent mapping**: Converts invalid labels to valid X-bar categories
+- **Aggregated logging**: Shows mapping counts instead of repetitive debug messages
+- **Statistical reporting**: Provides comprehensive cleaning statistics
+- **Zero data loss**: Maps rather than removes when possible
+
+#### Production Cleaning Results (August 2025)
+```
+Label cleaning results:
+Valid labels (unchanged): 60,206
+Invalid labels mapped: 352
+Invalid labels removed: 0
+Invalid word spans removed: 495
+Total annotations before cleaning: 61,053
+Total annotations after cleaning: 60,558
+Total spans filtered: 495
+```
+
+**Key Statistics:**
+- **99.2% retention rate** - Only 495 spans (0.8%) filtered for quality
+- **Zero label removal** - All invalid labels successfully mapped to valid categories
+- **352 labels mapped** from variations like "proper noun" → "noun", "auxiliary" → "verb"
+- **Clean logging output** - Aggregated counts replace thousands of repetitive debug messages
+
+#### Supported Word Span Patterns
+- **Pure text**: `"transformer"`, `"attention"`
+- **Numbers**: `"42"`, `"3.14"`, `"-5"`, `"2.7%"`
+- **Identifiers**: `"attention_weights"`, `"multi-head"`
+- **Abbreviations**: `"Dr."`, `"U.S."`, `"etc."`
+- **Expressions**: `"[83]"`, `"(t)"`, `"|s|"`
+- **Punctuated**: `"words:"`, `"Note:"`
+- **Version numbers**: `"1.2.3"`, `"v2.0"`
+
 ## X-bar Theory Integration
 
 ### Linguistic Foundation
````
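The supported-pattern list documented above can be approximated with a handful of regular expressions. This is a sketch only: the actual `XBarLabelMap` patterns are not shown in this commit, so `WORD_SPAN_PATTERNS` and `is_valid_word_span` are hypothetical names that merely mirror the documented examples.

```python
import re

# Hypothetical regexes mirroring the documented word span categories;
# the real XBarLabelMap implementation is not part of this diff.
WORD_SPAN_PATTERNS = [
    re.compile(r"^[A-Za-z]+$"),            # pure text: "transformer"
    re.compile(r"^-?\d+(\.\d+)*%?$"),      # numbers: "42", "3.14", "-5", "2.7%"
    re.compile(r"^[A-Za-z][\w-]*$"),       # identifiers: "attention_weights", "multi-head"
    re.compile(r"^([A-Za-z]+\.)+$"),       # abbreviations: "Dr.", "U.S.", "etc."
    re.compile(r"^\[[^\]]+\]$|^\([^)]+\)$|^\|[^|]+\|$"),  # expressions: "[83]", "(t)", "|s|"
    re.compile(r"^[A-Za-z]+:$"),           # trailing colon: "words:", "Note:"
    re.compile(r"^v?\d+(\.\d+)+$"),        # versions: "1.2.3", "v2.0"
]

def is_valid_word_span(text: str) -> bool:
    """Reject spans containing spaces, then accept anything matching a pattern."""
    if " " in text:  # spans with spaces are not word-level
        return False
    return any(p.match(text) for p in WORD_SPAN_PATTERNS)
```

Under these assumed patterns, `"word,"` and `"run quickly"` are filtered while `"2.7%"`, `"Dr."`, and `"|s|"` pass, matching the documented 495-span filtering behavior.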

x_spanformer/pipelines/span_annotator.py

Lines changed: 18 additions & 1 deletion
```diff
@@ -410,6 +410,10 @@ def consolidate_results(self, output_dir: Path):
         logger.info(f" Valid labels (unchanged): {mapping_stats['valid']}")
         logger.info(f" Invalid labels mapped: {mapping_stats['mapped']}")
         logger.info(f" Invalid labels removed: {mapping_stats['removed']}")
+        logger.info(f" Invalid word spans removed: {mapping_stats['invalid_word_spans']}")
+        logger.info(f" Total annotations before cleaning: {len(all_annotations)}")
+        logger.info(f" Total annotations after cleaning: {len(cleaned_annotations)}")
+        logger.info(f" Total spans filtered: {len(all_annotations) - len(cleaned_annotations)}")
 
         # Write annotations.jsonl file
         with open(annotations_file, 'w', encoding='utf-8') as f:
@@ -418,6 +422,16 @@ def consolidate_results(self, output_dir: Path):
 
         logger.info(f"Generated {len(cleaned_annotations)} annotation records in annotations.jsonl")
 
+        # Log label mapping summary if any mappings occurred
+        if hasattr(self, '_label_mappings') and self._label_mappings:
+            total_mappings = sum(self._label_mappings.values())
+            unique_labels = len(self._label_mappings)
+            logger.info(f"Label mapping summary: {total_mappings} total mappings for {unique_labels} unique labels")
+            # Show top 5 most common mappings for debugging
+            top_mappings = sorted(self._label_mappings.items(), key=lambda x: x[1], reverse=True)[:5]
+            for label, count in top_mappings:
+                logger.debug(f" '{label}': {count} occurrences")
+
         # Build dictionaries from collected spans
         logger.info("Building X-bar dictionaries from processed spans...")
         total_new_spans = 0
@@ -488,7 +502,10 @@ def _determine_hierarchical_level(self, xbar_label: str) -> Optional[str]:
         # Use the helper function from xbar_map for unknown labels
         mapped_level = XBarLabelMap.get_hierarchical_level(xbar_label)
         if mapped_level:
-            logger.debug(f"Mapped unknown label '{xbar_label}' to '{mapped_level}'")
+            # Count mappings instead of logging each one
+            if not hasattr(self, '_label_mappings'):
+                self._label_mappings = {}
+            self._label_mappings[xbar_label] = self._label_mappings.get(xbar_label, 0) + 1
             return mapped_level
         else:
             logger.warning(f"Unknown hierarchical level for label: {xbar_label}")
```
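The `_determine_hierarchical_level` change above swaps per-label `logger.debug` calls for a counting dict that is summarized once in `consolidate_results`. The same aggregation idea can be sketched with `collections.Counter` (`LabelMappingTracker` is an illustrative class, not part of the repo):

```python
from collections import Counter

class LabelMappingTracker:
    """Sketch of the aggregated-logging idea from the diff above:
    count label mappings instead of emitting one debug line each."""

    def __init__(self) -> None:
        self._label_mappings: Counter = Counter()

    def record(self, label: str) -> None:
        # One increment replaces one logger.debug call per mapping
        self._label_mappings[label] += 1

    def summary(self, top_n: int = 5):
        # Totals and top-N mappings, mirroring the consolidated log output
        total = sum(self._label_mappings.values())
        unique = len(self._label_mappings)
        top = self._label_mappings.most_common(top_n)
        return total, unique, top

tracker = LabelMappingTracker()
for label in ["proper noun", "auxiliary", "proper noun", "proper noun"]:
    tracker.record(label)
total, unique, top = tracker.summary()
```

This keeps log volume constant regardless of how many thousands of mappings occur, which is the point of the "intelligent logging" change.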

x_spanformer/xbar/xbar_dict.py

Lines changed: 11 additions & 6 deletions
```diff
@@ -83,8 +83,7 @@ def add_spans(self, domain_type: str, hierarchical_level: str, spans: List[str])
                 level_dict.add(span.strip())
 
         new_count = len(level_dict) - initial_count
-        if new_count > 0:
-            logger.debug(f"Added {new_count} new spans to {domain_type}.{hierarchical_level} (total: {len(level_dict)})")
+        # Removed individual level logging to reduce spam
 
         return new_count
 
@@ -246,20 +245,26 @@ def log_statistics(self):
         logger.info(f"Total unique spans: {stats['total_unique_spans']}")
 
         logger.info("Domain distribution:")
+        total_spans = stats['total_unique_spans']
         for domain, count in stats["domain_totals"].items():
-            logger.info(f" {domain}: {count} unique spans")
+            percentage = (count / total_spans * 100) if total_spans > 0 else 0
+            logger.info(f" {domain}: {count} unique spans ({percentage:.1f}%)")
 
         logger.info("Level distribution:")
         for level, count in stats["level_totals"].items():
-            logger.info(f" {level}: {count} unique spans")
+            percentage = (count / total_spans * 100) if total_spans > 0 else 0
+            logger.info(f" {level}: {count} unique spans ({percentage:.1f}%)")
 
         logger.info("Detailed breakdown:")
         for domain, domain_stats in stats["domains"].items():
+            domain_total = domain_stats['total']
             logger.info(f" {domain}:")
             for level, count in domain_stats.items():
                 if level != "total":
-                    logger.info(f" {level}: {count}")
-            logger.info(f" total: {domain_stats['total']}")
+                    level_percentage = (count / domain_total * 100) if domain_total > 0 else 0
+                    logger.info(f" {level}: {count} ({level_percentage:.1f}%)")
+            total_percentage = (domain_total / total_spans * 100) if total_spans > 0 else 0
+            logger.info(f" total: {domain_total} ({total_percentage:.1f}%)")
         logger.info("=" * 40)
 
     def save_dictionaries(self, output_dir: Path) -> int:
```
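The `log_statistics` change above pairs each count with its percentage of the total, guarding the division when the total is zero. A standalone sketch of that guarded computation (`distribution_with_percentages` is a hypothetical helper, not repo code):

```python
def distribution_with_percentages(counts: dict) -> dict:
    """Pair each count with its share of the total, as in the diff's
    log_statistics change; a zero total yields 0.0% instead of raising."""
    total = sum(counts.values())
    return {
        name: (count, (count / total * 100) if total > 0 else 0.0)
        for name, count in counts.items()
    }

# Example: a domain distribution over unique spans
dist = distribution_with_percentages({"code": 75, "natural": 25})
```

The ternary guard is what lets the same logging path run safely on an empty dictionary before any spans have been added.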
