Commit a0068d4
Feature/enhanced label cleaning (#47)
* Enhance span annotation handling and logging for zero-span sequences
* Add tests for processing sequences with zero spans and ensure they are skipped
* Implement XBarDictionary for managing hierarchical spans and enhance span annotator with dictionary integration
* Add hierarchical level classification for XBar labels in span annotator
* Refactor XBarDictionary and SpanAnnotatorPipeline for improved dictionary management and statistics tracking
* Enhance SpanAnnotatorPipeline to generate annotations from dictionary spans and add AnnotationValidator for validating annotations.jsonl files
* Refactor SpanAnnotatorPipeline and XBarDictionary by removing unused parameters and consolidating annotation generation logic
* Rename dictionary.jsonl to spans.jsonl in save and load methods for clarity
* Enhance SpanAnnotatorPipeline to build X-bar dictionaries and generate annotations.jsonl from working files
* Rename dictionary.jsonl to spans.jsonl in test assertions and update annotation analysis to reflect changes
* Add tests for XBar label cleaning functionality and enhance label validation in the pipeline
* Remove validate_annotations.py as it is no longer needed
* Enhance logging in SpanAnnotatorPipeline and XBarDictionary for better annotation tracking and validation. Add word span validation in XBarLabelMap to filter invalid spans.
* Enhance README and span_annotator documentation with advanced label cleaning system details, including comprehensive word span validation and intelligent logging improvements.
* Update multi-label test cases for improved label validation and structure preservation
1 parent 3a9c6ec commit a0068d4

File tree: 6 files changed (+244 / -22 lines)

README.md

Lines changed: 17 additions & 2 deletions
```diff
@@ -196,14 +196,19 @@ uv run -m x_spanformer.pipelines.span_annotator \
 This implements **production-grade agentic X-bar span annotation** with enhanced robustness, featuring:
 
 - **Enhanced JSON Parsing Robustness**: Handles truncated LLM responses, malformed JSON, and case-insensitive matching
+- **Advanced Label Cleaning System**: Comprehensive word span validation with pattern-based filtering
 - **Independent Boundary Prediction**: Generates training targets for start/end position classification using factorized linear heads
 - **X-bar Hierarchical Structure**: Domain-specific classifier extraction based on linguistic phrase structure theory
+- **Intelligent Logging**: Aggregated counts replace repetitive debug messages for cleaner output
+- **Word Span Validation**: Supports percentages, abbreviations, expressions, and complex patterns
 - **Position-wise Binary Classification**: Creates sigmoid-normalized boundary probabilities for BCE loss training
 - **Multi-label Span Support**: Handles overlapping spans at different hierarchical levels (word → phrase → clause)
 - **Production Validation**: Zero position errors across 1,703 spans in 56 sequences (August 2025)
 
 **Production Results (August 2025):**
-- **1,703 total spans** generated with 128.2% overlap ratio
+- **60,558 clean annotations** from 61,053 original spans (99.2% retention rate)
+- **495 invalid word spans** automatically filtered using pattern-based validation
+- **352 labels mapped** from invalid to valid categories with aggregated logging
 - **Zero validation errors** in position encoding and text extraction
 - **Perfect alignment** with factorized pointer network requirements
 - **Enhanced reliability** with automatic recovery from LLM response issues
@@ -222,13 +227,18 @@ data/annotations/
 
 **Key Features:**
 - **Enhanced JSON Parsing**: Robust handling of truncated LLM responses and malformed JSON
+- **Advanced Label Cleaning**: Comprehensive word span validation with pattern-based filtering
 - **Bidirectional Context**: Built on X-Spanformer's position-wise embedding architecture where each H[t] contains bidirectional contextual information
 - **Boundary Detection Training**: Generates binary targets for start/end position prediction (not span-level embeddings)
 - **Multi-label Support**: BCE loss handles overlapping spans at different hierarchical levels
 - **Production Validation**: Zero position or text extraction errors across all generated spans
-- **Improved Logging**: Concise sequence selection summaries replace verbose lists for better performance and readability
+- **Intelligent Logging**: Aggregated counts replace repetitive debug messages for cleaner output
 
 **Recent Enhancements (August 2025):**
+- **Advanced Label Cleaning**: Pattern-based word span validation supporting percentages, abbreviations, and expressions
+- **Intelligent Logging System**: Aggregated counts replace thousands of repetitive debug messages
+- **Enhanced Word Span Patterns**: Support for decimals ("3.14"), percentages ("2.7%"), abbreviations ("Dr."), expressions ("[83]", "(t)", "|s|")
+- **Production Cleaning Results**: 99.2% retention rate with 495 spans filtered and 352 labels mapped
 - **Logging Optimization**: `Selected 1000 sequences (1 to 1000) out of 1000 requested` instead of massive sequence lists
 - **Performance**: Reduced I/O overhead and log file size while maintaining essential debugging information
 - **Scalability**: Handles large sequence ranges without log bloat or memory issues
@@ -245,6 +255,7 @@ X-Spanformer includes comprehensive test coverage organized into focused categor
 - **`tests/pipelines/`** - Data processing pipeline tests
   - `test_pipelines_pdf2jsonl.py` - PDF→JSONL conversion with AI judging
   - `test_pipelines_jsonl2vocab.py` - Vocabulary induction (Section 3.1)
+  - `test_pipeline_span_annotator.py` - Span annotation pipeline tests with label cleaning validation
   - `test_pipelines_vocab2embedding.py` - Seed embeddings & span generation (Section 3.2)
   - `test_integration_vocab2embedding.py` - End-to-end integration validation
 
@@ -263,6 +274,10 @@ X-Spanformer includes comprehensive test coverage organized into focused categor
   - `test_span_annotator.py` - Span annotation pipeline tests
   - `test_e2e_ollama_client.py` - Ollama client integration
 
+- **`tests/xbar/`** - X-bar theory and label cleaning tests
+  - `test_xbar_map.py` - Label cleaning and word span validation tests
+  - `test_xbar_annotator.py` - X-bar annotation logic tests
+
 - **`tests/config/`** - Configuration system tests
   - `test_span_annotator_config.py` - Configuration loading with logging support
 
```

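The README hunks above describe position-wise binary boundary targets with sigmoid probabilities for BCE loss, shared by overlapping multi-label spans. A minimal sketch of how such targets could be built (`boundary_targets` is a hypothetical helper for illustration, not code from this commit), assuming inclusive span ends:

```python
from typing import List, Tuple

def boundary_targets(spans: List[Tuple[int, int]], seq_len: int) -> Tuple[List[float], List[float]]:
    """Build position-wise binary start/end targets for BCE loss.

    Each position t gets 1.0 if any span starts (or ends) there, else 0.0.
    Overlapping spans at different hierarchical levels simply share targets,
    which is what allows multi-label BCE training over the same sequence.
    """
    starts = [0.0] * seq_len
    ends = [0.0] * seq_len
    for start, end in spans:  # end index is inclusive
        starts[start] = 1.0
        ends[end] = 1.0
    return starts, ends

# Two overlapping spans: a word [2, 2] nested inside a phrase [2, 5]
starts, ends = boundary_targets([(2, 2), (2, 5)], seq_len=8)
```

Both spans contribute to position 2's start target, so no information is lost when word-level and phrase-level annotations coincide.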
tests/xbar/test_xbar_label_cleaning.py

Lines changed: 4 additions & 4 deletions
```diff
@@ -69,8 +69,8 @@ def test_clean_and_validate_labels_basic(self):
     def test_clean_and_validate_labels_multi_label(self):
         """Test handling of multi-label cases."""
         annotations = [
-            {"xbar_label": "noun, punctuation", "text": "word,"},
-            {"xbar_label": "verb, adverb", "text": "run quickly"},
+            {"xbar_label": "noun, punctuation", "text": "word"},  # Just the word without punctuation
+            {"xbar_label": "adjective, noun", "text": "quick-brown"},  # Valid identifier-style word
         ]
 
         cleaned, stats = XBarLabelMap.clean_and_validate_labels(annotations)
@@ -83,8 +83,8 @@ def test_clean_and_validate_labels_multi_label(self):
 
         # Check that they map to first valid component
         labels = [ann["xbar_label"] for ann in cleaned]
-        assert "noun" in labels  # From "noun, punctuation"
-        assert "verb" in labels  # From "verb, adverb"
+        assert "noun" in labels  # From "noun, punctuation"
+        assert "adjective" in labels  # From "adjective, noun"
 
     def test_clean_and_validate_labels_preserves_structure(self):
         """Test that cleaning preserves annotation structure."""
```

x_spanformer/pipelines/span_annotator.md

Lines changed: 61 additions & 7 deletions
````diff
@@ -8,8 +8,10 @@ The Unified Span Annotator Pipeline is a **production-ready, battle-tested** imp
 
 - **Three-turn conversation strategy**: Progressive analysis from word-level → phrase-level → clause-level
 - **Enhanced JSON parsing robustness**: Handles truncated LLM responses, malformed JSON, and case-insensitive text matching
+- **Advanced label cleaning system**: Comprehensive word span validation with pattern-based filtering
 - **Resume capability**: Automatically resumes from previous progress with gap detection
 - **Production-grade error handling**: Comprehensive span validation and position verification
+- **Intelligent logging system**: Aggregated counts instead of verbose repetitive messages
 - **Real-time telemetry**: Progress tracking with detailed span type statistics
 - **Multiple output formats**: Working files, consolidated results, and analysis reports
 - **Factorized pointer network ready**: Generates training data perfectly aligned with Section 3.3 architecture
@@ -21,25 +23,29 @@ The Unified Span Annotator Pipeline is a **production-ready, battle-tested** imp
 1. **SpanAnnotatorPipeline**: Main pipeline orchestrator with resume and gap detection
 2. **SpanAnnotatorSession**: Async session management with timeout controls
 3. **XBarAnnotator**: X-bar theory-based span extraction with enhanced JSON parsing
-4. **Output Management**: Working files, consolidation, metadata, and analysis reports
-5. **JSON Parsing Robustness**: Truncation detection, malformed JSON recovery, case-insensitive matching
+4. **XBarLabelMap**: Advanced label cleaning and word span validation system
+5. **Output Management**: Working files, consolidation, metadata, and analysis reports
+6. **JSON Parsing Robustness**: Truncation detection, malformed JSON recovery, case-insensitive matching
 
 ### Processing Flow
 
 ```
 Input (corpus.jsonl) → Load Sequences → Filter by Range →
 Process in Batches → Annotate with XBar → Enhanced JSON Parsing →
-Save Working Files → Position Validation → Consolidate Results
-Generate Metadata → Analysis Reports → Output
+Save Working Files → Position Validation → Label Cleaning & Word Span Validation
+Consolidate Results → Generate Metadata → Analysis Reports → Output
 ```
 
 ### Production Status (August 2025)
 
-**✅ PRODUCTION READY**: Successfully processing sequences with zero position errors
-- **1,703 total spans** generated across 56 sequences
-- **128.2% overlap ratio** supporting multi-label boundary prediction
+**✅ PRODUCTION READY**: Successfully processing sequences with zero position errors and advanced label cleaning
+- **60,558 clean annotations** from 61,053 original spans (99.2% retention rate)
+- **495 invalid word spans** automatically filtered using pattern-based validation
+- **352 labels mapped** from invalid to valid categories with aggregated logging
 - **Zero validation errors** in position encoding and text extraction
 - **Enhanced JSON robustness** handling truncated LLM responses
+- **Intelligent logging system** with count aggregation instead of repetitive debug messages
+- **Comprehensive word span validation** supporting percentages, abbreviations, and expressions
 - **Perfect alignment** with factorized pointer network requirements (Section 3.3)
 
 ## Usage
@@ -399,6 +405,54 @@ def _extract_text_boundaries(self, text: str, target_text: str) -> Optional[Tupl
 - **Automatic recovery** from truncated responses at sequence 40
 - **Enhanced reliability** for large-scale annotation tasks
 
+## Enhanced Label Cleaning System
+
+### Advanced Word Span Validation
+
+The pipeline includes a comprehensive label cleaning and word span validation system that ensures high-quality training data:
+
+#### Pattern-Based Word Span Filtering
+- **Spaces detection**: Automatically removes spans containing spaces (not word-level)
+- **Mixed character validation**: Filters invalid combinations of letters, numbers, and special characters
+- **Identifier patterns**: Allows valid programming identifiers (letters + underscores/hyphens)
+- **Number formats**: Supports integers, decimals, negative numbers, and percentages
+- **Abbreviations**: Allows words with periods (e.g., "Dr.", "U.S.", "etc.")
+- **Expressions**: Supports bracketed `[83]`, parenthetical `(t)`, and pipe `|s|` expressions
+- **Trailing punctuation**: Allows words ending with colons ("words:")
+
+#### Label Mapping System
+- **Intelligent mapping**: Converts invalid labels to valid X-bar categories
+- **Aggregated logging**: Shows mapping counts instead of repetitive debug messages
+- **Statistical reporting**: Provides comprehensive cleaning statistics
+- **Zero data loss**: Maps rather than removes when possible
+
+#### Production Cleaning Results (August 2025)
+```
+Label cleaning results:
+Valid labels (unchanged): 60,206
+Invalid labels mapped: 352
+Invalid labels removed: 0
+Invalid word spans removed: 495
+Total annotations before cleaning: 61,053
+Total annotations after cleaning: 60,558
+Total spans filtered: 495
+```
+
+**Key Statistics:**
+- **99.2% retention rate** - Only 495 spans (0.8%) filtered for quality
+- **Zero label removal** - All invalid labels successfully mapped to valid categories
+- **352 labels mapped** from variations like "proper noun" → "noun", "auxiliary" → "verb"
+- **Clean logging output** - Aggregated counts replace thousands of repetitive debug messages
+
+#### Supported Word Span Patterns
+- **Pure text**: `"transformer"`, `"attention"`
+- **Numbers**: `"42"`, `"3.14"`, `"-5"`, `"2.7%"`
+- **Identifiers**: `"attention_weights"`, `"multi-head"`
+- **Abbreviations**: `"Dr."`, `"U.S."`, `"etc."`
+- **Expressions**: `"[83]"`, `"(t)"`, `"|s|"`
+- **Punctuated**: `"words:"`, `"Note:"`
+- **Version numbers**: `"1.2.3"`, `"v2.0"`
+
 ## X-bar Theory Integration
 
 ### Linguistic Foundation
````
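The supported-pattern list documented above can be approximated with a handful of regular expressions. This is a sketch only: the actual `XBarLabelMap` patterns are not shown in this commit, so `WORD_SPAN_PATTERNS` and `is_valid_word_span` are hypothetical names that merely mirror the documented examples.

```python
import re

# Hypothetical regexes mirroring the documented word span categories;
# the real XBarLabelMap implementation is not part of this diff.
WORD_SPAN_PATTERNS = [
    re.compile(r"^[A-Za-z]+$"),            # pure text: "transformer"
    re.compile(r"^-?\d+(\.\d+)*%?$"),      # numbers: "42", "3.14", "-5", "2.7%"
    re.compile(r"^[A-Za-z][\w-]*$"),       # identifiers: "attention_weights", "multi-head"
    re.compile(r"^([A-Za-z]+\.)+$"),       # abbreviations: "Dr.", "U.S.", "etc."
    re.compile(r"^\[[^\]]+\]$|^\([^)]+\)$|^\|[^|]+\|$"),  # expressions: "[83]", "(t)", "|s|"
    re.compile(r"^[A-Za-z]+:$"),           # trailing colon: "words:", "Note:"
    re.compile(r"^v?\d+(\.\d+)+$"),        # versions: "1.2.3", "v2.0"
]

def is_valid_word_span(text: str) -> bool:
    """Reject spans containing spaces, then accept anything matching a pattern."""
    if " " in text:  # spans with spaces are not word-level
        return False
    return any(p.match(text) for p in WORD_SPAN_PATTERNS)
```

Under these assumed patterns, `"word,"` and `"run quickly"` are filtered while `"2.7%"`, `"Dr."`, and `"|s|"` pass, matching the documented 495-span filtering behavior.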

x_spanformer/pipelines/span_annotator.py

Lines changed: 18 additions & 1 deletion
```diff
@@ -410,6 +410,10 @@ def consolidate_results(self, output_dir: Path):
         logger.info(f" Valid labels (unchanged): {mapping_stats['valid']}")
         logger.info(f" Invalid labels mapped: {mapping_stats['mapped']}")
         logger.info(f" Invalid labels removed: {mapping_stats['removed']}")
+        logger.info(f" Invalid word spans removed: {mapping_stats['invalid_word_spans']}")
+        logger.info(f" Total annotations before cleaning: {len(all_annotations)}")
+        logger.info(f" Total annotations after cleaning: {len(cleaned_annotations)}")
+        logger.info(f" Total spans filtered: {len(all_annotations) - len(cleaned_annotations)}")
 
         # Write annotations.jsonl file
         with open(annotations_file, 'w', encoding='utf-8') as f:
@@ -418,6 +422,16 @@ def consolidate_results(self, output_dir: Path):
 
         logger.info(f"Generated {len(cleaned_annotations)} annotation records in annotations.jsonl")
 
+        # Log label mapping summary if any mappings occurred
+        if hasattr(self, '_label_mappings') and self._label_mappings:
+            total_mappings = sum(self._label_mappings.values())
+            unique_labels = len(self._label_mappings)
+            logger.info(f"Label mapping summary: {total_mappings} total mappings for {unique_labels} unique labels")
+            # Show top 5 most common mappings for debugging
+            top_mappings = sorted(self._label_mappings.items(), key=lambda x: x[1], reverse=True)[:5]
+            for label, count in top_mappings:
+                logger.debug(f" '{label}': {count} occurrences")
+
         # Build dictionaries from collected spans
         logger.info("Building X-bar dictionaries from processed spans...")
         total_new_spans = 0
@@ -488,7 +502,10 @@ def _determine_hierarchical_level(self, xbar_label: str) -> Optional[str]:
         # Use the helper function from xbar_map for unknown labels
         mapped_level = XBarLabelMap.get_hierarchical_level(xbar_label)
         if mapped_level:
-            logger.debug(f"Mapped unknown label '{xbar_label}' to '{mapped_level}'")
+            # Count mappings instead of logging each one
+            if not hasattr(self, '_label_mappings'):
+                self._label_mappings = {}
+            self._label_mappings[xbar_label] = self._label_mappings.get(xbar_label, 0) + 1
             return mapped_level
         else:
             logger.warning(f"Unknown hierarchical level for label: {xbar_label}")
```
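The `_determine_hierarchical_level` change above swaps per-label `logger.debug` calls for a counting dict that is summarized once in `consolidate_results`. The same aggregation idea can be sketched with `collections.Counter` (`LabelMappingTracker` is an illustrative class, not part of the repo):

```python
from collections import Counter

class LabelMappingTracker:
    """Sketch of the aggregated-logging idea from the diff above:
    count label mappings instead of emitting one debug line each."""

    def __init__(self) -> None:
        self._label_mappings: Counter = Counter()

    def record(self, label: str) -> None:
        # One increment replaces one logger.debug call per mapping
        self._label_mappings[label] += 1

    def summary(self, top_n: int = 5):
        # Totals and top-N mappings, mirroring the consolidated log output
        total = sum(self._label_mappings.values())
        unique = len(self._label_mappings)
        top = self._label_mappings.most_common(top_n)
        return total, unique, top

tracker = LabelMappingTracker()
for label in ["proper noun", "auxiliary", "proper noun", "proper noun"]:
    tracker.record(label)
total, unique, top = tracker.summary()
```

This keeps log volume constant regardless of how many thousands of mappings occur, which is the point of the "intelligent logging" change.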

x_spanformer/xbar/xbar_dict.py

Lines changed: 11 additions & 6 deletions
```diff
@@ -83,8 +83,7 @@ def add_spans(self, domain_type: str, hierarchical_level: str, spans: List[str])
                 level_dict.add(span.strip())
 
         new_count = len(level_dict) - initial_count
-        if new_count > 0:
-            logger.debug(f"Added {new_count} new spans to {domain_type}.{hierarchical_level} (total: {len(level_dict)})")
+        # Removed individual level logging to reduce spam
 
         return new_count
 
@@ -246,20 +245,26 @@ def log_statistics(self):
         logger.info(f"Total unique spans: {stats['total_unique_spans']}")
 
         logger.info("Domain distribution:")
+        total_spans = stats['total_unique_spans']
         for domain, count in stats["domain_totals"].items():
-            logger.info(f" {domain}: {count} unique spans")
+            percentage = (count / total_spans * 100) if total_spans > 0 else 0
+            logger.info(f" {domain}: {count} unique spans ({percentage:.1f}%)")
 
         logger.info("Level distribution:")
         for level, count in stats["level_totals"].items():
-            logger.info(f" {level}: {count} unique spans")
+            percentage = (count / total_spans * 100) if total_spans > 0 else 0
+            logger.info(f" {level}: {count} unique spans ({percentage:.1f}%)")
 
         logger.info("Detailed breakdown:")
         for domain, domain_stats in stats["domains"].items():
+            domain_total = domain_stats['total']
             logger.info(f" {domain}:")
             for level, count in domain_stats.items():
                 if level != "total":
-                    logger.info(f" {level}: {count}")
-            logger.info(f" total: {domain_stats['total']}")
+                    level_percentage = (count / domain_total * 100) if domain_total > 0 else 0
+                    logger.info(f" {level}: {count} ({level_percentage:.1f}%)")
+            total_percentage = (domain_total / total_spans * 100) if total_spans > 0 else 0
+            logger.info(f" total: {domain_total} ({total_percentage:.1f}%)")
         logger.info("=" * 40)
 
     def save_dictionaries(self, output_dir: Path) -> int:
```
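The `log_statistics` change above pairs each count with its percentage of the total, guarding the division when the total is zero. A standalone sketch of that guarded computation (`distribution_with_percentages` is a hypothetical helper, not repo code):

```python
def distribution_with_percentages(counts: dict) -> dict:
    """Pair each count with its share of the total, as in the diff's
    log_statistics change; a zero total yields 0.0% instead of raising."""
    total = sum(counts.values())
    return {
        name: (count, (count / total * 100) if total > 0 else 0.0)
        for name, count in counts.items()
    }

# Example: a domain distribution over unique spans
dist = distribution_with_percentages({"code": 75, "natural": 25})
```

The ternary guard is what lets the same logging path run safely on an empty dictionary before any spans have been added.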
