# Enhanced Ingestion System with Transformers

This directory contains the sophisticated ingestion engine that powers content extraction, entity recognition, relationship detection, and semantic understanding using state-of-the-art transformer models.

## 🚀 Features

### Entity Extraction with spaCy Transformers
- **Transformer-based NER**: Uses `en_core_web_trf` for superior entity recognition
- **Custom Entity Patterns**: Domain-specific patterns for technology, projects, etc. (see the sketch below)
- **GPU Acceleration**: Optional GPU support for faster processing
- **Fallback Models**: Gracefully degrades to smaller models if transformer models are unavailable
- **Entity Context**: Extracts entities with surrounding context for better understanding

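
A minimal sketch of how domain-specific patterns can be layered on top of the spaCy pipeline; the labels and terms below are illustrative, not the project's actual pattern set:

```python
import spacy

# Load the transformer pipeline, falling back to the small model if it is missing.
try:
    nlp = spacy.load("en_core_web_trf")
except OSError:
    nlp = spacy.load("en_core_web_sm")

# Add a rule-based EntityRuler ahead of the statistical NER component.
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "TECHNOLOGY", "pattern": [{"LOWER": "kubernetes"}]},
    {"label": "PROJECT", "pattern": [{"LOWER": "project"}, {"IS_TITLE": True}]},
])

doc = nlp("We migrated Project Atlas to Kubernetes last quarter.")
print([(ent.text, ent.label_) for ent in doc.ents])
```
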
### Relationship Detection
- **Dependency Parsing**: Advanced syntax analysis for relationship extraction (sketched below)
- **Transformer Similarity**: Semantic similarity using transformer embeddings
- **Pattern Matching**: Rule-based patterns for common relationships
- **Contextual Analysis**: Understands relationships based on surrounding text
- **Multiple Detection Methods**: Combines multiple approaches for better accuracy

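
A rough illustration of the dependency-parsing approach: the sketch below pulls subject-verb-object triples from a parse. The actual `RelationshipDetector` combines this signal with embeddings, patterns, and proximity analysis.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any model with a parser works here

def svo_triples(text: str):
    """Return (subject, verb, object) triples found via the dependency parse."""
    doc = nlp(text)
    triples = []
    for token in doc:
        if token.pos_ != "VERB":
            continue
        subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
        objects = [c for c in token.children if c.dep_ in ("dobj", "attr", "dative")]
        triples.extend((s.text, token.lemma_, o.text) for s in subjects for o in objects)
    return triples

print(svo_triples("Apple acquired the startup after Microsoft announced a partnership."))
```
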
### Automatic Embedding Generation
- **Multiple Models**: Support for OpenAI, Sentence-Transformers, and more
- **Auto Model Selection**: Automatically chooses the best available model
- **Chunking Support**: Handles long documents with overlapping chunks (see the sketch below)
- **Caching**: Efficient embedding reuse for repeated content
- **Dimension Reduction**: Can reduce embedding dimensions when needed

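
A simplified sketch of overlapping-chunk embedding with Sentence-Transformers. Chunking here is by characters for brevity; the pipeline's own `chunk_size`/`chunk_overlap` settings may be defined differently (e.g. in tokens).

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")

def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping character windows."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

long_document = "Quarterly planning notes and decisions. " * 200  # stand-in document
chunks = chunk_text(long_document)
embeddings = model.encode(chunks)  # one vector per chunk
print(len(chunks), embeddings.shape)
```
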
### Intent Classification
- **Zero-Shot Classification**: Uses BART for intent recognition without training (sketched below)
- **Pattern-Based Fallback**: Rule-based detection when transformers are unavailable
- **Action Item Extraction**: Automatically identifies TODOs and action items
- **Urgency Detection**: Assesses content urgency based on linguistic cues
- **Sentiment Analysis**: Optional sentiment scoring using TextBlob

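
A hedged sketch of zero-shot intent classification with Hugging Face Transformers; the candidate labels are illustrative and may not match the project's intent types.

```python
from transformers import pipeline

# BART fine-tuned on NLI, used as a zero-shot classifier (no task-specific training).
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "Can someone fix the login bug before Friday's release?",
    candidate_labels=["question", "task", "decision", "problem", "idea"],
)
print(result["labels"][0], round(result["scores"][0], 2))  # top intent and its score
```
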
## 📦 Installation

1. Install required packages:
```bash
pip install -r requirements.txt
```

2. Download spaCy models:
```bash
# Transformer model (best quality, ~500MB)
python -m spacy download en_core_web_trf

# Large model (good quality, ~800MB)
python -m spacy download en_core_web_lg

# Small model (fast, ~50MB)
python -m spacy download en_core_web_sm
```

3. Optional: install a CUDA-enabled PyTorch build for GPU acceleration
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
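
To confirm the CUDA build actually sees a GPU, a quick check (assuming PyTorch was installed as above):

```python
import torch

# True when a CUDA-capable GPU and matching driver are visible to PyTorch.
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU only")
```
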

## 🔧 Usage

### Basic Usage

```python
from app.ingestion.core_extraction_pipeline import CoreExtractionPipeline
from app.ingestion.models import IngestionRequest

# Initialize pipeline
pipeline = CoreExtractionPipeline(use_gpu=False)

# Create request
request = IngestionRequest(
    content="Apple Inc. announced the iPhone 15 with AI features.",
    extract_entities=True,
    extract_relationships=True,
    detect_intent=True,
    generate_embeddings=True
)

# Process content
response = await pipeline.process(request)

# Access results
if response.status == "completed":
    content = response.processed_content
    print(f"Entities: {[e.text for e in content.entities]}")
    print(f"Intent: {content.intent.type.value}")
```

### Advanced Configuration

```python
from app.ingestion.models import IngestionConfig

# Custom configuration
config = IngestionConfig(
    entity_model="en_core_web_trf",
    embedding_model="all-mpnet-base-v2",
    min_entity_confidence=0.8,
    min_relationship_confidence=0.7,
    enable_custom_entities=True,
    chunk_size=1000,
    chunk_overlap=200
)

pipeline = CoreExtractionPipeline(config=config, use_gpu=True)
```

### Batch Processing

```python
# Process multiple documents efficiently
requests = [
    IngestionRequest(content=doc) for doc in documents
]

responses = await pipeline.batch_process(requests)
```

## 🏗️ Architecture

### Core Components

1. **EntityExtractor** (`entity_extractor.py`)
   - spaCy transformer models
   - Custom entity patterns
   - Confidence scoring
   - Entity normalization

2. **RelationshipDetector** (`relationship_detector.py`)
   - Dependency parsing
   - Transformer embeddings
   - Pattern matching
   - Proximity analysis

3. **EmbeddingGenerator** (`embedding_generator.py`)
   - Multiple model support
   - Async generation
   - Chunking strategies
   - Similarity calculations

4. **IntentRecognizer** (`intent_recognizer.py`)
   - Zero-shot classification
   - Pattern matching
   - Action item extraction
   - Urgency assessment

5. **CoreExtractionPipeline** (`core_extraction_pipeline.py`)
   - Orchestrates all components
   - Quality assessment
   - Batch processing
   - Error handling

### Data Models

All data models are defined in `models.py` (an illustrative sketch of the `Entity` shape follows this list):
- `Entity`: Extracted entity with type, position, and confidence
- `Relationship`: Connection between entities
- `Intent`: User intent with urgency and action items
- `ProcessedContent`: Complete extraction results
- `IngestionConfig`: Pipeline configuration

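
The `Entity` record has roughly the shape below. This is an illustrative sketch only; the authoritative definitions (field names, enums, defaults) live in `models.py`.

```python
from dataclasses import dataclass

@dataclass
class Entity:
    text: str          # surface form as it appears in the content
    type: str          # e.g. "PERSON", "ORG", or a custom label
    start: int         # character offset where the entity begins
    end: int           # character offset where the entity ends
    confidence: float  # extraction confidence in [0, 1]
```
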
## 🎯 Use Cases

1. **Knowledge Management**
   - Extract key concepts and relationships from documents
   - Build knowledge graphs automatically
   - Identify important information

2. **Task Management**
   - Extract action items from meeting notes
   - Identify deadlines and urgency
   - Track decisions and problems

3. **Content Analysis**
   - Understand document topics and themes
   - Assess content quality
   - Generate tags and categories

4. **Search Enhancement**
   - Generate semantic embeddings
   - Enable similarity search
   - Improve retrieval accuracy

## 🔍 Demo

Run the demo to see all features in action:

```bash
python demos/demo_core_extraction.py
```

This will demonstrate:
- Entity extraction from technical discussions
- Relationship detection in meeting notes
- Intent recognition with action items
- Topic classification
- Batch processing performance

## 🚦 Performance Tips

1. **Model Selection**
   - Use transformer models for best quality
   - Use large models for a good quality/speed balance
   - Use small models for speed

2. **GPU Acceleration**
   - Enable GPU for a 2-5x speedup
   - Batch process documents
   - Use async processing

3. **Caching**
   - Enable the embedding cache (a toy illustration follows this list)
   - Reuse pipeline instances
   - Process similar content together

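
The embedding cache idea in a nutshell, as a toy sketch keyed by a content hash; the pipeline's built-in cache is the supported mechanism.

```python
import hashlib

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
_cache: dict[str, list[float]] = {}

def embed_cached(text: str) -> list[float]:
    """Compute an embedding once per unique text; reuse it on repeated content."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = model.encode(text).tolist()
    return _cache[key]
```
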
## 📊 Metrics

Typical performance on modern hardware:
- Entity extraction: 50-200ms per document
- Relationship detection: 100-300ms per document
- Embedding generation: 50-150ms per document
- Full pipeline: 200-500ms per document

With GPU acceleration:
- 2-5x faster processing
- Better for batch processing
- Handles longer documents efficiently
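
To reproduce these numbers on your own hardware, a rough timing harness (reusing the `pipeline` and `request` objects from the usage examples above):

```python
import asyncio
import time

async def time_pipeline(pipeline, request) -> float:
    """Return end-to-end processing latency in milliseconds."""
    start = time.perf_counter()
    await pipeline.process(request)
    return (time.perf_counter() - start) * 1000

# elapsed_ms = asyncio.run(time_pipeline(pipeline, request))
# print(f"{elapsed_ms:.0f} ms")
```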