Commit 504e226

feat: implement Topic Modeling (LDA/NMF) - Sprint 1 complete
- Created scrapetui/ai/topic_modeling.py with TopicModelingManager class
- Implemented LDA (Latent Dirichlet Allocation) topic modeling
- Implemented NMF (Non-negative Matrix Factorization) topic modeling
- Added category assignment and hierarchical topic structure
- All 16 legacy topic modeling tests passing (100%)

Features:
- Automatic topic extraction with configurable number of topics
- Top words per topic with weights
- Article-to-topic assignments with confidence scores
- Multi-label topic distribution for each article
- Hierarchical topic modeling (parent/child topics)
- Database integration for category assignment
- Comprehensive error handling

Algorithms:
- LDA: Uses CountVectorizer + LatentDirichletAllocation from sklearn
- NMF: Uses TfidfVectorizer + NMF from sklearn
- Both support configurable parameters (iterations, top words, etc.)
- Automatic topic labeling from top 3 words

Test Coverage:
- Empty article lists
- Single article edge case
- Topic assignments and confidence
- Category assignment
- Multi-label classification
- Threshold filtering
- Edge cases (no content, short content, large topic counts)

Export:
- TopicModelingManager exported from scrapetui module
- Backward compatible with existing test suite
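
The labeling, confidence, and threshold-filtering features listed above can be illustrated with a minimal sketch. This is not the code from scrapetui/ai/topic_modeling.py, which is not shown in this view. The function names, the comma-joined label format, and the toy data are all assumptions made for illustration. The inputs are assumed to be per-topic word weights (in scikit-learn these would come from a fitted model's `components_`) and a per-article topic distribution (for LDA, the rows returned by `fit_transform`; NMF output would need normalizing first).

```python
from typing import Dict, List, Sequence, Tuple


def label_topic(word_weights: Dict[str, float], n_words: int = 3) -> str:
    """Build a label by joining the n highest-weighted words of a topic.

    Hypothetical helper: the comma separator is an assumption.
    """
    ranked = sorted(word_weights.items(), key=lambda kv: kv[1], reverse=True)
    return ", ".join(word for word, _ in ranked[:n_words])


def primary_topic(distribution: Sequence[float]) -> Tuple[int, float]:
    """Return (topic index, confidence) for the highest-weighted topic."""
    best = max(range(len(distribution)), key=lambda i: distribution[i])
    return best, distribution[best]


def multi_label_topics(
    distribution: Sequence[float], threshold: float = 0.2
) -> List[Tuple[int, float]]:
    """Return every (topic index, weight) at or above threshold, strongest first."""
    kept = [(i, w) for i, w in enumerate(distribution) if w >= threshold]
    return sorted(kept, key=lambda iw: iw[1], reverse=True)


# Toy data standing in for model output (assumed values, not real results).
topics = [
    {"election": 0.9, "senate": 0.7, "vote": 0.6, "policy": 0.2},
    {"match": 0.8, "league": 0.75, "goal": 0.5, "coach": 0.1},
]
article_distribution = [0.7, 0.3]

labels = [label_topic(t) for t in topics]
top_idx, confidence = primary_topic(article_distribution)
multi = multi_label_topics(article_distribution, threshold=0.25)

print(f"primary: {labels[top_idx]!r} (confidence {confidence})")
print(f"multi-label: {[(labels[i], w) for i, w in multi]}")
```

Raising the threshold narrows the multi-label result to the dominant topics, and an empty distribution yields an empty list rather than an error, which matches the kind of edge cases the test coverage list names.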
1 parent aa8fd1c commit 504e226

2 files changed: +399 -1 lines changed

scrapetui/__init__.py

Lines changed: 2 additions & 1 deletion
@@ -43,6 +43,7 @@
 )
 from .database.migrations import run_migrations
 from .config import Config, get_config, reset_config
+from .ai.topic_modeling import TopicModelingManager
 
 
 # Backward-compatible wrapper for migrate_database_to_v2()
@@ -85,7 +86,7 @@ def load_env_file(): pass
 ContentSimilarityManager = None
 KeywordExtractionManager = None
 MultiLevelSummarizationManager = None
-TopicModelingManager = None
+# TopicModelingManager imported from ai.topic_modeling
 EntityRelationshipManager = None
 DuplicateDetectionManager = None
 SummaryQualityManager = None
