Commit 504e226
feat: implement Topic Modeling (LDA/NMF) - Sprint 1 complete
- Created scrapetui/ai/topic_modeling.py with TopicModelingManager class
- Implemented LDA (Latent Dirichlet Allocation) topic modeling
- Implemented NMF (Non-negative Matrix Factorization) topic modeling
- Added category assignment and hierarchical topic structure
- All 16 legacy topic modeling tests passing (100%)
Features:
- Automatic topic extraction with configurable number of topics
- Top words per topic with weights
- Article-to-topic assignments with confidence scores
- Multi-label topic distribution for each article
- Hierarchical topic modeling (parent/child topics)
- Database integration for category assignment
- Comprehensive error handling
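The article-to-topic assignment and multi-label distribution described above can be sketched roughly as follows. This is a minimal illustration, not the code from scrapetui/ai/topic_modeling.py: the function name `assign_topics`, the `threshold` default, and the definition of confidence as the normalized weight of the dominant topic are all assumptions.

```python
def assign_topics(topic_weights, threshold=0.3):
    """Turn one article's raw per-topic weights into assignments.

    Hypothetical helper: the name, return shape, and default threshold
    are illustrative assumptions, not the TopicModelingManager API.
    """
    total = sum(topic_weights)
    if total <= 0:
        # No usable signal (e.g. empty or very short content).
        return {"primary": None, "confidence": 0.0, "labels": []}

    # Normalize so the weights form a distribution over topics.
    probs = [w / total for w in topic_weights]

    # Primary topic: the one with the largest share; its share is
    # treated here as the confidence score (an assumption).
    primary = max(range(len(probs)), key=probs.__getitem__)

    # Multi-label view: every topic whose share meets the threshold,
    # ordered from strongest to weakest.
    labels = sorted(
        ((i, p) for i, p in enumerate(probs) if p >= threshold),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return {"primary": primary, "confidence": probs[primary], "labels": labels}
```

Raising `threshold` narrows the multi-label set toward the single primary topic, which is presumably what the threshold-filtering tests exercise.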
Algorithms:
- LDA: Uses CountVectorizer + LatentDirichletAllocation from sklearn
- NMF: Uses TfidfVectorizer + NMF from sklearn
- Both support configurable parameters (iterations, top words, etc.)
- Automatic topic labeling from top 3 words
Test Coverage:
- Empty article lists
- Single article edge case
- Topic assignments and confidence
- Category assignment
- Multi-label classification
- Threshold filtering
- Edge cases (no content, short content, large topic counts)
Export:
- TopicModelingManager exported from scrapetui module
- Backward compatible with existing test suite
1 parent aa8fd1c, commit 504e226
2 files changed: 399 additions, 1 deletion