Comparative analysis of text classification methods on a single corpus: fastText, BERT, a RAG approach, and scikit-learn models.

| Model | Approach | Key Features |
|---|---|---|
| fastText | Shallow neural network | Fast training, subword embeddings |
| Scikit-learn | TF-IDF + Logistic Regression | Lightweight, interpretable |
| BERT | Transformer | High accuracy, contextual embeddings |
| RAG | TF-IDF + kNN retrieval | No training required, explainable |
| VectorDB RAG | ChromaDB + LaBSE | Semantic search, multilingual |
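
To make the "TF-IDF + kNN retrieval" row concrete, here is a minimal sketch of the idea using only the Python standard library. It is not the project's implementation: the toy corpus, the function names (`build_idf`, `tfidf`, `classify`), the smoothed IDF formula, and the majority vote over `k=3` neighbors are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus; the real project uses its own dataset.
CORPUS = [
    ("win a free prize now", "spam"),
    ("claim your free money today", "spam"),
    ("free prize waiting click now", "spam"),
    ("meeting moved to monday morning", "ham"),
    ("please review the attached report", "ham"),
    ("lunch on monday with the team", "ham"),
]


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real pipelines vary)."""
    return text.lower().split()


def build_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(text, idf):
    """L2-normalized sparse TF-IDF vector; terms unseen in the corpus are dropped."""
    tf = Counter(term for term in tokenize(text) if term in idf)
    vec = {term: count * idf[term] for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {term: w / norm for term, w in vec.items()} if norm else {}


def cosine(a, b):
    """Dot product of two L2-normalized sparse vectors, i.e. cosine similarity."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())


def classify(query, corpus=CORPUS, k=3):
    """Retrieve the k most similar labeled documents and take a majority vote.

    There is no training step: the "model" is the labeled corpus itself.
    """
    idf = build_idf([text for text, _ in corpus])
    q = tfidf(query, idf)
    scored = sorted(
        ((cosine(q, tfidf(text, idf)), text, label) for text, label in corpus),
        key=lambda item: item[0],
        reverse=True,
    )
    neighbors = scored[:k]
    votes = Counter(label for _, _, label in neighbors)
    return votes.most_common(1)[0][0], neighbors


if __name__ == "__main__":
    label, neighbors = classify("free prize money now")
    print(label)
    for score, text, neighbor_label in neighbors:
        print(f"{score:.3f}  {neighbor_label}  {text}")
```

The returned neighbors double as the explanation for a prediction: each one is a labeled document from the corpus together with its similarity score, which is what makes this style of classifier easy to inspect.
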

    uv sync  # select required groups to work with

See the `experiments/` directory for training and evaluation scripts.

    uv run pytest

Published models and datasets are available in the spam-detection collection.