This repository contains the code and findings for an end-to-end machine learning project on authorship attribution for Gujarati literary texts. The project conducts a comparative analysis between a traditional LSTM model and a modern, pre-trained Transformer model on a small, imbalanced, low-resource dataset.
The core finding of this project is the dramatic performance gap between a model trained from scratch and one using transfer learning. On a highly imbalanced dataset, the pre-trained Transformer achieved 97.7% accuracy, while the LSTM reached only a misleading 59.0%, failing to learn the features of the minority-class authors.
| Model | Overall Accuracy | F1-Score (Kalapi) | F1-Score (Mehta) | F1-Score (Meghani) |
|---|---|---|---|---|
| 📉 LSTM | 59.0% | 0.74 | 0.00 | 0.00 |
| 🚀 Transformer | 97.7% | 0.977 | 0.977 | 0.977 |
The Transformer model achieved a 38.7 percentage point improvement in accuracy, demonstrating its superior ability to generalize from limited data. This highlights the power of transfer learning in low-resource scenarios.
This graph clearly shows the LSTM's failure: it scored an F1 of 0.00 for two of the three authors, indicating a biased model that learned to predict only a single author. The Transformer, by contrast, performed robustly across all three classes.
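This failure mode is a classic consequence of class imbalance: a classifier that collapses to the majority class can still post respectable-looking accuracy while being useless for every other class. A minimal sketch illustrates the arithmetic; the class proportions below are hypothetical, chosen only so the majority author makes up 59% of the test set and thus mirrors the reported numbers:

```python
def per_class_f1(y_true, y_pred, label):
    """Compute F1 for one class from lists of true and predicted labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0  # class never correctly predicted -> F1 is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical imbalanced test set: 59 / 21 / 20 samples per author.
y_true = ["Kalapi"] * 59 + ["Mehta"] * 21 + ["Meghani"] * 20
# A degenerate classifier that always predicts the majority author.
y_pred = ["Kalapi"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.3f}")  # 0.590 despite learning nothing useful
for author in ("Kalapi", "Mehta", "Meghani"):
    print(f"F1({author}): {per_class_f1(y_true, y_pred, author):.2f}")
# F1(Kalapi): 0.74, F1(Mehta): 0.00, F1(Meghani): 0.00
```

The per-class F1 scores expose the collapse immediately, which is why the table above reports F1 per author rather than accuracy alone.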
This project successfully demonstrates that for NLP tasks in low-resource languages like Gujarati, transfer learning with pre-trained models is a significantly more effective strategy than training simpler models from scratch, especially when dealing with real-world data imbalance.
The complete methodology, from data scraping to model training and evaluation, is available in the `src` directory for review.