
Neural Stylometry: Authorship Attribution in Gujarati Literature

This repository contains the code and findings for an end-to-end machine learning project on authorship attribution for Gujarati literary texts. The project conducts a comparative analysis between a traditional LSTM model and a modern, pre-trained Transformer model on a small, imbalanced, low-resource dataset.


🧠 Key Finding

The core finding of this project is the dramatic performance gap between a model trained from scratch and one using transfer learning. On a highly imbalanced dataset, the pre-trained Transformer model achieved 97.7% accuracy, while the LSTM reached a misleading 59.0% accuracy: it never learned to distinguish the minority-class authors.


📊 Performance Summary

| Model | Overall Accuracy | F1 (Kalapi) | F1 (Mehta) | F1 (Meghani) |
|---|---|---|---|---|
| 📉 LSTM | 59.0% | 0.74 | 0.00 | 0.00 |
| 🚀 Transformer | 97.7% | 0.977 | 0.977 | 0.977 |

📈 Visualizations

Overall Accuracy Comparison

The Transformer model achieved a 38.7 percentage point improvement in accuracy, demonstrating its superior ability to generalize from limited data. This highlights the power of transfer learning in low-resource scenarios.

F1-Score Comparison per Author

This graph makes the LSTM's failure explicit: a zero F1-score for two of the three authors means the model collapsed to predicting a single author (the majority class). The Transformer, in contrast, performed robustly across all three classes.
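The LSTM's score profile (0.74 / 0.00 / 0.00) is exactly what a degenerate majority-class predictor produces. A minimal sketch, using a *hypothetical* 100-text test split with 59 Kalapi texts (not the project's actual split), shows how such a model earns a passable-looking accuracy while scoring zero F1 on the minority authors:

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for a single class, computed from true/false positives and false negatives."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical imbalanced test split: 59 Kalapi, 21 Mehta, 20 Meghani texts.
y_true = ["Kalapi"] * 59 + ["Mehta"] * 21 + ["Meghani"] * 20
# A collapsed model that always predicts the majority author.
y_pred = ["Kalapi"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.3f}")  # 0.590 -- looks passable in isolation
for author in ["Kalapi", "Mehta", "Meghani"]:
    print(f"F1 ({author}): {per_class_f1(y_true, y_pred, author):.2f}")
# F1 (Kalapi): 0.74, F1 (Mehta): 0.00, F1 (Meghani): 0.00
```

This is why per-class F1, not overall accuracy, is the metric that exposes the LSTM's bias in the table above.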


🧾 Conclusion

This project demonstrates that for NLP tasks in low-resource languages like Gujarati, transfer learning with pre-trained models is a significantly more effective strategy than training simpler models from scratch, especially under real-world class imbalance.

The complete methodology, from data scraping to model training and evaluation, is available in the `src` directory for review.
