PyTorch implementation of Text Style Classification, comparing the accuracy of different architectures in predicting three Italic language styles: Dante, Italian and Neapolitan.

TheMariuolo/Italic_Text_Style_Classification

Text Style Classification of Dante, Italian and Neapolitan: a comparison of CNN, RNN and Transformer-based classifiers 📖 🤔

A PyTorch implementation of Text Style Classification.

The goal of this project is to compare the accuracy of different architectures in predicting three Italic language styles: Dante, Italian and Neapolitan.

The classifiers considered are CNN, RNN and Transformer based:

  • CNN Classifier: made up of three 2D convolutional layers with a 3x3 kernel
  • RNN Classifier: takes the last hidden state of a Recurrent Neural Network and classifies sentences from it
  • GRU Classifier: takes the last hidden state of a Recurrent Neural Network with GRU cells and classifies sentences from it
  • LSTM Classifier: takes the last hidden state of a Recurrent Neural Network with LSTM cells and classifies sentences from it
  • Transformer Classifier: averages the output of a Transformer Encoder over the words in a sentence. It is based on the self-attention mechanism
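
The recurrent and Transformer classifiers described above could be sketched roughly as follows. This is a minimal illustration and not the code in models.py: the class names, layer sizes, the single encoder layer and the omission of positional encodings are all assumptions made for brevity.

```python
import torch
import torch.nn as nn


class LastHiddenClassifier(nn.Module):
    """Hypothetical recurrent classifier: classifies from the last hidden state."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes=3, cell="gru"):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        rnn_cls = {"rnn": nn.RNN, "gru": nn.GRU, "lstm": nn.LSTM}[cell]
        self.rnn = rnn_cls(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer word indices
        _, hidden = self.rnn(self.embedding(tokens))
        if isinstance(hidden, tuple):  # nn.LSTM returns (h_n, c_n)
            hidden = hidden[0]
        return self.fc(hidden[-1])  # hidden[-1] has shape (batch, hidden_dim)


class MeanPoolTransformerClassifier(nn.Module):
    """Hypothetical Transformer classifier: averages encoder outputs over words."""

    def __init__(self, vocab_size, embed_dim, num_heads, num_classes=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=64, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens):
        encoded = self.encoder(self.embedding(tokens))  # (batch, seq_len, embed_dim)
        return self.fc(encoded.mean(dim=1))  # average over the sequence dimension


def count_parameters(model):
    """Number of trainable parameters, as reported in a '# Parameters' column."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


tokens = torch.randint(0, 100, (4, 12))  # toy batch: 4 sentences of 12 word indices
gru = LastHiddenClassifier(vocab_size=100, embed_dim=8, hidden_dim=16, cell="gru")
lstm = LastHiddenClassifier(vocab_size=100, embed_dim=8, hidden_dim=16, cell="lstm")
transformer = MeanPoolTransformerClassifier(vocab_size=100, embed_dim=8, num_heads=2)

gru_logits = gru(tokens)                  # shape (4, 3): one score per style
lstm_logits = lstm(tokens)                # shape (4, 3)
transformer_logits = transformer(tokens)  # shape (4, 3)
print(count_parameters(gru))
```

Swapping the `cell` argument is enough to move between the RNN, GRU and LSTM variants in this sketch, which is why the three recurrent classifiers share one class here.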

The word embedding layer has been initialized using a Word2vec model trained on three corpora, one for each language:

  • Dante: Divina Commedia
  • Italian: Uno, nessuno e centomila by Luigi Pirandello, I Malavoglia by Giovanni Verga
  • Neapolitan: Lo cunto de li cunti by Giambattista Basile

Structure

  • PRE_TRAINED_TEXT_CLASSIFIER.ipynb is the main notebook, containing pre-trained models, classification examples and accuracy computations
  • TEXT_CLASSIFIER.ipynb is a notebook where all the models are defined and trained, with classification examples and accuracy computations
  • models.py is the module with the model class definitions
  • data_config.py is the module containing the functions that build the dataset in the form of torch DataLoaders
  • training_function.py is the module containing the functions for training
  • accuracy.py is the module containing the function that computes model accuracy
  • text_corpus directory contains the three corpora used for training
  • pretrained directory contains the .pth files with the parameters of the pre-trained models
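
To give an idea of what building a dataset "in the form of torch DataLoaders" can involve, here is a minimal sketch. It is not the code in data_config.py: the function names, the padding index and the fixed sentence length are hypothetical choices for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

PAD = 0  # hypothetical index reserved for padding


def build_vocab(tokenized):
    """Map each distinct word to an integer index, keeping 0 for padding."""
    vocab = {"<pad>": PAD}
    for sentence in tokenized:
        for word in sentence:
            vocab.setdefault(word, len(vocab))
    return vocab


def make_dataloader(tokenized, labels, vocab, max_len, batch_size=2, shuffle=True):
    """Convert sentences to padded index tensors and wrap them in a DataLoader."""
    rows = []
    for sentence in tokenized:
        ids = [vocab[word] for word in sentence][:max_len]  # truncate long sentences
        rows.append(ids + [PAD] * (max_len - len(ids)))      # pad short ones
    dataset = TensorDataset(torch.tensor(rows), torch.tensor(labels))
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)


# Toy data: one sentence per style, labels 0 = Dante, 1 = Italian, 2 = Neapolitan.
tokenized = [
    ["nel", "mezzo", "del", "cammin"],
    ["la", "casa", "era", "piccola", "e", "vecchia"],
    ["lo", "cunto", "de", "li", "cunti"],
]
labels = [0, 1, 2]

vocab = build_vocab(tokenized)
loader = make_dataloader(tokenized, labels, vocab, max_len=5, shuffle=False)
first_tokens, first_labels = next(iter(loader))  # shapes (2, 5) and (2,)
```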

Requirements

  • NumPy
  • Matplotlib
  • PyTorch
  • Gensim

Usage

First clone the repository:

git clone git@github.com:MassimoMario/Italic_Text_Style_Classification.git

Make sure you have PyTorch and Gensim installed:

pip install torch
pip install gensim

Run the cells of the TEXT_CLASSIFIER.ipynb notebook if you want to train the models yourself, or run PRE_TRAINED_TEXT_CLASSIFIER.ipynb to use the pre-trained models. Both notebooks include classification examples and accuracy computations after every model section.
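
Outside the notebooks, parameters stored in a .pth file are typically restored with `load_state_dict`. The sketch below shows the round trip with a stand-in `nn.Linear` model and a hypothetical file name written to a temporary directory; the actual file names in pretrained and the constructor arguments of the real models are not assumed here.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 3)  # stand-in for one of the classifiers

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_classifier.pth")  # hypothetical file name
    torch.save(model.state_dict(), path)                # save parameters only

    # The model must be rebuilt with the same architecture before loading.
    restored = nn.Linear(4, 3)
    restored.load_state_dict(torch.load(path, map_location="cpu"))

restored.eval()  # switch to inference mode before classifying
```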

Results

Here are the training curves for the five classifiers:

Prediction accuracies evaluated on test datasets with 1017 sentences for each style:

| Classifier  | # Parameters | Dante  | Italian | Neapolitan | Overall |
|-------------|--------------|--------|---------|------------|---------|
| CNN         | 8139         | 98.13% | 99.21%  | 99.80%     | 99.04%  |
| RNN         | 3107         | 96.95% | 97.64%  | 96.20%     | 96.93%  |
| GRU         | 8067         | 96.26% | 99.11%  | 99.00%     | 98.12%  |
| LSTM        | 10547        | 97.93% | 99.41%  | 99.40%     | 98.91%  |
| Transformer | 3188599      | 98.91% | 99.41%  | 99.80%     | 99.37%  |
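
Per-style and overall accuracies like those in the table can be computed as in the following sketch. The function names are hypothetical and this is not the code in accuracy.py. Because every style has the same number of test sentences, the overall accuracy equals the mean of the per-style accuracies, which appears consistent with the Overall column.

```python
def per_class_accuracy(predictions, targets, num_classes):
    """Percentage of correctly classified sentences for each true class."""
    correct = [0] * num_classes
    total = [0] * num_classes
    for predicted, target in zip(predictions, targets):
        total[target] += 1
        correct[target] += int(predicted == target)
    return [100.0 * c / n if n else float("nan") for c, n in zip(correct, total)]


def overall_accuracy(predictions, targets):
    """Percentage of correctly classified sentences over the whole test set."""
    hits = sum(int(p == t) for p, t in zip(predictions, targets))
    return 100.0 * hits / len(targets)


# Toy example with two sentences per class (0 = Dante, 1 = Italian, 2 = Neapolitan).
targets = [0, 0, 1, 1, 2, 2]
predictions = [0, 1, 1, 1, 2, 2]

by_class = per_class_accuracy(predictions, targets, num_classes=3)
overall = overall_accuracy(predictions, targets)
mean_of_classes = sum(by_class) / len(by_class)  # equals `overall` for balanced classes

# Mean of the three per-style accuracies reported for the CNN row of the table.
cnn_row_mean = (98.13 + 99.21 + 99.80) / 3
```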

Bonus

As Italians, we learn at school from a young age that Inferno, Purgatorio and Paradiso, the three main parts of the Divina Commedia, were written in three different styles.

Can these models capture these stylistic differences?

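| Classifier  | Inferno | Purgatorio | Paradiso | Overall |
|-------------|---------|------------|----------|---------|
| CNN         | 57.01%  | 20.82%     | 70.65%   | 49.49%  |
| RNN         | 52.33%  | 24.04%     | 38.32%   | 38.23%  |
| GRU         | 47.66%  | 28.16%     | 50.00%   | 41.93%  |
| LSTM        | 59.35%  | 25.51%     | 63.17%   | 49.34%  |
| Transformer | 48.53%  | 39.88%     | 60.77%   | 49.73%  |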

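Judging by these accuracies, it seems they can't 😞 The overall accuracies range from about 38% to 50%, not far above the roughly 33% expected from guessing among three equally represented classes.

Even so, the models perform much better on the Inferno and Paradiso test sets than on the Purgatorio test set. I guess this is because Inferno and Paradiso have a more recognizable writing style, given by Dante's precise choice of words.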
