
Tweet classification

Code organization

The repository is organized as follows:

ml-project-2-cencorecarre/
    ├── best_roberta_tokenizer/
    │   ├── merges.txt
    │   ├── special_tokens_map.json
    │   ├── tokenizer_config.json
    │   └── vocab.json
    │
    ├── models/
    │   ├── neural_networks/
    │   │   ├── best_models/
    │   │   ├── predictions/
    │   │   │   ├── pred_cnn.csv
    │   │   │   ├── pred_gru.csv
    │   │   │   └── pred_lstm.csv
    │   │   ├── cnn.py
    │   │   └── rnn.py
    │   └── __init__.py
    │
    ├── src/
    │   ├── embeddings/
    │   │   ├── build_vocab.sh
    │   │   ├── cooc.pkl
    │   │   ├── cooc.py
    │   │   ├── cut_vocab.sh
    │   │   ├── glove_solution.py
    │   │   ├── glove_template.py
    │   │   ├── pickle_vocab.py
    │   │   ├── vocab_cut.txt
    │   │   ├── vocab_full.txt
    │   │   └── vocab.pkl
    │   ├── resources/
    │   ├── init.py
    │   ├── train_or_evaluate.py
    │   └── utils.py
    │
    ├── twitter-datasets/
    │   ├── sample_submission.csv
    │   ├── test_data.txt
    │   ├── train_neg_full.txt
    │   ├── train_neg.txt
    │   ├── train_pos_full.txt
    │   └── train_pos.txt
    │
    ├── .gitattributes
    ├── .gitignore
    ├── best_roberta_model.pt
    ├── final_predictions.csv
    ├── README.md
    ├── requirements.txt
    ├── roBERTA.ipynb
    ├── run.py
    └── teetSADataset.py

Requirements

Before running the code, run pip install -r requirements.txt to install all the necessary packages.

Generate Embeddings and Prepare tweets.pkl

Before training and evaluating models, you need to execute the init.py script.

Run: python init.py --size <size> --dim 200, where <size> = 'normal' or 'full'.

(Example: python init.py --size normal --dim 200)

This script initializes the necessary resources in the resources folder, including the pre-trained embeddings and the tweets.pkl file. The following files will be generated in the resources directory:

    resources/
    ├── trained_w2v_embeddings_200.txt
    └── tweets.pkl
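
As a quick sanity check, the generated files can be loaded as in the sketch below. This is a minimal illustration, not code from the repository: it assumes the embeddings are stored in word2vec text format and that tweets.pkl is a plain pickle, which the file names suggest but this README does not guarantee (the resources/ path follows the tree above).

    import pickle

    from gensim.models import KeyedVectors

    # Assumed word2vec text format for the generated embeddings.
    embeddings = KeyedVectors.load_word2vec_format(
        "src/resources/trained_w2v_embeddings_200.txt", binary=False
    )
    print(embeddings.vector_size)  # expected: 200

    # Assumed plain pickle for the preprocessed tweets.
    with open("src/resources/tweets.pkl", "rb") as f:
        tweets = pickle.load(f)
    print(type(tweets), len(tweets))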

Train and Evaluate Neural Network Models

After successfully creating the required resources, you can use the train_or_evaluate.py script to train or evaluate a neural network model. The script automatically saves the trained model if it achieves better accuracy than the previously saved model of the same type (see the sketch after the commands below).

To train, run: python train_or_evaluate.py --mode train --model_type <model_type> --size <size> (Example: python train_or_evaluate.py --mode train --model_type cnn --size normal)

To evaluate, run: python train_or_evaluate.py --mode evaluate --model_type <model_type> --size <size> (Example: python train_or_evaluate.py --mode evaluate --model_type cnn --size normal)

<model_type> = 'cnn', 'lstm', 'bi_lstm', or 'gru'; <size> = 'normal' or 'full' (must be the same size passed to init.py in the previous step).
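
The save-if-better behaviour could look roughly like the following. This is a sketch under assumptions, not the actual train_or_evaluate.py code: the checkpoint file name under models/neural_networks/best_models/ and the saved dictionary keys are made up for the example.

    import os

    import torch

    def save_if_better(model, model_type, accuracy,
                       ckpt_dir="models/neural_networks/best_models"):
        """Save the model only if it beats the previous best of the same type."""
        os.makedirs(ckpt_dir, exist_ok=True)
        ckpt_path = os.path.join(ckpt_dir, f"best_{model_type}.pt")  # hypothetical name
        best_acc = -1.0
        if os.path.exists(ckpt_path):
            best_acc = torch.load(ckpt_path, map_location="cpu")["accuracy"]
        if accuracy > best_acc:
            torch.save({"state_dict": model.state_dict(), "accuracy": accuracy},
                       ckpt_path)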

Train and Evaluate the RoBERTa Model

roBERTA.ipynb is a notebook for training and evaluating a RoBERTa model for tweet classification. The task involves classifying tweets as positive or negative.


Prerequisites

  1. Environment: Use Google Colab or a local GPU-enabled setup for faster training. We used a T4 GPU.
  2. Libraries: Install transformers, torch, scikit-learn, pandas, matplotlib, and seaborn.
  3. Data: Place train_pos_full.txt, train_neg_full.txt, and test_data.txt in a twitter-datasets folder located in the same folder as roBERTA.ipynb.

Usage

The notebook is divided into five main sections:

  1. Colab Setup:

    • Ensures all dependencies are installed and sets up the environment for training (including GPU initialization).
  2. Dataset Setup and Initial Functions:

    • Contains preprocessing functions to clean and tokenize tweets.
    • Splits data into training and validation sets using TweetSADataset.
  3. Cross-Validation for Best Hyperparameters:

    • Uses stratified k-fold (k=3) cross-validation to tune key hyperparameters (see the first sketch after this list):
      • Learning rate (lr)
      • Warmup percentage (warmup_percent)
      • Epochs (epochs)
    • Stores cross-validation results in cv_results.csv and visualizes them with scatterplots (e.g., macro-F1 and validation loss).
  4. Final Training with Best Parameters:

    • Once the best hyperparameters are found, the model is fine-tuned on the entire training set.
    • The final model and tokenizer are saved as best_roberta_model.pt and best_roberta_tokenizer.
  5. Final Test Data Prediction:

    • The fine-tuned model is used to predict the labels for test_data.txt.
    • Converts the predictions into a CSV file (predictions.csv) for submission on AIcrowd. The format is <tweet_id>, <prediction>; the second sketch after this list shows one way to write it.
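
The cross-validation in section 3 boils down to a grid search over the three hyperparameters with a stratified 3-fold split. The sketch below assumes scikit-learn, numpy, and pandas; train_one_config is a hypothetical helper standing in for the notebook's fine-tuning code, and the grid values and cv_results.csv column names are illustrative, not taken from the notebook.

    import itertools

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import StratifiedKFold

    def cross_validate(texts, labels, train_one_config):
        """Grid-search lr / warmup_percent / epochs with stratified 3-fold CV."""
        grid = {"lr": [1e-5, 2e-5], "warmup_percent": [0.06, 0.1], "epochs": [2, 3]}
        labels = np.asarray(labels)
        skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
        rows = []
        for lr, warmup, epochs in itertools.product(*grid.values()):
            fold_f1, fold_loss = [], []
            for train_idx, val_idx in skf.split(texts, labels):
                # train_one_config (hypothetical) fine-tunes RoBERTa on the fold's
                # training split and returns (macro_f1, val_loss) on its val split.
                f1, loss = train_one_config(
                    [texts[i] for i in train_idx], labels[train_idx],
                    [texts[i] for i in val_idx], labels[val_idx],
                    lr=lr, warmup_percent=warmup, epochs=epochs,
                )
                fold_f1.append(f1)
                fold_loss.append(loss)
            rows.append({"lr": lr, "warmup_percent": warmup, "epochs": epochs,
                         "macro_f1": np.mean(fold_f1), "val_loss": np.mean(fold_loss)})
        pd.DataFrame(rows).to_csv("cv_results.csv", index=False)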
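
Writing the submission file in section 5 amounts to pairing each tweet id with its predicted label. Below is a minimal pandas sketch; ids starting at 1 and the Id/Prediction column names are assumptions based on the usual AIcrowd format, so check sample_submission.csv for the exact convention.

    import pandas as pd

    def write_submission(predictions, path="predictions.csv"):
        """Write predictions as <tweet_id>, <prediction> rows for AIcrowd."""
        predictions = list(predictions)
        # Ids starting at 1 and the Id/Prediction column names are assumptions;
        # verify against sample_submission.csv before submitting.
        pd.DataFrame({
            "Id": range(1, len(predictions) + 1),
            "Prediction": predictions,
        }).to_csv(path, index=False)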

Outputs

  • Model: Saved as best_roberta_model.pt.
  • Tokenizer: Saved as best_roberta_tokenizer.
  • Results: Cross-validation results in cv_results.csv.
  • Predictions: Final test predictions saved in predictions.csv.
