The repository is organized as follows:
```
ml-project-2-cencorecarre/
├── best_roberta_tokenizer/
│   ├── merges.txt
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json
│   └── vocab.json
│
├── models/
│   ├── neural_networks/
│   │   ├── best_models/
│   │   ├── predictions/
│   │   │   ├── pred_cnn.csv
│   │   │   ├── pred_gru.csv
│   │   │   └── pred_lstm.csv
│   │   ├── cnn.py
│   │   └── rnn.py
│   └── __init__.py
│
├── src/
│   ├── embeddings/
│   │   ├── build_vocab.sh
│   │   ├── cooc.pkl
│   │   ├── cooc.py
│   │   ├── cut_vocab.sh
│   │   ├── glove_solution.py
│   │   ├── glove_template.py
│   │   ├── pickle_vocab.py
│   │   ├── vocab_cut.txt
│   │   ├── vocab_full.txt
│   │   └── vocab.pkl
│   ├── resources/
│   ├── init.py
│   ├── train_or_evaluate.py
│   └── utils.py
│
├── twitter-datasets/
│   ├── sample_submission.csv
│   ├── test_data.txt
│   ├── train_neg_full.txt
│   ├── train_neg.txt
│   ├── train_pos_full.txt
│   └── train_pos.txt
│
├── .gitattributes
├── .gitignore
├── best_roberta_model.pt
├── final_predictions.csv
├── README.md
├── requirements.txt
├── roBERTA.ipynb
├── run.py
└── teetSADataset.py
```
Before running the code, make sure to run `pip install -r requirements.txt` so that all the necessary packages are installed.
Before training and evaluating models, you need to execute the `init.py` script:

```
python init.py --size <size> --dim 200
```

where `<size>` is `normal` or `full` (example: `python init.py --size normal --dim 200`).
This script initializes the necessary resources in the `resources` folder, including the pre-trained embeddings and the `tweet.pkl` file. The following files will be generated in the `resources` directory:

```
resources/
├── trained_w2v_embeddings_200.txt
└── tweet.pkl
```
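The embeddings file can be read back with a small loader. This is a sketch, not code from the repository, and it assumes the common word2vec text layout (one token per line followed by its whitespace-separated vector components, no header line); `load_embeddings` and the toy file name are illustrative:

```python
from pathlib import Path

def load_embeddings(path):
    """Parse a word2vec-style text file: one token per line,
    followed by whitespace-separated float components."""
    embeddings = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings

# Tiny self-contained demo with a 3-dimensional toy file
# (the real file would be trained_w2v_embeddings_200.txt).
Path("toy_embeddings.txt").write_text("hello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
emb = load_embeddings("toy_embeddings.txt")
print(len(emb), len(emb["hello"]))  # → 2 3
```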
After successfully creating the required resources, you can use the train_or_evaluate.py script to train or evaluate a neural network model. The script automatically saves the trained model if it achieves better accuracy than the previously saved model of the same type.
To train, run:

```
python train_or_evaluate.py --mode train --model_type <model_type> --size <size>
```

(example: `python train_or_evaluate.py --mode train --model_type cnn --size normal`)

To evaluate, run:

```
python train_or_evaluate.py --mode evaluate --model_type <model_type> --size <size>
```

(example: `python train_or_evaluate.py --mode evaluate --model_type cnn --size normal`)

`<model_type>` is one of `cnn`, `lstm`, `bi_lstm`, or `gru`; `<size>` is `normal` or `full` and must match the `--size` passed to `init.py`.
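The command-line interface above could be built with `argparse` along the following lines. This parser is an illustrative reconstruction of the flags described in this README, not the actual contents of `train_or_evaluate.py`:

```python
import argparse

def build_parser():
    # Illustrative reconstruction of the CLI described in the README;
    # the real train_or_evaluate.py may differ in details.
    parser = argparse.ArgumentParser(description="Train or evaluate a model.")
    parser.add_argument("--mode", choices=["train", "evaluate"], required=True)
    parser.add_argument("--model_type",
                        choices=["cnn", "lstm", "bi_lstm", "gru"], required=True)
    parser.add_argument("--size", choices=["normal", "full"], required=True,
                        help="Must match the --size used when running init.py.")
    return parser

args = build_parser().parse_args(
    ["--mode", "train", "--model_type", "cnn", "--size", "normal"])
print(args.mode, args.model_type, args.size)  # → train cnn normal
```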
`roBERTA.ipynb` is a notebook for training and evaluating a RoBERTa model for tweet classification. The task involves classifying tweets as positive or negative.
- Environment: Use Google Colab or a local GPU-enabled setup for faster training. We used the T4 GPU.
- Libraries: Install `transformers`, `torch`, `scikit-learn`, `pandas`, `matplotlib`, and `seaborn`.
- Data: Place `train_pos_full.txt`, `train_neg_full.txt`, and `test_data.txt` in a `twitter-datasets` folder that is in the same folder as `roBERTA.ipynb`.
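The training files hold one raw tweet per line (positives and negatives in separate files). A minimal loading sketch, using toy lists in place of the real files; the 1/-1 labels and the `<id>,<tweet>` layout assumed for `test_data.txt` are my reading of the setup, so verify them against the actual files:

```python
def load_labeled(pos_lines, neg_lines):
    """Pair each tweet with a label: 1 for positive, -1 for negative
    (assumed label convention -- check against the actual pipeline)."""
    return [(t, 1) for t in pos_lines] + [(t, -1) for t in neg_lines]

def parse_test_line(line):
    """Assumed test_data.txt format: '<id>,<tweet>' (comma after the id)."""
    idx, tweet = line.split(",", 1)
    return int(idx), tweet

# Toy stand-ins for train_pos_full.txt / train_neg_full.txt / test_data.txt.
pos = ["so happy today", "great match"]
neg = ["worst day ever"]
data = load_labeled(pos, neg)
print(len(data), data[0])                 # → 3 ('so happy today', 1)
print(parse_test_line("1,hello world"))   # → (1, 'hello world')
```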
The notebook is divided into five main sections:
1. **Colab Setup**:
   - Ensures all dependencies are installed and sets up the environment for training (including GPU initialization).
2. **Dataset Setup and Initial Functions**:
   - Contains preprocessing functions to clean and tokenize tweets.
   - Splits data into training and validation sets using `TweetSADataset`.
3. **Cross-Validation for Best Hyperparameters**:
   - Uses stratified k-fold (k=3) cross-validation to tune key hyperparameters:
     - Learning rate (`lr`)
     - Warmup percentage (`warmup_percent`)
     - Epochs (`epochs`)
   - Stores cross-validation results in `cv_results.csv` and visualizes them with scatterplots (e.g., macro-F1 and validation loss).
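The cross-validation loop can be sketched with scikit-learn's `StratifiedKFold`. The grid values and the score placeholder below are illustrative, not the values actually searched in the notebook:

```python
from itertools import product
from sklearn.model_selection import StratifiedKFold

# Placeholder hyperparameter grid -- the notebook's actual search values differ.
grid = {"lr": [1e-5, 2e-5], "warmup_percent": [0.1], "epochs": [2]}

X = list(range(12))   # stand-in tweet indices
y = [1, -1] * 6       # balanced toy labels
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

results = []
for lr, warmup, epochs in product(*grid.values()):
    fold_scores = []
    for train_idx, val_idx in skf.split(X, y):
        # Here the real notebook fine-tunes RoBERTa on the train fold and
        # computes macro-F1 / validation loss on the validation fold.
        fold_scores.append(0.0)  # placeholder score
    results.append({"lr": lr, "warmup_percent": warmup, "epochs": epochs,
                    "mean_score": sum(fold_scores) / len(fold_scores)})

print(len(results))  # one row per hyperparameter combination → 2
```

The best-scoring row of `results` is what gets carried into the final training step.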
4. **Final Training with Best Parameters**:
   - Once the best hyperparameters are found, the model is fine-tuned on the entire training set.
   - The final model and tokenizer are saved as `best_roberta_model.pt` and `best_roberta_tokenizer`.
5. **Final Test Data Prediction**:
   - The fine-tuned model is used to predict the labels for `test_data.txt`.
   - Converts the predictions into a CSV file (`predictions.csv`) for submission on AIcrowd. The format is `<tweet_id>, <prediction>`.
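Writing the submission file from that last step can be sketched with the standard `csv` module. The `Id`/`Prediction` header names are an assumption based on common AIcrowd submissions, so check `sample_submission.csv` for the exact expected header:

```python
import csv

def write_submission(ids, preds, path="predictions.csv"):
    """Write one '<tweet_id>, <prediction>' row per test tweet.
    Header names are assumed; verify against sample_submission.csv."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Prediction"])
        for i, p in zip(ids, preds):
            writer.writerow([i, p])

# Toy example: three predictions (1 = positive, -1 = negative).
write_submission([1, 2, 3], [1, -1, 1])
with open("predictions.csv") as f:
    print(f.read().splitlines()[:2])  # → ['Id,Prediction', '1,1']
```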
- Model: Saved as `best_roberta_model.pt`.
- Tokenizer: Saved as `best_roberta_tokenizer`.
- Results: Cross-validation results in `cv_results.csv`.
- Predictions: Final test predictions saved in `predictions.csv`.