The repository is organized as follows:
```
ml-project-2-cencorecarre/
├── best_roberta_tokenizer/
│   ├── merges.txt
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json
│   └── vocab.json
│
├── models/
│   ├── neural_networks/
│   │   ├── best_models/
│   │   ├── predictions/
│   │   │   ├── pred_cnn.csv
│   │   │   ├── pred_gru.csv
│   │   │   └── pred_lstm.csv
│   │   ├── cnn.py
│   │   └── rnn.py
│   └── __init__.py
│
├── src/
│   ├── embeddings/
│   │   ├── build_vocab.sh
│   │   ├── cooc.pkl
│   │   ├── cooc.py
│   │   ├── cut_vocab.sh
│   │   ├── glove_solution.py
│   │   ├── glove_template.py
│   │   ├── pickle_vocab.py
│   │   ├── vocab_cut.txt
│   │   ├── vocab_full.txt
│   │   └── vocab.pkl
│   ├── resources/
│   ├── init.py
│   ├── train_or_evaluate.py
│   └── utils.py
│
├── twitter-datasets/
│   ├── sample_submission.csv
│   ├── test_data.txt
│   ├── train_neg_full.txt
│   ├── train_neg.txt
│   ├── train_pos_full.txt
│   └── train_pos.txt
│
├── .gitattributes
├── .gitignore
├── best_roberta_model.pt
├── final_predictions.csv
├── README.md
├── requirements.txt
├── roBERTA.ipynb
├── run.py
└── teetSADataset.py
```
Before running the code, make sure to run `pip install -r requirements.txt` so that all the necessary packages are installed.
Before training and evaluating models, you need to execute the `init.py` script:

```
python init.py --size <size> --dim 200
```

where `<size>` is `normal` or `full` (example: `python init.py --size normal --dim 200`).
This script initializes the necessary resources in the `resources` folder, including the pre-trained embeddings and the `tweet.pkl` file. The following files will be generated in the `resources` directory:

```
resources/
├── trained_w2v_embeddings_200.txt
└── tweet.pkl
```
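The embeddings file can be read back with a small loader. This is a sketch, not code from the repository, and it assumes the common word2vec text layout (one token per line followed by its whitespace-separated vector components, no header line); `load_embeddings` and the toy file name are illustrative:

```python
from pathlib import Path

def load_embeddings(path):
    """Parse a word2vec-style text file: one token per line,
    followed by whitespace-separated float components."""
    embeddings = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings

# Tiny self-contained demo with a 3-dimensional toy file
# (the real file would be trained_w2v_embeddings_200.txt).
Path("toy_embeddings.txt").write_text("hello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
emb = load_embeddings("toy_embeddings.txt")
print(len(emb), len(emb["hello"]))  # → 2 3
```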
After successfully creating the required resources, you can use the train_or_evaluate.py script to train or evaluate a neural network model. The script automatically saves the trained model if it achieves better accuracy than the previously saved model of the same type.
To train, run:

```
python train_or_evaluate.py --mode train --model_type <model_type> --size <size>
```

(example: `python train_or_evaluate.py --mode train --model_type cnn --size normal`)

To evaluate, run:

```
python train_or_evaluate.py --mode evaluate --model_type <model_type> --size <size>
```

(example: `python train_or_evaluate.py --mode evaluate --model_type cnn --size normal`)

`<model_type>` is one of `cnn`, `lstm`, `bi_lstm`, or `gru`; `<size>` is `normal` or `full` and must match the `--size` passed to `init.py`.
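The command-line interface above could be built with `argparse` along the following lines. This parser is an illustrative reconstruction of the flags described in this README, not the actual contents of `train_or_evaluate.py`:

```python
import argparse

def build_parser():
    # Illustrative reconstruction of the CLI described in the README;
    # the real train_or_evaluate.py may differ in details.
    parser = argparse.ArgumentParser(description="Train or evaluate a model.")
    parser.add_argument("--mode", choices=["train", "evaluate"], required=True)
    parser.add_argument("--model_type",
                        choices=["cnn", "lstm", "bi_lstm", "gru"], required=True)
    parser.add_argument("--size", choices=["normal", "full"], required=True,
                        help="Must match the --size used when running init.py.")
    return parser

args = build_parser().parse_args(
    ["--mode", "train", "--model_type", "cnn", "--size", "normal"])
print(args.mode, args.model_type, args.size)  # → train cnn normal
```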
`roBERTA.ipynb` is a notebook for training and evaluating a RoBERTa model for tweet classification. The task involves classifying tweets as positive or negative.
- Environment: Use Google Colab or a local GPU-enabled setup for faster training. We used the T4 GPU.
- Libraries: Install `transformers`, `torch`, `scikit-learn`, `pandas`, `matplotlib`, and `seaborn`.
- Data: Place `train_pos_full.txt`, `train_neg_full.txt`, and `test_data.txt` in a `twitter-datasets` folder that is in the same folder as `roBERTA.ipynb`.
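The training files hold one raw tweet per line (positives and negatives in separate files). A minimal loading sketch, using toy lists in place of the real files; the 1/-1 labels and the `<id>,<tweet>` layout assumed for `test_data.txt` are my reading of the setup, so verify them against the actual files:

```python
def load_labeled(pos_lines, neg_lines):
    """Pair each tweet with a label: 1 for positive, -1 for negative
    (assumed label convention -- check against the actual pipeline)."""
    return [(t, 1) for t in pos_lines] + [(t, -1) for t in neg_lines]

def parse_test_line(line):
    """Assumed test_data.txt format: '<id>,<tweet>' (comma after the id)."""
    idx, tweet = line.split(",", 1)
    return int(idx), tweet

# Toy stand-ins for train_pos_full.txt / train_neg_full.txt / test_data.txt.
pos = ["so happy today", "great match"]
neg = ["worst day ever"]
data = load_labeled(pos, neg)
print(len(data), data[0])                 # → 3 ('so happy today', 1)
print(parse_test_line("1,hello world"))   # → (1, 'hello world')
```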
The notebook is divided into five main sections:
1. **Colab Setup**:
   - Ensures all dependencies are installed and sets up the environment for training (including GPU initialization).
2. **Dataset Setup and Initial Functions**:
   - Contains preprocessing functions to clean and tokenize tweets.
   - Splits data into training and validation sets using `TweetSADataset`.
3. **Cross-Validation for Best Hyperparameters**:
   - Uses stratified k-fold (k=3) cross-validation to tune key hyperparameters:
     - Learning rate (`lr`)
     - Warmup percentage (`warmup_percent`)
     - Epochs (`epochs`)
   - Stores cross-validation results in `cv_results.csv` and visualizes them with scatterplots (e.g., macro-F1 and validation loss).
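The cross-validation loop can be sketched with scikit-learn's `StratifiedKFold`. The grid values and the score placeholder below are illustrative, not the values actually searched in the notebook:

```python
from itertools import product
from sklearn.model_selection import StratifiedKFold

# Placeholder hyperparameter grid -- the notebook's actual search values differ.
grid = {"lr": [1e-5, 2e-5], "warmup_percent": [0.1], "epochs": [2]}

X = list(range(12))   # stand-in tweet indices
y = [1, -1] * 6       # balanced toy labels
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

results = []
for lr, warmup, epochs in product(*grid.values()):
    fold_scores = []
    for train_idx, val_idx in skf.split(X, y):
        # Here the real notebook fine-tunes RoBERTa on the train fold and
        # computes macro-F1 / validation loss on the validation fold.
        fold_scores.append(0.0)  # placeholder score
    results.append({"lr": lr, "warmup_percent": warmup, "epochs": epochs,
                    "mean_score": sum(fold_scores) / len(fold_scores)})

print(len(results))  # one row per hyperparameter combination → 2
```

The best-scoring row of `results` is what gets carried into the final training step.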
4. **Final Training with Best Parameters**:
   - Once the best hyperparameters are found, the model is fine-tuned on the entire training set.
   - The final model and tokenizer are saved as `best_roberta_model.pt` and `best_roberta_tokenizer`.
5. **Final Test Data Prediction**:
   - The fine-tuned model is used to predict the labels for `test_data.txt`.
   - Converts the predictions into a CSV file (`predictions.csv`) for submission on AIcrowd. The format is `<tweet_id>, <prediction>`.
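Writing the submission file from that last step can be sketched with the standard `csv` module. The `Id`/`Prediction` header names are an assumption based on common AIcrowd submissions, so check `sample_submission.csv` for the exact expected header:

```python
import csv

def write_submission(ids, preds, path="predictions.csv"):
    """Write one '<tweet_id>, <prediction>' row per test tweet.
    Header names are assumed; verify against sample_submission.csv."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Prediction"])
        for i, p in zip(ids, preds):
            writer.writerow([i, p])

# Toy example: three predictions (1 = positive, -1 = negative).
write_submission([1, 2, 3], [1, -1, 1])
with open("predictions.csv") as f:
    print(f.read().splitlines()[:2])  # → ['Id,Prediction', '1,1']
```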
- Model: Saved as `best_roberta_model.pt`.
- Tokenizer: Saved as `best_roberta_tokenizer`.
- Results: Cross-validation results in `cv_results.csv`.
- Predictions: Final test predictions saved in `predictions.csv`.