SamEval-2024_Task-8_M4

Requirements

Shared task data for training the model
Python 3.10 or higher
For all the required packages, see the requirements.txt file
spaCy needs the English language model, which you can download with python -m spacy download en_core_web_sm

Add data to the project

To run the code the data needs to be added to the data folder. Download the datasets from the Google Drive folder. Extract the files and make sure the folder structure looks like this:

./data
├── .gitkeep
├── SubtaskA
│   ├── subtaskA_dev_monolingual.jsonl
│   ├── subtaskA_dev_multilingual.jsonl
│   ├── subtaskA_train_monolingual.jsonl
│   └── subtaskA_train_multilingual.jsonl
├── SubtaskB
│   ├── subtaskB_dev.jsonl
│   └── subtaskB_train.jsonl
└── SubtaskC
    ├── subtaskC_dev.jsonl
    └── subtaskC_train.jsonl

With this structure, scripts that use hardcoded file paths will find the data without errors. The examples for scripts that take command line arguments also assume this structure.
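Before running anything, it can help to confirm that the extracted files ended up in the right place. The sketch below checks for the files listed in the folder structure above. The helper name missing_data_files is hypothetical and not part of this repository.

```python
from pathlib import Path

# Expected files, copied from the folder structure listed above.
EXPECTED_FILES = [
    "SubtaskA/subtaskA_dev_monolingual.jsonl",
    "SubtaskA/subtaskA_dev_multilingual.jsonl",
    "SubtaskA/subtaskA_train_monolingual.jsonl",
    "SubtaskA/subtaskA_train_multilingual.jsonl",
    "SubtaskB/subtaskB_dev.jsonl",
    "SubtaskB/subtaskB_train.jsonl",
    "SubtaskC/subtaskC_dev.jsonl",
    "SubtaskC/subtaskC_train.jsonl",
]


def missing_data_files(data_dir="data"):
    """Return the expected files that are absent from data_dir (hypothetical helper)."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing files:")
        for name in missing:
            print(f"  {name}")
    else:
        print("All expected data files are present.")
```

Run it from the project root so that the relative data path resolves correctly.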

Main workflow

The main workflow is as follows:

Create features for the training and test data with the create_features.py file
Vectorize the features with the vectorize.py file
Train the model and test the model with the model.py file

To test which features work best, you can use the run.py file, which trains and evaluates models on all combinations of features, using different traditional classifiers and different neural networks with combinations of hyperparameters.
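To give a sense of how quickly "all combinations of features" grows, here is a minimal sketch that enumerates every non-empty subset of a feature list. It only illustrates the idea; how run.py enumerates combinations internally is not described here, and the function name feature_combinations is an assumption. The feature names are a subset of those used in the create_features.py examples.

```python
from itertools import combinations

# A few feature names taken from the create_features.py examples in this README.
FEATURES = ["tense", "voice", "sentiment", "pos-tags", "dep-tags"]


def feature_combinations(features):
    """Yield every non-empty subset of features, smallest subsets first."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)


if __name__ == "__main__":
    combos = list(feature_combinations(FEATURES))
    print(len(combos))  # 31, i.e. 2**5 - 1 non-empty subsets
    print(combos[0])    # ('tense',)
```

With n features there are 2**n - 1 non-empty subsets, so the number of runs roughly doubles with every feature added.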

The test.py file can be used to train the model and retrieve the predictions for the test data.

Create features

The examples below create features for all the feature options we have.

Train features:

python create_features.py --input data/SubtaskA/subtaskA_train_monolingual.jsonl --output features/SubtaskA/train_monolingual --features domain tense voice sentiment named-entities pronouns pos-tags dep-tags

Dev features:

python create_features.py --input data/SubtaskA/subtaskA_dev_monolingual.jsonl --output features/SubtaskA/dev_monolingual --features domain tense voice sentiment named-entities pronouns pos-tags dep-tags

Test features:

python create_features.py --input test_data/subtaskA_monolingual.jsonl --output features/SubtaskA/test_monolingual --features tense voice pronouns named-entities sentiment pos-tags dep-tags sentences sentence-similarity

The train and dev features for Subtask A have already been created and can be found on Google Drive, together with the vectors we used for training and testing the model on the shared task data.

For more help on the arguments, run python create_features.py --help
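If you want to generate the train and dev commands from one place, a small helper like the one below can build them. This is a sketch: the function name create_features_command is hypothetical, and it uses only the --input, --output and --features arguments that appear in the examples above, together with the Subtask A monolingual file naming from the data folder.

```python
import shlex

# Feature names from the train example above.
FEATURES = (
    "domain", "tense", "voice", "sentiment",
    "named-entities", "pronouns", "pos-tags", "dep-tags",
)


def create_features_command(split, features=FEATURES):
    """Build the create_features.py command for a Subtask A monolingual split."""
    return [
        "python", "create_features.py",
        "--input", f"data/SubtaskA/subtaskA_{split}_monolingual.jsonl",
        "--output", f"features/SubtaskA/{split}_monolingual",
        "--features", *features,
    ]


if __name__ == "__main__":
    for split in ("train", "dev"):
        # Print the command as a single shell-ready line.
        print(shlex.join(create_features_command(split)))
```

The returned list can also be passed directly to subprocess.run(command, check=True) when run from the project root.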

Vectorize features

The vectorizing examples use only the sentiment feature.

Train features:

python vectorize.py --output vectors/SubtaskA/train_monolingual --input features/SubtaskA/train_monolingual --features sentiment

Dev features:

python vectorize.py --output vectors/SubtaskA/dev_monolingual --input features/SubtaskA/dev_monolingual --features sentiment --vectorizer vectors/SubtaskA/train_monolingual

For more help on the arguments, run python vectorize.py --help
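In the dev example, --vectorizer points at the train output directory, which suggests that the vectorizer fitted on the train features is reused for the dev features. The toy sketch below shows why that matters: columns are fixed by the training data, so dev vectors line up with train vectors. The ToyVectorizer class is purely illustrative and is an assumption; it is not the vectorizer that vectorize.py uses.

```python
class ToyVectorizer:
    """Toy count vectorizer (illustration only, not the project's code)."""

    def fit(self, samples):
        # Assign a fixed column index to every value seen in the training data.
        values = sorted({value for sample in samples for value in sample})
        self.index = {value: i for i, value in enumerate(values)}
        return self

    def transform(self, samples):
        # Count known values per sample; values unseen during fit are ignored.
        vectors = []
        for sample in samples:
            row = [0] * len(self.index)
            for value in sample:
                if value in self.index:
                    row[self.index[value]] += 1
            vectors.append(row)
        return vectors


if __name__ == "__main__":
    train = [["NOUN", "VERB"], ["NOUN", "ADJ", "NOUN"]]
    dev = [["VERB", "ADV", "NOUN"]]  # ADV never occurs in train

    vectorizer = ToyVectorizer().fit(train)
    print(vectorizer.transform(train))  # [[0, 1, 1], [1, 2, 0]]
    print(vectorizer.transform(dev))    # [[0, 1, 1]]
```

Fitting a separate vectorizer on the dev data would produce columns in a different order or of a different length, and a model trained on the train vectors could not interpret them.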

Training files directory structure

While create_features.py and vectorize.py let you choose the input and output directories, not all training files are this flexible. They require the data to be structured as follows.

For the features it will be:

features/SubtaskA/dev_monolingual/dep-tags.json

Where features is the output directory, SubtaskA is the subtask, dev_monolingual is the dataset and dep-tags is the feature.

For the vectors it will be:

vectors/SubtaskA/dev_monolingual/dep-tags/vectors.npy

Where vectors is the output directory, SubtaskA is the subtask, dev_monolingual is the dataset, dep-tags is the feature and vectors.npy is the vectorized feature.

This structure is needed for the run.py and test.py files. For the model.py file you can specify the input directories for both the training and test vectors, as well as for the training and test data, but the internal structure of the data directory should be the same as provided by the shared task organizers.
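The two path conventions above can be written down as small helpers, which makes it easy to check whether a given feature or vector file is where run.py and test.py expect it. The function names feature_path and vector_path are hypothetical and not part of this repository.

```python
from pathlib import Path


def feature_path(subtask, dataset, feature, root="features"):
    """Path to a feature file, e.g. features/SubtaskA/dev_monolingual/dep-tags.json."""
    return Path(root) / subtask / dataset / f"{feature}.json"


def vector_path(subtask, dataset, feature, root="vectors"):
    """Path to a vector file, e.g. vectors/SubtaskA/dev_monolingual/dep-tags/vectors.npy."""
    return Path(root) / subtask / dataset / feature / "vectors.npy"


if __name__ == "__main__":
    for path in (
        feature_path("SubtaskA", "dev_monolingual", "dep-tags"),
        vector_path("SubtaskA", "dev_monolingual", "dep-tags"),
    ):
        # Report whether each expected file exists relative to the current directory.
        status = "found" if path.is_file() else "missing"
        print(f"{path.as_posix()}: {status}")
```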

Models with features used for submission

Subtask A

The classifier models

Classifier	Features	Accuracy
SVM	tense - voice	0.687
SVM	tense - voice - ratio-pron-ne	0.699
SVM	tense - voice - sentence-similarity - ratio-pron-ne	0.698

The neural network models

Model options	Features	Accuracy
3 - 12e - b64 - l0.0001	tense - voice - ratio-pron-ne	0.72
3 - 8e - b32 - l0.0001	tense - voice - ratio-pron-ne - sentence-similarity	0.701
3 - 10e - b64 - l0.0001	tense - voice - sentence-similarity	0.697

Subtask B

The classifier models

Classifier	Features	Accuracy
SVM	tense - voice - sentiment - pos-tags - dep-tags - sentence-similarity - ratio-pron-ne	0.592

The neural network models

Model options	Features	Accuracy
2 - 48e - b32 - l0.0005	tense - voice - pos-tags - dep-tags - ratio-pron-ne	0.62
5 - 48e - b32 - l0.0005	tense - voice - pos-tags - dep-tags - sentence-similarity - ratio-pron-ne	0.625
