
German Twitter COVID-19 Dataset

We present a new dataset with multi-layer annotations for German COVID-19-related tweets. The dataset includes 643 tweets extracted during the pandemic (2020-03-01 to 2022-09-01) using the twint library. The scripts for extracting the tweets can be found in the folder data_extraction.

The annotations comprise two pipelines: named entities and topic-based credibility annotations. The dataset was annotated in the INCEpTION tool by three linguists. Named entity annotations are conducted at token level; the annotations of the credibility pipeline are conducted at tweet level.

Annotation Scheme

We created an annotation pipeline to enhance information extraction from COVID-19-specific social media content. As the first step, the topic-based credibility annotations filter out tweets that are uninformative, unrelated or non-credible. In the second step, we carry out domain-adapted named entity annotations to extract the main entities from the tweets, using typical entity types such as names of persons, organizations and locations as well as COVID-19-specific entities in the NER scheme. The annotation scheme is outlined below; the corresponding classes are listed in the tables in the section Dataset statistics.

  • Topic-based Credibility Annotations
    • Informativeness
    • Topic
    • Credibility
  • Named Entities

For more information about the development of the annotation scheme, the definitions of the semantic classes, examples etc., see the annotation guidelines.

Dataset statistics

You can use the notebook Count labels.ipynb to count the labels and find the class distribution in the dataset.
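If you prefer a plain script to the notebook, the counting step can be sketched as follows. This is a minimal sketch: `label_distribution` is a hypothetical helper, and the dataset is assumed to be a JSON list of tweet units with the fields shown in the Format section.

```python
from collections import Counter

def label_distribution(records, field):
    """Count the values of `field` across tweet units.

    Returns {label: (count, percent)} over all records.
    """
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    return {label: (n, round(100 * n / total, 2)) for label, n in counts.items()}

# Usage on the released file would be, e.g.:
#   data = json.load(open("data/Twitter_COVID19.json"))
#   label_distribution(data, "topic")
sample = [{"topic": "vaccination"}, {"topic": "vaccination"}, {"topic": "none"}]
print(label_distribution(sample, "topic"))
```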

Topic-based Credibility Annotations

| Task/Classes | # | % |
|---|---:|---:|
| Informativeness | | |
| informative | 418 | 65.01 |
| personal_experience | 55 | 8.55 |
| none | 170 | 26.44 |
| Topic | | |
| case_report | 182 | 28.30 |
| consequences | 30 | 4.67 |
| governm_decisions | 110 | 17.11 |
| risk_reduction | 15 | 2.33 |
| vaccination | 95 | 14.77 |
| none | 211 | 32.81 |
| Credibility | | |
| credible | 394 | 61.28 |
| non-credible | 10 | 1.56 |
| none | 239 | 37.17 |
| total | 643 | 100 |

Named Entities

| Classes | # | % |
|---|---:|---:|
| disease | 697 | 19.41 |
| location | 436 | 12.14 |
| location_body | 34 | 0.95 |
| measure | 526 | 14.65 |
| mortality | 241 | 6.71 |
| organization | 315 | 8.77 |
| person | 205 | 5.71 |
| quantifiers | 405 | 11.28 |
| symptom | 369 | 10.28 |
| time | 363 | 10.11 |
| total | 3591 | 100 |

Format

The dataset is in JSON format. A unit of the dataset is a tweet with the following fields: title, text, tokens, named_entity_recognition, relations, informativeness, topic and credibility.

```json
{
    "title": "tweet_1519969762578780160.txt",
    "text": "",
    "tokens": [],
    "named_entity_recognition": [
        "O", "O", "O", "O", "O",
        "B-DISEASE",
        "O", "O", "O", "O",
        "B-DISEASE",
        "O", "O", "O",
        "B-ORGANIZATION",
        "O",
        "B-SYMPTOM",
        "O", "O", "O",
        "B-MEASURE",
        "O",
        "B-SYMPTOM",
        "O", "O", "O", "O", "O",
        "B-MEASURE",
        "O",
        "B-QUANTIFIERS", "I-QUANTIFIERS", "I-QUANTIFIERS", "I-QUANTIFIERS", "I-QUANTIFIERS",
        "O",
        "B-MEASURE",
        "B-QUANTIFIERS", "I-QUANTIFIERS", "I-QUANTIFIERS", "I-QUANTIFIERS",
        "O", "O", "O", "O", "O", "O", "O"
    ],
    "informativeness": "Informative",
    "topic": "Vaccination",
    "credibility": "credible"
}
```

Due to Twitter's content redistribution policy, we cannot publish the full dataset. In the folder data you will find an anonymized version in which text and tokens are empty (as in the example above).
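Given this layout, where tokens and named_entity_recognition are parallel lists per unit, aligning tokens with their BIO tags could look like the following hypothetical helper (on the anonymized release it yields nothing, since the token lists are empty):

```python
def to_token_tag_pairs(unit):
    """Zip a unit's tokens with its BIO tags.

    Raises if the two lists disagree in length (only checked when
    tokens are present, since the anonymized data ships them empty).
    """
    tokens = unit.get("tokens", [])
    tags = unit.get("named_entity_recognition", [])
    if tokens and len(tokens) != len(tags):
        raise ValueError("token/tag length mismatch in %s" % unit.get("title"))
    return list(zip(tokens, tags))

# Illustrative unit with non-empty text fields:
demo = {
    "title": "tweet_demo.txt",
    "tokens": ["Corona", "in", "Berlin"],
    "named_entity_recognition": ["B-DISEASE", "O", "B-LOCATION"],
}
print(to_token_tag_pairs(demo))  # [('Corona', 'B-DISEASE'), ('in', 'O'), ('Berlin', 'B-LOCATION')]
```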

Evaluation strategy and Metrics

Considering the size of the dataset, we decided to use 5-fold cross-validation. For each fold, the dataset was split into train and validation sets (80% and 20%, ca. 514 and 129 tweets). To evaluate performance, we used precision, recall and F1, averaged with weights given by the number of true instances per label. For the NER task, we measured results using the seqeval metric for sequence labelling evaluation. seqeval reports scores per entity class (as in the result tables below), not per chunk tag (e.g. B-PERSON, I-PERSON).
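seqeval merges B-/I- tags into entity spans before scoring. A minimal sketch of that merging step (not the library's actual implementation; stray I- tags without a preceding B- are simply ignored here):

```python
def extract_entities(tags):
    """Merge BIO tags into (class, start, end) entity spans, seqeval-style.

    `end` is exclusive; a trailing "O" sentinel flushes the last open span.
    """
    entities, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):
        if tag.startswith("I-") and label == tag[2:]:
            continue                            # current span continues
        if label is not None:
            entities.append((label, start, i))  # close the open span
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]           # open a new span
    return entities

tags = ["B-MEASURE", "B-QUANTIFIERS", "I-QUANTIFIERS", "O"]
print(extract_entities(tags))  # [('MEASURE', 0, 1), ('QUANTIFIERS', 1, 3)]
```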

Experiments

We focused on fine-tuning BERT-based encoder models and Llama decoder models from Hugging Face.

Encoders

Selected models and Hyperparameters

In our experiments for text and token classification, we used Hugging Face models. We filtered out models for special domains such as legal, medical, financial, or hospitality. Overall, we selected five models with the best results for the second test phase:

  • bert-base-german-cased
  • bert-base-multilingual-cased
  • bert-base-multilingual-uncased
  • Twitter/twhin-bert-base
  • Twitter/twhin-bert-large

For each model and task, we ran 200 trials with the goal of maximizing the F1 measure. The hyperparameter spaces for text and token classification are defined as follows:

```python
hp_textclassification = {
    "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True),
    "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 15),
    "seed": trial.suggest_int("seed", 1, 40),
    "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [4, 8, 16, 32]),
    "weight_decay": trial.suggest_float("weight_decay", 1e-12, 1e-1, log=True),
    "adam_epsilon": trial.suggest_float("adam_epsilon", 1e-10, 1e-6, log=True),
    "gradient_accumulation_steps": trial.suggest_categorical("gradient_accumulation_steps", [1, 2, 4, 8, 16]),
}

hp_tokenclassification = {
    "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True),
    "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 30),
    "seed": trial.suggest_int("seed", 1, 40),
    "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [4, 8, 16, 32]),
    "weight_decay": trial.suggest_float("weight_decay", 1e-12, 1e-1, log=True),
    "adam_epsilon": trial.suggest_float("adam_epsilon", 1e-10, 1e-6, log=True),
    "gradient_accumulation_steps": trial.suggest_categorical("gradient_accumulation_steps", [1, 2, 4, 8, 16]),
}
```
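The spaces above use the Optuna trial API; log=True requests log-uniform sampling, so values spread evenly across orders of magnitude rather than clustering near the upper bound. A library-free sketch of what such sampling does for the learning-rate range (illustration only, not part of the search scripts):

```python
import math
import random

def suggest_log_uniform(low, high, rng=random):
    """Sample log-uniformly from [low, high], mimicking
    trial.suggest_float(..., log=True): uniform in log-space,
    then mapped back with exp."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

# Learning rates drawn across 1e-5 .. 1e-2 land in every decade,
# which a plain uniform draw over the same interval would not do.
samples = [suggest_log_uniform(1e-5, 1e-2) for _ in range(5)]
print(samples)
```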

Results

Per-class results for the 5-fold cross-validation can be found in results.

A. Informativeness

| | bert-base-german-cased | bert-base-multilingual-cased | bert-base-multilingual-uncased | Twitter/twhin-bert-base | Twitter/twhin-bert-large |
|---|---:|---:|---:|---:|---:|
| precision | 82.34 | 82.11 | 83.53 | 86.09 | 75.80 |
| recall | 84.29 | 82.42 | 83.82 | 86.15 | 81.80 |
| f1-score | 82.25 | 82.09 | 83.48 | 85.83 | 78.56 |
| support | 128.6 | 128.6 | 128.6 | 128.6 | 128.6 |

B. Topic

| | bert-base-german-cased | bert-base-multilingual-cased | bert-base-multilingual-uncased | Twitter/twhin-bert-base | Twitter/twhin-bert-large |
|---|---:|---:|---:|---:|---:|
| precision | 79.86 | 78.06 | 77.14 | 78.46 | 81.29 |
| recall | 79.15 | 77.45 | 77.28 | 78.38 | 80.24 |
| f1-score | 78.94 | 77.25 | 76.78 | 77.77 | 80.07 |
| support | 128.6 | 128.6 | 128.6 | 128.6 | 128.6 |

C. Credibility

| | bert-base-german-cased | bert-base-multilingual-cased | bert-base-multilingual-uncased | Twitter/twhin-bert-base | Twitter/twhin-bert-large |
|---|---:|---:|---:|---:|---:|
| precision | 81.07 | 79.47 | 79.23 | 81.76 | 82.16 |
| recall | 82.42 | 78.99 | 78.85 | 82.42 | 76.63 |
| f1-score | 81.59 | 78.98 | 78.65 | 81.53 | 73.72 |
| support | 128.6 | 128.6 | 128.6 | 128.6 | 128.6 |

D. Named Entity Recognition

| | bert-base-german-cased | bert-base-multilingual-cased | bert-base-multilingual-uncased | Twitter/twhin-bert-base | Twitter/twhin-bert-large |
|---|---:|---:|---:|---:|---:|
| precision | 79.94 | 75.59 | 75.60 | 79.21 | 80.83 |
| recall | 81.85 | 80.75 | 80.99 | 83.17 | 84.10 |
| f1-score | 80.67 | 77.81 | 78.02 | 80.95 | 82.27 |
| support | 1628.8 | 1631.2 | 1497.6 | 1547.8 | 1547.8 |

Decoders

Selected models and Hyperparameters

From Hugging Face, we also selected two models with the best results for the second test phase:

  • Llama-3.1-8B
  • Llama3-German-8B

For each model and task, we performed parameter-efficient fine-tuning using QLoRA. To avoid overfitting while evaluating the models on equal terms, we applied early stopping and left the other hyperparameters at the following defaults:

```python
hyperparameters = dict(
    learning_rate=4e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=20,
    eval_steps=50,
    save_steps=50,
    metric_for_best_model="eval_loss",
    weight_decay=0.01,
    eval_strategy="steps",
)
```
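The early stopping behind this setup (evaluate every 50 steps, stop when eval_loss stops improving) can be sketched generically as follows; the patience value is an assumption for illustration, not taken from the repository:

```python
def early_stop_index(eval_losses, patience=3):
    """Return the index of the evaluation at which training stops.

    Stops after `patience` consecutive evaluations without a new best
    loss; otherwise runs through all evaluations.
    """
    best, since_best = float("inf"), 0
    for i, loss in enumerate(eval_losses):
        if loss < best:
            best, since_best = loss, 0      # new best: reset the counter
        else:
            since_best += 1
            if since_best >= patience:
                return i                    # patience exhausted: stop here
    return len(eval_losses) - 1

# Loss plateaus after step 1 (index 1), so training stops at index 4.
print(early_stop_index([1.0, 0.8, 0.9, 0.85, 0.95], patience=3))  # 4
```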

Results

Per-class results for the 5-fold cross-validation can be found in results.

A. Informativeness

| | Llama-3.1-8B | Llama3-German-8B |
|---|---:|---:|
| precision | 80.48 | 80.71 |
| recall | 79.78 | 80.56 |
| f1-score | 79.99 | 80.43 |
| support | 128.6 | 128.6 |

B. Topic

| | Llama-3.1-8B | Llama3-German-8B |
|---|---:|---:|
| precision | 73.45 | 75.97 |
| recall | 71.06 | 74.79 |
| f1-score | 71.27 | 74.61 |
| support | 128.6 | 128.6 |

C. Credibility

| | Llama-3.1-8B | Llama3-German-8B |
|---|---:|---:|
| precision | 78.17 | 80.14 |
| recall | 77.92 | 80.71 |
| f1-score | 77.55 | 80.37 |
| support | 128.6 | 128.6 |

D. Named Entity Recognition

| | Llama-3.1-8B | Llama3-German-8B |
|---|---:|---:|
| precision | 62.85 | 63.43 |
| recall | 63.90 | 63.09 |
| f1-score | 62.93 | 62.88 |
| support | 2077.60 | 2077.60 |

Runs

First, split the dataset:

```
python3 split_dataset_5cv.py data/Twitter_COVID19.json
```

Alternatively, you can use a Jupyter notebook.
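For illustration, the 80%/20% fold arithmetic behind such a split (ca. 514/129 of 643 tweets per fold) can be sketched as follows; this is a generic splitter, not the actual split_dataset_5cv.py:

```python
def five_fold_indices(n, k=5):
    """Partition range(n) into k contiguous validation folds.

    Returns a list of (train_indices, val_indices) pairs; fold sizes
    differ by at most one when n is not divisible by k.
    """
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        folds.append((train, val))
        start += size
    return folds

# For the 643 tweets: validation folds of 129/129/129/128/128,
# leaving ca. 514 tweets for training in each fold.
folds = five_fold_indices(643)
print([len(v) for _, v in folds])
```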

For experiments, define an output and a cache directory in config.py.

Hyperparameter search

For hyperparameter search, run:

```
python3 textclassification_{TASK_NAME}_search.py [MODEL_NAME]
python3 tokenclassification_search.py [MODEL_NAME]
```

The hyperparameters of the best runs for the selected models are listed in best_hp.py.

Fine-tuning

To fine-tune one or more models for text classification, run:

```
python3 textclassification.py TASK_NAME [MODEL_NAME(S)]
python3 llm_textclassification.py TASK_NAME [MODEL_NAME(S)]
```

for example:

```
python3 textclassification.py topic bert-base-german-cased bert-base-multilingual-cased bert-base-multilingual-uncased Twitter/twhin-bert-base Twitter/twhin-bert-large
```

For token classification, run:

```
python3 tokenclassification.py [MODEL_NAME(S)]
python3 llm_tokenclassification.py [MODEL_NAME(S)]
```

for example:

```
python3 tokenclassification.py bert-base-german-cased bert-base-multilingual-cased bert-base-multilingual-uncased Twitter/twhin-bert-base Twitter/twhin-bert-large
```
