We present a new dataset with multi-layer annotations for German COVID-19-related tweets. The dataset contains 643 tweets extracted during the pandemic (from 2020-03-01 to 2022-09-01) using the twint library. The scripts for extracting the tweets can be found in the folder data_extraction.
The annotations comprise two pipelines: named entities and topic-based credibility annotations. The dataset was annotated in the INCEpTION tool by three linguists. Named entity annotations are conducted at token level, while the annotations of the credibility pipeline are conducted at tweet level.
We created this annotation pipeline to improve information extraction from COVID-19-specific social media content. As a first step, the topic-based credibility annotations aim to filter out tweets that are uninformative, unrelated or non-credible. In a second step, we carry out domain-adapted named entity annotations to extract the main entities from the tweets; the NER scheme covers typical entity types such as names of persons, organizations and locations as well as COVID-19-specific entities. The annotation scheme is outlined below. The corresponding classes are listed in the tables in the section Dataset statistics.
- Topic-based Credibility Annotations
- Informativeness
- Topic
- Credibility
- Named Entities
For more information about the development of the annotation scheme, the definition of the semantic classes, examples etc., see the annotation guidelines.
You can use the notebook Count labels.ipynb to count the labels and obtain the distribution of classes in the dataset.
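If you prefer not to use the notebook, the same counts can be computed in a few lines. This is a minimal sketch assuming the tweet-level fields shown in the JSON example below (the function name is illustrative, not part of the repository):

```python
from collections import Counter

def label_distribution(tweets, field):
    """Count the class labels of one tweet-level field (e.g. "topic")
    and return each label with its absolute count and percentage."""
    counts = Counter(tweet[field] for tweet in tweets)
    total = sum(counts.values())
    return {label: (n, round(100 * n / total, 2)) for label, n in counts.items()}
```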
| Task/Classes | # | % |
|---|---|---|
| Informativeness | | |
| informative | 418 | 65.01 |
| personal_experience | 55 | 8.55 |
| none | 170 | 26.44 |
| Topic | | |
| case_report | 182 | 28.30 |
| consequences | 30 | 4.67 |
| governm_decisions | 110 | 17.11 |
| risk_reduction | 15 | 2.33 |
| vaccination | 95 | 14.77 |
| none | 211 | 32.81 |
| Credibility | | |
| credible | 394 | 61.28 |
| non-credible | 10 | 1.56 |
| none | 239 | 37.17 |
| total | 643 | 100 |
| Classes | # | % |
|---|---|---|
| disease | 697 | 19.41 |
| location | 436 | 12.14 |
| location_body | 34 | 0.95 |
| measure | 526 | 14.65 |
| mortality | 241 | 6.71 |
| organization | 315 | 8.77 |
| person | 205 | 5.71 |
| quantifiers | 405 | 11.28 |
| symptom | 369 | 10.28 |
| time | 363 | 10.11 |
| total | 3591 | 100 |
The dataset is in JSON format. A unit of the dataset is a tweet with the following fields: title, text, tokens, named_entity_recognition, relations, informativeness, topic and credibility.
```json
{
  "title": "tweet_1519969762578780160.txt",
  "text": "",
  "tokens": [],
  "named_entity_recognition": [
    "O", "O", "O", "O", "O", "B-DISEASE", "O", "O", "O", "O",
    "B-DISEASE", "O", "O", "O", "B-ORGANIZATION", "O", "B-SYMPTOM", "O", "O", "O",
    "B-MEASURE", "O", "B-SYMPTOM", "O", "O", "O", "O", "O", "B-MEASURE", "O",
    "B-QUANTIFIERS", "I-QUANTIFIERS", "I-QUANTIFIERS", "I-QUANTIFIERS", "I-QUANTIFIERS",
    "O", "B-MEASURE", "B-QUANTIFIERS", "I-QUANTIFIERS", "I-QUANTIFIERS",
    "I-QUANTIFIERS", "O", "O", "O", "O", "O", "O", "O"
  ],
  "informativeness": "Informative",
  "topic": "Vaccination",
  "credibility": "credible"
}
```

Due to the redistribution policy of Twitter content, we cannot publish the full dataset. In the folder data you will find an anonymized dataset in which text and tokens are empty (as in the example above).
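For downstream processing, the BIO tag sequence in named_entity_recognition can be converted into labelled token spans. A minimal sketch (the helper name is illustrative and not part of the repository):

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into (label, start, end) spans,
    with end exclusive, so tags[start:end] covers the entity."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):            # a new entity always starts here
            if label is not None:
                spans.append((label, start, i))
            start, label = i, tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            continue                        # entity continues
        else:                               # "O" or an inconsistent "I-" tag
            if label is not None:
                spans.append((label, start, i))
            start, label = None, None
    if label is not None:                   # entity running until the end
        spans.append((label, start, len(tags)))
    return spans
```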
Considering the size of the dataset, we decided to use 5-fold cross-validation. The dataset was split into training and validation sets (80% and 20%, i.e. ca. 514 and 129 tweets per fold). To evaluate performance, we used precision, recall and the F1 measure, averaged with weights given by the number of true instances per label. For the NER task, we measured results using the seqeval metric for sequence labelling evaluation. This metric calculates scores per entity class (as shown in the results tables), not per individual tag (e.g. B-PERSON, I-PERSON).
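The weighted averaging described above can be illustrated with a small helper (a sketch for clarity only; in practice, libraries such as scikit-learn and seqeval provide these metrics):

```python
from collections import Counter

def weighted_prf(y_true, y_pred):
    """Per-class precision/recall/F1, averaged with weights given by
    each class's support (number of true instances)."""
    tp, fp, support = Counter(), Counter(), Counter(y_true)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1          # predicted p, but true label was different
    total = len(y_true)
    prec = rec = f1 = 0.0
    for label in sorted(support):
        predicted = tp[label] + fp[label]
        p = tp[label] / predicted if predicted else 0.0
        r = tp[label] / support[label]
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support[label] / total
        prec += weight * p
        rec += weight * r
        f1 += weight * f
    return round(prec, 4), round(rec, 4), round(f1, 4)
```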
We focused on fine-tuning BERT-based models and Llama models from Hugging Face.
In our experiments for text and token classification, we used Hugging Face models, filtering out models for special domains such as legal, medical, financial and hospitality. Overall, we selected the five models with the best results for the second test phase:
- bert-base-german-cased
- bert-base-multilingual-cased
- bert-base-multilingual-uncased
- Twitter/twhin-bert-base
- Twitter/twhin-bert-large
For each model and task, we ran 200 trials with the goal of maximizing the F1 measure. For text and token classification, the hyperparameter search space is defined as follows:
```python
hp_textclassification = {
    "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True),
    "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 15),
    "seed": trial.suggest_int("seed", 1, 40),
    "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [4, 8, 16, 32]),
    "weight_decay": trial.suggest_float("weight_decay", 1e-12, 1e-1, log=True),
    "adam_epsilon": trial.suggest_float("adam_epsilon", 1e-10, 1e-6, log=True),
    "gradient_accumulation_steps": trial.suggest_categorical("gradient_accumulation_steps", [1, 2, 4, 8, 16]),
}

hp_tokenclassification = {
    "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True),
    "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 30),
    "seed": trial.suggest_int("seed", 1, 40),
    "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [4, 8, 16, 32]),
    "weight_decay": trial.suggest_float("weight_decay", 1e-12, 1e-1, log=True),
    "adam_epsilon": trial.suggest_float("adam_epsilon", 1e-10, 1e-6, log=True),
    "gradient_accumulation_steps": trial.suggest_categorical("gradient_accumulation_steps", [1, 2, 4, 8, 16]),
}
```

Results for the 5-fold cross-validation (including results per class) can be found in results.
| Informativeness | bert-base-german-cased | bert-base-multilingual-cased | bert-base-multilingual-uncased | Twitter/twhin-bert-base | Twitter/twhin-bert-large |
|---|---|---|---|---|---|
| precision | 82.34 | 82.11 | 83.53 | 86.09 | 75.80 |
| recall | 84.29 | 82.42 | 83.82 | 86.15 | 81.80 |
| f1-score | 82.25 | 82.09 | 83.48 | 85.83 | 78.56 |
| support | 128.6 | 128.6 | 128.6 | 128.6 | 128.6 |
| Topic | bert-base-german-cased | bert-base-multilingual-cased | bert-base-multilingual-uncased | Twitter/twhin-bert-base | Twitter/twhin-bert-large |
|---|---|---|---|---|---|
| precision | 79.86 | 78.06 | 77.14 | 78.46 | 81.29 |
| recall | 79.15 | 77.45 | 77.28 | 78.38 | 80.24 |
| f1-score | 78.94 | 77.25 | 76.78 | 77.77 | 80.07 |
| support | 128.6 | 128.6 | 128.6 | 128.6 | 128.6 |
| Credibility | bert-base-german-cased | bert-base-multilingual-cased | bert-base-multilingual-uncased | Twitter/twhin-bert-base | Twitter/twhin-bert-large |
|---|---|---|---|---|---|
| precision | 81.07 | 79.47 | 79.23 | 81.76 | 82.16 |
| recall | 82.42 | 78.99 | 78.85 | 82.42 | 76.63 |
| f1-score | 81.59 | 78.98 | 78.65 | 81.53 | 73.72 |
| support | 128.6 | 128.6 | 128.6 | 128.6 | 128.6 |
| Named entities | bert-base-german-cased | bert-base-multilingual-cased | bert-base-multilingual-uncased | Twitter/twhin-bert-base | Twitter/twhin-bert-large |
|---|---|---|---|---|---|
| precision | 79.94 | 75.59 | 75.60 | 79.21 | 80.83 |
| recall | 81.85 | 80.75 | 80.99 | 83.17 | 84.10 |
| f1-score | 80.67 | 77.81 | 78.02 | 80.95 | 82.27 |
| support | 1628.8 | 1631.2 | 1497.6 | 1547.8 | 1547.8 |
From Hugging Face, we also selected the two models with the best results for the second test phase:
- Llama-3.1-8B
- Llama3-German-8B

For each model and task, we performed parameter-efficient fine-tuning using QLoRA. To avoid overfitting while still evaluating the models on equal terms, we applied early stopping and left the other hyperparameters at their defaults. The default hyperparameters are defined as follows:
```python
hyperparameters = dict(
    learning_rate=4e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=20,
    eval_steps=50,
    save_steps=50,
    metric_for_best_model="eval_loss",
    weight_decay=0.01,
    eval_strategy="steps",
)
```

Results for the 5-fold cross-validation (including results per class) can be found in results.
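The QLoRA setup could look roughly like the following sketch. It assumes the Hugging Face transformers/peft/bitsandbytes stack; the concrete values (r, lora_alpha, num_labels, dropout) are illustrative assumptions, not our actual configuration:

```python
# Sketch of parameter-efficient fine-tuning with QLoRA:
# the base model is loaded with 4-bit quantized weights, and only the
# small LoRA adapter matrices are trained on top of it.
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # quantize base weights to 4 bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    num_labels=6,                            # illustrative: e.g. the topic classes
    quantization_config=bnb_config,
)
lora_config = LoraConfig(
    r=16,                                    # illustrative adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_config)   # only the adapters are trainable
```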
| Informativeness | Llama-3.1-8B | Llama3-German-8B |
|---|---|---|
| precision | 80.48 | 80.71 |
| recall | 79.78 | 80.56 |
| f1-score | 79.99 | 80.43 |
| support | 128.6 | 128.6 |
| Topic | Llama-3.1-8B | Llama3-German-8B |
|---|---|---|
| precision | 73.45 | 75.97 |
| recall | 71.06 | 74.79 |
| f1-score | 71.27 | 74.61 |
| support | 128.6 | 128.6 |
| Credibility | Llama-3.1-8B | Llama3-German-8B |
|---|---|---|
| precision | 78.17 | 80.14 |
| recall | 77.92 | 80.71 |
| f1-score | 77.55 | 80.37 |
| support | 128.6 | 128.6 |
| Named entities | Llama-3.1-8B | Llama3-German-8B |
|---|---|---|
| precision | 62.85 | 63.43 |
| recall | 63.90 | 63.09 |
| f1-score | 62.93 | 62.88 |
| support | 2077.60 | 2077.60 |
First, split the dataset with `python3 split_dataset_5cv.py data/Twitter_COVID19.json`. Alternatively, you can use a Jupyter notebook.
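The splitting step can be sketched as follows. This is a simplified stand-in for split_dataset_5cv.py, which may differ in detail (e.g. stratification or the random seed):

```python
import random

def five_fold_splits(items, seed=42, k=5):
    """Shuffle the items once, then yield a (train, validation) pair
    for each of the k folds (round-robin assignment of indices)."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = [items[j] for j in folds[i]]
        train = [items[j] for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, val
```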
For experiments, define an output and a cache directory in config.py.
For the hyperparameter search, run:

```shell
python3 textclassification_{TASK_NAME}_search.py [MODEL_NAME]
python3 tokenclassification_search.py [MODEL_NAME]
```
The hyperparameters of the best runs for the selected models are listed in best_hp.py.
To fine-tune one or more models for text or token classification, run:

```shell
python3 textclassification.py TASK_NAME [MODEL_NAME(S)]
python3 llm_textclassification.py TASK_NAME [MODEL_NAME(S)]
python3 tokenclassification.py [MODEL_NAME(S)]
python3 llm_tokenclassification.py [MODEL_NAME(S)]
```

For example:

```shell
python3 textclassification.py topic bert-base-german-cased bert-base-multilingual-cased bert-base-multilingual-uncased Twitter/twhin-bert-base Twitter/twhin-bert-large
python3 tokenclassification.py bert-base-german-cased bert-base-multilingual-cased bert-base-multilingual-uncased Twitter/twhin-bert-base Twitter/twhin-bert-large
```