We present a new dataset with multi-layer annotations for German COVID-19-related tweets. The dataset contains 643 tweets extracted during the pandemic (from 2020-03-01 to 2022-09-01) using the twint library. The scripts for extracting the tweets can be found in the folder data_extraction.
The annotations comprise two pipelines: named entities and topic-based credibility annotations. The dataset was annotated in the INCEpTION tool by three linguists. Named entity annotations are conducted at token level, while the annotations of the credibility pipeline are conducted at tweet level.
We created this annotation pipeline to improve information extraction from COVID-19-specific social media content. As a first step, the topic-based credibility annotations aim to filter out tweets that are uninformative, unrelated or non-credible. In a second step, we carry out domain-adapted named entity annotations to extract the main entities from the tweets; the NER scheme covers typical entity types such as names of persons, organizations and locations as well as COVID-19-specific entities. The annotation scheme is outlined below. The corresponding classes are listed in the tables in the section Dataset statistics.
- Topic-based Credibility Annotations
- Informativeness
- Topic
- Credibility
- Named Entities
For more information about the development of the annotation scheme, the definition of the semantic classes, examples etc., see the annotation guidelines.
You can use the notebook Count labels.ipynb to count the labels and obtain the distribution of classes in the dataset.
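If you prefer not to use the notebook, the same counts can be computed in a few lines. This is a minimal sketch assuming the tweet-level fields shown in the JSON example below (the function name is illustrative, not part of the repository):

```python
from collections import Counter

def label_distribution(tweets, field):
    """Count the class labels of one tweet-level field (e.g. "topic")
    and return each label with its absolute count and percentage."""
    counts = Counter(tweet[field] for tweet in tweets)
    total = sum(counts.values())
    return {label: (n, round(100 * n / total, 2)) for label, n in counts.items()}
```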
| Task/Classes | # | % |
|---|---|---|
| Informativeness | | |
| informative | 418 | 65.01 |
| personal_experience | 55 | 8.55 |
| none | 170 | 26.44 |
| Topic | | |
| case_report | 182 | 28.30 |
| consequences | 30 | 4.67 |
| governm_decisions | 110 | 17.11 |
| risk_reduction | 15 | 2.33 |
| vaccination | 95 | 14.77 |
| none | 211 | 32.81 |
| Credibility | | |
| credible | 394 | 61.28 |
| non-credible | 10 | 1.56 |
| none | 239 | 37.17 |
| total | 643 | 100 |
| Classes | # | % |
|---|---|---|
| disease | 697 | 19.41 |
| location | 436 | 12.14 |
| location_body | 34 | 0.95 |
| measure | 526 | 14.65 |
| mortality | 241 | 6.71 |
| organization | 315 | 8.77 |
| person | 205 | 5.71 |
| quantifiers | 405 | 11.28 |
| symptom | 369 | 10.28 |
| time | 363 | 10.11 |
| total | 3591 | 100 |
The dataset is in JSON format. A unit of the dataset is a tweet with the following fields: title, text, tokens, named_entity_recognition, relations, informativeness, topic and credibility.
```json
{
  "title": "tweet_1519969762578780160.txt",
  "text": "",
  "tokens": [],
  "named_entity_recognition": [
    "O", "O", "O", "O", "O", "B-DISEASE", "O", "O", "O", "O",
    "B-DISEASE", "O", "O", "O", "B-ORGANIZATION", "O", "B-SYMPTOM", "O", "O", "O",
    "B-MEASURE", "O", "B-SYMPTOM", "O", "O", "O", "O", "O", "B-MEASURE", "O",
    "B-QUANTIFIERS", "I-QUANTIFIERS", "I-QUANTIFIERS", "I-QUANTIFIERS", "I-QUANTIFIERS",
    "O", "B-MEASURE", "B-QUANTIFIERS", "I-QUANTIFIERS", "I-QUANTIFIERS",
    "I-QUANTIFIERS", "O", "O", "O", "O", "O", "O", "O"
  ],
  "informativeness": "Informative",
  "topic": "Vaccination",
  "credibility": "credible"
}
```

Due to the redistribution policy of Twitter content, we cannot publish the full dataset. In the folder data you will find an anonymized dataset in which text and tokens are empty (as in the example above).
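For downstream processing, the BIO tag sequence in named_entity_recognition can be converted into labelled token spans. A minimal sketch (the helper name is illustrative and not part of the repository):

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into (label, start, end) spans,
    with end exclusive, so tags[start:end] covers the entity."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):            # a new entity always starts here
            if label is not None:
                spans.append((label, start, i))
            start, label = i, tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            continue                        # entity continues
        else:                               # "O" or an inconsistent "I-" tag
            if label is not None:
                spans.append((label, start, i))
            start, label = None, None
    if label is not None:                   # entity running until the end
        spans.append((label, start, len(tags)))
    return spans
```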
Considering the size of the dataset, we decided to use 5-fold cross-validation. The dataset was split into training and validation sets (80% and 20%, i.e. ca. 514 and 129 tweets per fold). To evaluate performance, we used precision, recall and the F1 measure, averaged with weights given by the number of true instances per label. For the NER task, we measured results using the seqeval metric for sequence labelling evaluation. This metric calculates scores per entity class (as shown in the results tables), not per individual tag (e.g. B-PERSON, I-PERSON).
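The weighted averaging described above can be illustrated with a small helper (a sketch for clarity only; in practice, libraries such as scikit-learn and seqeval provide these metrics):

```python
from collections import Counter

def weighted_prf(y_true, y_pred):
    """Per-class precision/recall/F1, averaged with weights given by
    each class's support (number of true instances)."""
    tp, fp, support = Counter(), Counter(), Counter(y_true)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1          # predicted p, but true label was different
    total = len(y_true)
    prec = rec = f1 = 0.0
    for label in sorted(support):
        predicted = tp[label] + fp[label]
        p = tp[label] / predicted if predicted else 0.0
        r = tp[label] / support[label]
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support[label] / total
        prec += weight * p
        rec += weight * r
        f1 += weight * f
    return round(prec, 4), round(rec, 4), round(f1, 4)
```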
We focused on fine-tuning BERT-based models and Llama models from Hugging Face.
In our experiments for text and token classification, we used Hugging Face models, filtering out models for special domains such as legal, medical, financial and hospitality. Overall, we selected the five models with the best results for the second test phase:
- bert-base-german-cased
- bert-base-multilingual-cased
- bert-base-multilingual-uncased
- Twitter/twhin-bert-base
- Twitter/twhin-bert-large
For each model and task, we ran 200 trials with the goal of maximizing the F1 measure. For text and token classification, the hyperparameter search space is defined as follows:
```python
hp_textclassification = {
    "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True),
    "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 15),
    "seed": trial.suggest_int("seed", 1, 40),
    "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [4, 8, 16, 32]),
    "weight_decay": trial.suggest_float("weight_decay", 1e-12, 1e-1, log=True),
    "adam_epsilon": trial.suggest_float("adam_epsilon", 1e-10, 1e-6, log=True),
    "gradient_accumulation_steps": trial.suggest_categorical("gradient_accumulation_steps", [1, 2, 4, 8, 16]),
}

hp_tokenclassification = {
    "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True),
    "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 30),
    "seed": trial.suggest_int("seed", 1, 40),
    "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [4, 8, 16, 32]),
    "weight_decay": trial.suggest_float("weight_decay", 1e-12, 1e-1, log=True),
    "adam_epsilon": trial.suggest_float("adam_epsilon", 1e-10, 1e-6, log=True),
    "gradient_accumulation_steps": trial.suggest_categorical("gradient_accumulation_steps", [1, 2, 4, 8, 16]),
}
```

Results for the 5-fold cross-validation (including results per class) can be found in results.
| Informativeness | bert-base-german-cased | bert-base-multilingual-cased | bert-base-multilingual-uncased | Twitter/twhin-bert-base | Twitter/twhin-bert-large |
|---|---|---|---|---|---|
| precision | 82.34 | 82.11 | 83.53 | 86.09 | 75.80 |
| recall | 84.29 | 82.42 | 83.82 | 86.15 | 81.80 |
| f1-score | 82.25 | 82.09 | 83.48 | 85.83 | 78.56 |
| support | 128.6 | 128.6 | 128.6 | 128.6 | 128.6 |
| Topic | bert-base-german-cased | bert-base-multilingual-cased | bert-base-multilingual-uncased | Twitter/twhin-bert-base | Twitter/twhin-bert-large |
|---|---|---|---|---|---|
| precision | 79.86 | 78.06 | 77.14 | 78.46 | 81.29 |
| recall | 79.15 | 77.45 | 77.28 | 78.38 | 80.24 |
| f1-score | 78.94 | 77.25 | 76.78 | 77.77 | 80.07 |
| support | 128.6 | 128.6 | 128.6 | 128.6 | 128.6 |
| Credibility | bert-base-german-cased | bert-base-multilingual-cased | bert-base-multilingual-uncased | Twitter/twhin-bert-base | Twitter/twhin-bert-large |
|---|---|---|---|---|---|
| precision | 81.07 | 79.47 | 79.23 | 81.76 | 82.16 |
| recall | 82.42 | 78.99 | 78.85 | 82.42 | 76.63 |
| f1-score | 81.59 | 78.98 | 78.65 | 81.53 | 73.72 |
| support | 128.6 | 128.6 | 128.6 | 128.6 | 128.6 |
| Named entities | bert-base-german-cased | bert-base-multilingual-cased | bert-base-multilingual-uncased | Twitter/twhin-bert-base | Twitter/twhin-bert-large |
|---|---|---|---|---|---|
| precision | 79.94 | 75.59 | 75.60 | 79.21 | 80.83 |
| recall | 81.85 | 80.75 | 80.99 | 83.17 | 84.10 |
| f1-score | 80.67 | 77.81 | 78.02 | 80.95 | 82.27 |
| support | 1628.8 | 1631.2 | 1497.6 | 1547.8 | 1547.8 |
From Hugging Face, we also selected the two models with the best results for the second test phase:
- Llama-3.1-8B
- Llama3-German-8B

For each model and task, we performed parameter-efficient fine-tuning using QLoRA. To avoid overfitting while still evaluating the models on equal terms, we applied early stopping and left the other hyperparameters at their defaults. The default hyperparameters are defined as follows:
```python
hyperparameters = dict(
    learning_rate=4e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=20,
    eval_steps=50,
    save_steps=50,
    metric_for_best_model="eval_loss",
    weight_decay=0.01,
    eval_strategy="steps",
)
```

Results for the 5-fold cross-validation (including results per class) can be found in results.
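The QLoRA setup could look roughly like the following sketch. It assumes the Hugging Face transformers/peft/bitsandbytes stack; the concrete values (r, lora_alpha, num_labels, dropout) are illustrative assumptions, not our actual configuration:

```python
# Sketch of parameter-efficient fine-tuning with QLoRA:
# the base model is loaded with 4-bit quantized weights, and only the
# small LoRA adapter matrices are trained on top of it.
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # quantize base weights to 4 bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    num_labels=6,                            # illustrative: e.g. the topic classes
    quantization_config=bnb_config,
)
lora_config = LoraConfig(
    r=16,                                    # illustrative adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_config)   # only the adapters are trainable
```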
| Informativeness | Llama-3.1-8B | Llama3-German-8B |
|---|---|---|
| precision | 80.48 | 80.71 |
| recall | 79.78 | 80.56 |
| f1-score | 79.99 | 80.43 |
| support | 128.6 | 128.6 |
| Topic | Llama-3.1-8B | Llama3-German-8B |
|---|---|---|
| precision | 73.45 | 75.97 |
| recall | 71.06 | 74.79 |
| f1-score | 71.27 | 74.61 |
| support | 128.6 | 128.6 |
| Credibility | Llama-3.1-8B | Llama3-German-8B |
|---|---|---|
| precision | 78.17 | 80.14 |
| recall | 77.92 | 80.71 |
| f1-score | 77.55 | 80.37 |
| support | 128.6 | 128.6 |
| Named entities | Llama-3.1-8B | Llama3-German-8B |
|---|---|---|
| precision | 62.85 | 63.43 |
| recall | 63.90 | 63.09 |
| f1-score | 62.93 | 62.88 |
| support | 2077.60 | 2077.60 |
First, split the dataset with `python3 split_dataset_5cv.py data/Twitter_COVID19.json`. Alternatively, you can use a Jupyter notebook.
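The splitting step can be sketched as follows. This is a simplified stand-in for split_dataset_5cv.py, which may differ in detail (e.g. stratification or the random seed):

```python
import random

def five_fold_splits(items, seed=42, k=5):
    """Shuffle the items once, then yield a (train, validation) pair
    for each of the k folds (round-robin assignment of indices)."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = [items[j] for j in folds[i]]
        train = [items[j] for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, val
```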
For experiments, define an output and a cache directory in config.py.
For the hyperparameter search, run:

```shell
python3 textclassification_{TASK_NAME}_search.py [MODEL_NAME]
python3 tokenclassification_search.py [MODEL_NAME]
```
The hyperparameters of the best runs for the selected models are listed in best_hp.py.
To fine-tune one or more models for text or token classification, run:

```shell
python3 textclassification.py TASK_NAME [MODEL_NAME(S)]
python3 llm_textclassification.py TASK_NAME [MODEL_NAME(S)]
python3 tokenclassification.py [MODEL_NAME(S)]
python3 llm_tokenclassification.py [MODEL_NAME(S)]
```

For example:

```shell
python3 textclassification.py topic bert-base-german-cased bert-base-multilingual-cased bert-base-multilingual-uncased Twitter/twhin-bert-base Twitter/twhin-bert-large
python3 tokenclassification.py bert-base-german-cased bert-base-multilingual-cased bert-base-multilingual-uncased Twitter/twhin-bert-base Twitter/twhin-bert-large
```