Spanish-Infodemia-ABSA-2020 is a Spanish Twitter corpus for Aspect-Based Sentiment Analysis (ABSA) on infectious diseases in Latin America. The dataset was originally released in 2020 and manually annotated following a voting-based agreement procedure.
This repository provides both the original data distribution used in the associated publication and an updated, FAIR-compliant version designed to facilitate reuse and long-term preservation.
- Introduction of Spanish-Infodemia-ABSA-2020, an ontology-driven corpus for Aspect-Based Sentiment Analysis (ABSA) on infectious diseases in Latin America.
- Tweets about emerging infectious diseases such as Dengue, Zika, Chikungunya, Influenza, and Yellow Fever were collected and annotated.
- The annotation incorporates:
  - Aspect identification based on an infectious-disease ontology
  - Sentiment polarity toward each aspect
  - General tweet polarity
- Includes a domain ontology describing relevant entities, symptoms, risk factors, transmission aspects, and preventive measures.
- Evaluation of multiple ML/NLP models combining:
  - Linguistic features
  - Ontology-driven features
  - Word embeddings (FastText)
- The ontology-enhanced models outperform classical approaches.
- José Antonio García-Díaz, University of Murcia
- Mar Cánovas-García, University of Murcia
- Ricardo Colomo-Palacios, Østfold University College
- Rafael Valencia-García, University of Murcia
Affiliations:
- Departamento de Informática y Sistemas, Universidad de Murcia, Spain
- Faculty of Computer Sciences, Østfold University College, Norway
This repository accompanies the following article:
Ontology-driven aspect-based sentiment analysis classification: An infodemiological case study regarding infectious diseases in Latin America
Future Generation Computer Systems, Volume 112, 2020, Pages 641–657.
https://doi.org/10.1016/j.future.2020.08.032
Public health surveillance increasingly leverages social media data to monitor outbreaks, misinformation, and public attitudes toward infectious diseases. However, analyzing health-related content in Spanish presents linguistic challenges and requires domain knowledge. This work proposes an ontology-driven approach to Aspect-Based Sentiment Analysis (ABSA) applied to infectious diseases of major relevance in Latin America. We introduce a domain ontology capturing disease entities, symptoms, transmission modes, treatments, and preventive actions, and we annotate a corpus of Spanish tweets with aspects and sentiment polarity. A hybrid machine learning approach combining linguistic features, ontology-driven features, and word embeddings is evaluated. Results show that ontology-enriched models significantly outperform baseline approaches, demonstrating the importance of domain semantics for infodemiological analysis.
The repository contains two main data distributions:
- The original version of the dataset follows the structure used in the associated publication and includes tweet text and derived linguistic annotations. This distribution is preserved for reproducibility purposes.
- A consolidated, tabular version of the dataset is provided to improve interoperability and reuse, following the FAIR principles.
- corpus/infodemia_absa_public.csv
  - Public version of the dataset.
  - Contains tweet identifiers, sentiment labels, temporal information, and annotation agreement metadata.
  - Does not include tweet text or derived textual representations.
  - Fully compliant with Twitter Terms of Service.
A restricted version that includes tweet text is available for research purposes upon request.
Due to Twitter’s Terms of Service, the publicly available version of the dataset only includes tweet identifiers and annotations.
The full version of the dataset, including tweet text and derived linguistic features, can be made available for research purposes upon request and under controlled conditions. Requests can be submitted through this form: https://forms.gle/LuouKBCLYUXYRUJJ8
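As a starting point for working with the public file, the sketch below lists the columns of a CSV file and counts the values in one of them, using only the Python standard library. The exact column names of `corpus/infodemia_absa_public.csv` are not documented here, so the names `tweet_id` and `label` in the toy data are hypothetical; check the real header before relying on them.

```python
import csv
import io
from collections import Counter


def summarize_csv(handle, label_column=None):
    """Return (column names, row count, value counts) for an open CSV file."""
    reader = csv.DictReader(handle)
    label_counts = Counter()
    n_rows = 0
    for row in reader:
        n_rows += 1
        if label_column is not None:
            # A missing column is counted under the empty string.
            label_counts[row.get(label_column, "")] += 1
    return list(reader.fieldnames or []), n_rows, label_counts


# Toy in-memory stand-in for the public file (column names are hypothetical).
sample = io.StringIO("tweet_id,label\n1,positive\n2,negative\n3,negative\n")
columns, n_rows, counts = summarize_csv(sample, label_column="label")
print(columns, n_rows, dict(counts))

# With the real file (UTF-8 encoding is an assumption):
# with open("corpus/infodemia_absa_public.csv", newline="", encoding="utf-8") as f:
#     print(summarize_csv(f)[0])
```

Printing the header first makes it possible to pick the right label column without guessing its name.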
An example of the dataset is shown below:
Tweets were manually annotated following a voting-based agreement procedure. The dataset includes several agreement-related fields (e.g. number of votes, agreement level, and confidence score) to support transparency and reuse.
Details about the annotation process and agreement computation are described in the associated publication.
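To make the agreement fields easier to interpret, the following is a minimal sketch of one common way to aggregate annotator votes: majority voting with an agreement ratio. It is an illustration only and is not necessarily the procedure used for this corpus; the function name `aggregate_votes` and the tie-handling rule are assumptions.

```python
from collections import Counter


def aggregate_votes(votes):
    """Return (winning label or None on a tie, number of votes, agreement ratio)."""
    if not votes:
        raise ValueError("at least one vote is required")
    ranked = Counter(votes).most_common()
    top_label, top_count = ranked[0]
    # A tie for first place leaves the item without a consolidated label.
    tied = len(ranked) > 1 and ranked[1][1] == top_count
    agreement = top_count / len(votes)
    return (None if tied else top_label), len(votes), agreement


label, n_votes, agreement = aggregate_votes(["negative", "negative", "neutral"])
print(label, n_votes, round(agreement, 2))
```

Here the agreement ratio is the share of votes received by the most frequent label, so a unanimous item scores 1.0.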
The following pipeline illustrates the workflow followed in the study:
- Data acquisition from Twitter
- Corpus classification and preprocessing
- Feature extraction for sentiment classification
- Model evaluation using neural architectures (BiLSTM, CNN, LSTM) and MLP
The integration of linguistic modelling and semantic normalization supports robust, domain-aware sentiment analysis for infectious disease surveillance.
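The feature extraction step can be illustrated with a small sketch that concatenates a few hand-crafted linguistic features with an averaged word-embedding vector. Everything here is a simplified assumption: the three linguistic features, the toy two-dimensional embeddings, and the function names are hypothetical stand-ins, not the feature set or the FastText vectors used in the article.

```python
def mean_embedding(tokens, embeddings, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def linguistic_features(text):
    """Toy features: token count, numeric-token count, average token length."""
    tokens = text.split()
    n_tokens = len(tokens)
    n_numeric = sum(tok.isdigit() for tok in tokens)
    avg_len = sum(len(tok) for tok in tokens) / n_tokens if n_tokens else 0.0
    return [float(n_tokens), float(n_numeric), avg_len]


def combined_features(text, embeddings, dim):
    """Concatenate linguistic features with the mean embedding of the text."""
    tokens = text.lower().split()
    return linguistic_features(text) + mean_embedding(tokens, embeddings, dim)


# Hypothetical 2-dimensional embeddings standing in for real word vectors.
toy_embeddings = {"dengue": [1.0, 0.0], "fiebre": [0.0, 1.0]}
features = combined_features("dengue y fiebre 2019", toy_embeddings, dim=2)
print(features)
```

The resulting vector can be passed to any classifier that accepts fixed-length numeric input.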
The original article evaluates several hybrid models combining:
- Linguistic features (LF)
- Word embeddings (FastText)
- Classical ML classifiers (SVM, RF, SMO)
- The infectious-disease ontology proved essential for identifying, organizing, and normalizing aspect categories, ensuring consistent ABSA annotation across heterogeneous Spanish-language tweets from Latin America.
- The ontology enabled the extraction of medically relevant aspects (e.g., symptoms, transmission, prevention, treatments), which would be difficult to infer reliably using purely data-driven methods.
- Classification models based on linguistic features and word embeddings achieved strong performance in predicting sentiment toward each aspect, outperforming bag-of-words baselines.
- Word-embedding models (FastText) captured regional lexical variation across Latin American Spanish, improving the detection of disease-related terms and sentiment expressions.
- The combination of ontology-driven aspect extraction with linguistic + embedding-based classifiers provides a robust pipeline for infodemiological analysis on social media.
This figure shows the distribution of linguistic features across the three sentiment classes (positive, neutral, negative). It provides insight into how specific textual characteristics—such as numerals, syllable count, prepositions, verb types, or readability—vary depending on the user’s expressed attitude. Negative tweets often exhibit higher proportions of emotionally charged or structurally complex features, while positive and neutral tweets tend to remain more homogeneous. This analysis supports the design of feature-based sentiment classification models used in the study.
To observe the relations between the concepts of the ontology, we designed a tool that highlights ontology-mapped concepts directly within the tweet text, showing how domain knowledge is used to identify aspects relevant to infectious diseases. It also provides TF-IDF and TF-IDF_E weights for each concept, supporting both annotation and feature analysis workflows.
Ontology-based highlighting of concepts and TF-IDF scoring for semantic enrichment.
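The idea of highlighting concepts and weighting them can be sketched as follows. This is a hedged illustration, not the tool itself: the mini-lexicon `CONCEPT_TERMS` and its concept labels are hypothetical, the weighting is plain unsmoothed TF-IDF, and the TF-IDF_E variant is not reproduced because its definition is not given here.

```python
import math
import re
from collections import Counter

# Hypothetical mini-lexicon mapping surface forms to ontology concepts.
CONCEPT_TERMS = {
    "fiebre": "Symptom:Fever",
    "mosquito": "Transmission:Vector",
    "vacuna": "Prevention:Vaccine",
}


def find_concepts(text):
    """Return the concepts whose surface form appears in the text."""
    tokens = re.findall(r"\w+", text.lower())
    return [CONCEPT_TERMS[t] for t in tokens if t in CONCEPT_TERMS]


def highlight(text):
    """Wrap every word that maps to a concept in square brackets."""
    def mark(match):
        word = match.group(0)
        return f"[{word}]" if word.lower() in CONCEPT_TERMS else word
    return re.sub(r"\w+", mark, text)


def concept_tfidf(tweets):
    """Compute unsmoothed TF-IDF weights over concepts, one dict per tweet."""
    per_tweet = [Counter(find_concepts(t)) for t in tweets]
    doc_freq = Counter(c for counts in per_tweet for c in counts)
    weights = []
    for counts in per_tweet:
        total = sum(counts.values())
        weights.append({
            c: (n / total) * math.log(len(tweets) / doc_freq[c])
            for c, n in counts.items()
        })
    return weights


tweets = ["Tengo fiebre por el mosquito", "La vacuna contra la fiebre", "Hola"]
print(highlight(tweets[0]))
print(concept_tfidf(tweets)[0])
```

Concepts that occur in fewer tweets receive a higher weight, which is why the vector concept outweighs the fever concept in the first toy tweet.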
This interface enables interactive exploration of tweets related to infectious diseases. Users can filter tweets by disease or aspect category and inspect sentiment predictions at the tweet and aspect levels.
The system also displays related concepts (e.g., symptoms, transmission processes, regions) along with their positive/neutral/negative sentiment scores.
This visualization demonstrates how ontology-driven aspect extraction enhances interpretability and supports infodemiological monitoring.
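The filtering and aggregation behind such a view can be approximated with a short sketch. The record layout (a `disease` field plus a list of `(concept, polarity)` pairs) and the function name `sentiment_by_concept` are assumptions made for illustration; they do not describe the interface's actual data model.

```python
from collections import Counter, defaultdict


def sentiment_by_concept(records, disease=None):
    """Tally polarity labels per concept, optionally restricted to one disease."""
    tallies = defaultdict(Counter)
    for record in records:
        if disease is not None and record["disease"] != disease:
            continue
        for concept, polarity in record["aspects"]:
            tallies[concept][polarity] += 1
    return {concept: dict(counts) for concept, counts in tallies.items()}


# Toy records with hypothetical field names and values.
records = [
    {"disease": "dengue", "aspects": [("symptoms", "negative"), ("prevention", "positive")]},
    {"disease": "zika", "aspects": [("symptoms", "negative")]},
    {"disease": "dengue", "aspects": [("symptoms", "neutral")]},
]
print(sentiment_by_concept(records, disease="dengue"))
```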
The dataset is distributed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) licence.
This licence applies to the dataset annotations and structure. Tweet content remains subject to Twitter’s Terms of Service.
The dataset was created and annotated by the authors of the associated publication. The original release did not include a FAIR-oriented distribution.
A consolidated version has been introduced to improve interoperability and long-term reuse, while preserving the original data distribution used in the publication.
This work has been supported by the Spanish National Research Agency (AEI) and the European Regional Development Fund (FEDER/ERDF) through projects KBS4FIA (TIN2016-76323-R) and LaTe4PSP (PID2019-107652RB-I00). In addition, José Antonio García-Díaz has been supported by Banco Santander and University of Murcia through the Doctorado industrial programme.
@article{garcia2020ontology,
title={Ontology-driven aspect-based sentiment analysis classification: An infodemiological case study regarding infectious diseases in Latin America},
author={Garc{\'\i}a-D{\'\i}az, Jos{\'e} Antonio and C{\'a}novas-Garc{\'\i}a, Mar and Valencia-Garc{\'\i}a, Rafael},
journal={Future Generation Computer Systems},
volume={112},
pages={641--657},
year={2020},
publisher={Elsevier}
}
Alternatively, cite the Zenodo record:
García-Díaz, J. A. et al. (2026).
Spanish-Infodemia-ABSA-2020: A Spanish Twitter Corpus for Aspect-Based Sentiment Analysis.
Zenodo. https://doi.org/10.5281/zenodo.18245405




