Psychographic Traits Identification Based on Political Ideology: An Author Analysis Study on Spanish Politicians' Tweets Posted in 2020
- Creation of a Spanish dataset containing political tweets authored by official Spanish politicians during 2020.
- Annotation and computational modelling of political ideology (left-wing vs. right-wing) from textual and behavioural signals.
- Exploration of psychographic traits and behavioural cues in political communication on social media.
- Evaluation of multiple machine learning and deep learning models using linguistic, semantic, and interaction-level features.
- Identification of discriminative linguistic patterns that correlate strongly with political ideology.
- José Antonio García-Díaz, University of Murcia
- Ricardo Colomo-Palacios, Østfold University College
- Rafael Valencia-García, University of Murcia
Affiliations:
- Departamento de Informática y Sistemas, Universidad de Murcia, Spain
- Faculty of Computer Sciences, Østfold University College, Norway
This article was published in Future Generation Computer Systems (FGCS), Volume 130, 2022, Pages 59–74.
DOI: https://doi.org/10.1016/j.future.2021.01.015
Publisher page: https://www.sciencedirect.com/science/article/pii/S0167739X21004921
Political ideology strongly shapes how individuals interpret, communicate, and engage with political content online. In this study, we present a novel dataset composed of tweets published in 2020 by official Spanish politicians, labelled according to their political ideology. Using this corpus, we explore the extent to which linguistic, semantic, psychographic, and behavioural features can reveal ideological leaning. We evaluate a broad range of machine learning and deep learning models and analyse the psychographic traits associated with ideological groups. Our results show that several linguistic patterns and interaction-level behaviours provide robust signals for political ideology classification, enabling a richer understanding of political communication in social media environments.
Part of this dataset was used as the foundation for the PoliticES 2022 shared task on political ideology detection in Spanish, organised within IberLEF.
The task overview, published in the SEPLN journal, and the competition page are available at:
- PoliticES 2022 overview: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6446
- CodaLab competition page: https://codalab.lisn.upsaclay.fr/competitions/1948
This shared task helped benchmark the dataset in a competitive evaluation setting, providing further validation of its usefulness for political ideology and author profiling research.
Spanish-PoliCorpus-2020 is a Twitter-based corpus designed for research on political author profiling and author attribution in Spanish.
Each instance in the dataset corresponds to a tweet and is associated with a pseudonymised author identifier. Author-level traits are repeated across all tweets written by the same author. The dataset supports multiple experimental tasks through independent data splits.
Each instance includes:
- The text of the tweet
- Metadata (author, date, retweets, interactions)
- The manually annotated political ideology of the author
- Additional psychographic traits used for analysis
- Preprocessed and raw text files
Next, we show the label distribution per trait.
| Trait | Class | Total | Train | Val | Test |
|---|---|---|---|---|---|
| Gender | female | 113 | 67 | 23 | 23 |
| | male | 156 | 99 | 29 | 28 |

| Trait | Class | Total | Train | Val | Test |
|---|---|---|---|---|---|
| Age | 25-34 | 28 | 21 | 1 | 6 |
| | 35-49 | 126 | 80 | 23 | 23 |
| | 50-64 | 104 | 57 | 26 | 21 |
| | over 65 | 11 | 8 | 2 | 1 |

| Trait | Class | Total | Train | Val | Test |
|---|---|---|---|---|---|
| Spectrum (binary) | left | 146 | 88 | 31 | 27 |
| | right | 123 | 78 | 21 | 24 |
| Spectrum (multiclass) | left | 56 | 37 | 12 | 7 |
| | m-left | 90 | 51 | 19 | 20 |
| | m-right | 83 | 54 | 15 | 14 |
| | right | 39 | 23 | 6 | 10 |

| Trait | Class | Total | Train | Val | Test |
|---|---|---|---|---|---|
| Spectrum (binary) | left | 31 | - | - | 31 |
| | right | 20 | - | - | 20 |

| Trait | Class | Total | Train | Val | Test |
|---|---|---|---|---|---|
| Spectrum (multiclass) | left | 20 | - | - | 20 |
| | m-left | 11 | - | - | 11 |
| | m-right | 13 | - | - | 13 |
| | right | 7 | - | - | 7 |
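As a reference point when reading these counts, a majority-class baseline can be derived directly from the tables. The sketch below copies the Train and Test counts from the gender, age, and spectrum tables that report all three splits, and computes the test accuracy of a classifier that always predicts the most frequent training class. The names `COUNTS` and `majority_baseline` are illustrative and are not part of the dataset or the article.

```python
# Train/Test counts copied from the tables above; all names are illustrative.
COUNTS = {
    "gender": {"female": (67, 23), "male": (99, 28)},
    "age": {
        "25-34": (21, 6),
        "35-49": (80, 23),
        "50-64": (57, 21),
        "over 65": (8, 1),
    },
    "spectrum_binary": {"left": (88, 27), "right": (78, 24)},
    "spectrum_multiclass": {
        "left": (37, 7),
        "m-left": (51, 20),
        "m-right": (54, 14),
        "right": (23, 10),
    },
}


def majority_baseline(classes):
    """Return (most frequent training class, its accuracy on the test split)."""
    majority = max(classes, key=lambda name: classes[name][0])
    test_total = sum(test for _, test in classes.values())
    return majority, classes[majority][1] / test_total


for trait, classes in COUNTS.items():
    label, accuracy = majority_baseline(classes)
    print(f"{trait}: always predict '{label}' -> test accuracy {accuracy:.3f}")
```

Note that for the multiclass spectrum the most frequent training class (m-right) is not the most frequent test class (m-left), so this baseline is weaker there than the class counts alone might suggest.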
The public version of the dataset includes the following fields:
- `twitter_id`: Twitter identifier of the tweet.
- `author_id`: Pseudonymised author identifier.
- `gender`: Author gender label.
- `age_range`: Author age range label.
- `ideological_binary`: Binary ideological orientation.
- `ideological_multiclass`: Multiclass ideological orientation.
- `split_author_profiling`: Data split for the author profiling task.
- `split_author_attribution`: Data split for the author attribution task.
- `source`: Data source (Twitter).
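Because author-level traits are repeated on every tweet, author profiling experiments usually need one record per author. The following is a minimal sketch of that step using only the field names listed above. It assumes the public version is distributed as a CSV file and that the split columns take the values `train`, `val`, and `test`; neither assumption is confirmed here. The rows in `SAMPLE_CSV` are made up for illustration, and `authors_in_split` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration only. The CSV layout and the split values
# ("train", "val", "test") are assumptions, not documented facts.
SAMPLE_CSV = """twitter_id,author_id,gender,age_range,ideological_binary,ideological_multiclass,split_author_profiling,split_author_attribution,source
1001,a01,female,35-49,left,m-left,train,train,Twitter
1002,a01,female,35-49,left,m-left,train,test,Twitter
1003,a02,male,50-64,right,right,test,train,Twitter
"""

AUTHOR_TRAITS = ("gender", "age_range", "ideological_binary", "ideological_multiclass")


def authors_in_split(rows, split, split_field="split_author_profiling"):
    """Collapse tweet-level rows into one trait record per author for a split."""
    authors = {}
    tweet_ids = defaultdict(list)
    for row in rows:
        if row[split_field] != split:
            continue
        traits = {name: row[name] for name in AUTHOR_TRAITS}
        # Traits are repeated on every tweet, so they must agree per author.
        if authors.setdefault(row["author_id"], traits) != traits:
            raise ValueError(f"inconsistent traits for author {row['author_id']}")
        tweet_ids[row["author_id"]].append(row["twitter_id"])
    return authors, dict(tweet_ids)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
train_authors, train_tweets = authors_in_split(rows, "train")
print(train_authors)
print(train_tweets)
```

Passing `split_field="split_author_attribution"` selects the attribution split instead, where the same author can appear in more than one partition.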
This dataset has been curated following the FAIR (Findable, Accessible, Interoperable, Reusable) data principles.
A FAIR self-assessment has been conducted using the FAIR Data Self-Assessment Tool, documenting:
- the assignment of a persistent DOI via Zenodo,
- the availability of rich, machine-readable metadata,
- clear access conditions and licensing,
- and detailed provenance and anonymisation information.
The dataset is publicly available through Zenodo and GitHub and is intended for long-term reuse in research contexts.
To request access, please complete the following form: https://forms.gle/wq9tF26r4mgnjHVL9
The article evaluates several models and feature sets, including:
- Logistic Regression
- SVM
- Random Forest
- LSTM and BiLSTM architectures
- Transformer-based sentence embeddings
- Psychographic and metadata-driven features
The best performing models combine textual features with behavioural metadata, confirming that ideology can be inferred not only from language but also from patterns of interaction, self-presentation, and posting behaviour.
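To make the idea of combining textual features with behavioural metadata concrete, here is a minimal sketch using scikit-learn and SciPy. It is not the pipeline used in the article: TF-IDF stands in for the textual features, two invented numeric columns stand in for the behavioural metadata, and the texts, metadata values, and labels are placeholders chosen only so that the code runs.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Placeholder data: one entry per author (not taken from the corpus).
texts = [
    "texto de ejemplo uno",
    "texto de ejemplo dos",
    "otro mensaje de prueba",
    "mensaje de prueba final",
]
# Hypothetical behavioural metadata: [mean retweets, mean replies] per author.
metadata = np.array([[12.0, 3.0], [8.0, 1.0], [40.0, 9.0], [35.0, 7.0]])
labels = ["left", "left", "right", "right"]

# Textual view: sparse TF-IDF matrix.
tfidf = TfidfVectorizer()
X_text = tfidf.fit_transform(texts)

# Behavioural view: standardised so its scale is comparable to TF-IDF weights.
scaler = StandardScaler()
X_meta = csr_matrix(scaler.fit_transform(metadata))

# Early fusion: concatenate both views column-wise into one feature matrix.
X = hstack([X_text, X_meta]).tocsr()

classifier = LogisticRegression(max_iter=1000)
classifier.fit(X, labels)
predictions = classifier.predict(X)
print(X.shape, list(predictions))
```

Concatenating the views before training lets a single linear model weigh vocabulary and interaction statistics jointly. In a real experiment the vectoriser and scaler would be fitted on the training split only and then applied to the validation and test splits.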
This paper is part of the research project LaTe4PSP (PID2019-107652RB-I00) funded by MCIN/AEI/10.13039/501100011033. In addition, José Antonio García-Díaz was supported by Banco Santander and the University of Murcia through the Doctorado industrial programme.
@article{garcia2022psychographic,
title={Psychographic traits identification based on political ideology: An author analysis study on {S}panish politicians' tweets posted in 2020},
author={Garc{\'\i}a-D{\'\i}az, Jos{\'e} Antonio and Colomo-Palacios, Ricardo and Valencia-Garc{\'\i}a, Rafael},
journal={Future Generation Computer Systems},
volume={130},
pages={59--74},
year={2022},
publisher={Elsevier}
}
Or cite the Zenodo record:
García-Díaz, J. A. et al. (2026).
Spanish-PoliCorpus-2020: A Spanish Twitter Corpus for Author Profiling and Attribution.
Zenodo. https://doi.org/10.5281/zenodo.18245744
