Spanish PoliCorpus 2020

Psychographic Traits Identification Based on Political Ideology: An Author Analysis Study on Spanish Politicians' Tweets Posted in 2020

TL;DR: Highlights

Creation of a Spanish dataset containing political tweets authored by official Spanish politicians during 2020.
Annotation and computational modelling of political ideology (left-wing vs. right-wing) from textual and behavioural signals.
Exploration of psychographic traits and behavioural cues in political communication on social media.
Evaluation of multiple machine learning and deep learning models using linguistic, semantic, and interaction-level features.
Identification of discriminative linguistic patterns that correlate strongly with political ideology.

Authors

José Antonio García-Díaz, University of Murcia
Ricardo Colomo-Palacios, Østfold University College
Rafael Valencia-García, University of Murcia

Affiliations:

Departamento de Informática y Sistemas, Universidad de Murcia, Spain

Faculty of Computer Sciences, Østfold University College, Norway

Publication

This article was published in Future Generation Computer Systems (FGCS), Volume 130, 2022, Pages 59–74.
DOI: https://doi.org/10.1016/j.future.2021.01.015
Publisher page: https://www.sciencedirect.com/science/article/pii/S0167739X21004921

Abstract

Political ideology strongly shapes how individuals interpret, communicate, and engage with political content online. In this study, we present a novel dataset composed of tweets published in 2020 by official Spanish politicians, labelled according to their political ideology. Using this corpus, we explore the extent to which linguistic, semantic, psychographic, and behavioural features can reveal ideological leaning. We evaluate a broad range of machine learning and deep learning models and analyse the psychographic traits associated with each ideological group. Our results show that several linguistic patterns and interaction-level behaviours provide robust signals for political ideology classification, enabling a richer understanding of political communication in social media environments.

Relation to Shared Tasks

Part of this dataset was used as the foundation for the PoliticES 2022 shared task on political ideology detection in Spanish, organised within IberLEF.
The task description and results are available in the SEPLN journal:

PoliticES 2022 overview: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6446
CodaLab competition page: https://codalab.lisn.upsaclay.fr/competitions/1948

This shared task helped benchmark the dataset in a competitive evaluation setting, providing further validation of its usefulness for political ideology and author profiling research.

Dataset

Spanish-PoliCorpus-2020 is a Twitter-based corpus designed for research on political author profiling and author attribution in Spanish.

Each instance in the dataset corresponds to a tweet and is associated with a pseudonymised author identifier. Author-level traits are repeated across all tweets written by the same author. The dataset supports multiple experimental tasks through independent data splits.

Each instance includes:

The text of the tweet
Metadata (author, date, retweets, interactions)
The manually annotated political ideology of the author
Additional psychographic traits used for analysis
Preprocessed and raw text files

Dataset distribution

The tables below show the label distribution for each trait, in total and across the train, validation (Val), and test splits.

Demographic trait: gender

Trait	Class	Total	Train	Val	Test
Gender	female	113	67	23	23
	male	156	99	29	28

Demographic trait: age range

Trait	Class	Total	Train	Val	Test
Age	25-34	28	21	1	6
	35-49	126	80	23	23
	50-64	104	57	26	21
	over 65	11	8	2	1

Psychographic trait: political spectrum (binary and multiclass)

Trait	Class	Total	Train	Val	Test
Spectrum (binary)	left	146	88	31	27
	right	123	78	21	24

Trait	Class	Total	Train	Val	Test
Spectrum (multiclass)	left	56	37	12	7
	m-left	90	51	19	20
	m-right	83	54	15	14
	right	39	23	6	10

Psychographic trait: political spectrum of journalists (used for evaluation only, so all instances are in the test split)

Trait	Class	Total	Train	Val	Test
Spectrum (binary)	left	31	-	-	31
	right	20	-	-	20

Trait	Class	Total	Train	Val	Test
Spectrum (multiclass)	left	20	-	-	20
	m-left	11	-	-	11
	m-right	13	-	-	13
	right	7	-	-	7

Data fields

The public version of the dataset includes the following fields:

twitter_id: Twitter identifier of the tweet.
author_id: Pseudonymised author identifier.
gender: Author gender label.
age_range: Author age range label.
ideological_binary: Binary ideological orientation.
ideological_multiclass: Multiclass ideological orientation.
split_author_profiling: Data split for the author profiling task.
split_author_attribution: Data split for the author attribution task.
source: Data source (Twitter).
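
Because author-level traits are repeated on every tweet row, an author-level table can be recovered by grouping on author_id. The following is a minimal sketch using only the Python standard library. The inline CSV sample, its values, the split labels (train, val, test), and the assumption that split_author_profiling is constant for each author are all invented for illustration; the released files may use a different layout.

```python
import csv
import io
from collections import Counter

# Hypothetical sample using the field names listed above; all values are invented.
SAMPLE = """twitter_id,author_id,gender,age_range,ideological_binary,ideological_multiclass,split_author_profiling,split_author_attribution,source
1001,a1,female,35-49,left,m-left,train,train,Twitter
1002,a1,female,35-49,left,m-left,train,test,Twitter
1003,a2,male,50-64,right,right,test,train,Twitter
1004,a2,male,50-64,right,right,test,val,Twitter
"""

# Fields assumed to be identical on every row of the same author.
AUTHOR_FIELDS = (
    "gender",
    "age_range",
    "ideological_binary",
    "ideological_multiclass",
    "split_author_profiling",
)


def authors_from_rows(rows):
    """Collapse tweet-level rows into one record per author."""
    authors = {}
    for row in rows:
        traits = {field: row[field] for field in AUTHOR_FIELDS}
        seen = authors.setdefault(row["author_id"], traits)
        if seen != traits:
            raise ValueError(f"inconsistent traits for {row['author_id']}")
    return authors


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
authors = authors_from_rows(rows)

# Author-level label distribution for the profiling training split.
train_labels = Counter(
    author["ideological_binary"]
    for author in authors.values()
    if author["split_author_profiling"] == "train"
)
print(len(authors), dict(train_labels))
```

Raising an error on inconsistent traits is a deliberate choice here: it surfaces rows where an author-level label differs between tweets instead of silently keeping the first value.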

Access

This dataset has been curated following the FAIR (Findable, Accessible, Interoperable, Reusable) data principles.

A FAIR self-assessment has been conducted using the FAIR Data Self-Assessment Tool (FAISS), documenting:

the assignment of a persistent DOI via Zenodo,
the availability of rich, machine-readable metadata,
clear access conditions and licensing,
and detailed provenance and anonymisation information.

The dataset is publicly available through Zenodo and GitHub and is intended for long-term reuse in research contexts.

To request access, please complete the following form: https://forms.gle/wq9tF26r4mgnjHVL9

Architecture

Figure: architecture diagram (policorpus-architecture-1.png in the repository).

Evaluation Summary

The article evaluates several models and feature sets, including:

Logistic Regression
SVM
Random Forest
LSTM and BiLSTM architectures
Transformer-based sentence embeddings
Psychographic and metadata-driven features

The best performing models combine textual features with behavioural metadata, confirming that ideology can be inferred not only from language but also from patterns of interaction, self-presentation, and posting behaviour.
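
To make the idea of combining textual and behavioural signals concrete, here is a minimal sketch of feature-level fusion in plain Python: word frequencies and per-author interaction statistics are merged into one namespaced feature dictionary that a downstream classifier could consume. The feature names (retweet_ratio, tweets_per_day), the example values, and the helper functions are hypothetical and are not taken from the article's pipeline.

```python
import re
from collections import Counter


def text_features(tweets):
    """Relative frequency of each lowercase word token across an author's tweets."""
    tokens = [tok for tweet in tweets for tok in re.findall(r"\w+", tweet.lower())]
    counts = Counter(tokens)
    total = sum(counts.values()) or 1  # avoid division by zero for empty input
    return {word: n / total for word, n in counts.items()}


def fuse(text_feats, behaviour_feats):
    """Concatenate two feature views, prefixing keys to avoid name clashes."""
    fused = {f"text:{key}": value for key, value in text_feats.items()}
    fused.update({f"behaviour:{key}": value for key, value in behaviour_feats.items()})
    return fused


# Invented example author: two tweets plus hypothetical behavioural statistics.
tweets = ["Mejor sanidad para todos", "Sanidad y empleo para todos"]
behaviour = {"retweet_ratio": 0.4, "tweets_per_day": 3.5}

features = fuse(text_features(tweets), behaviour)
print(features["text:sanidad"], features["behaviour:retweet_ratio"])
```

The key prefixes keep the two views separate after merging, so a word such as "retweet_ratio" appearing in a tweet could never overwrite the behavioural statistic of the same name.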

Acknowledgments

This paper is part of the research project LaTe4PSP (PID2019-107652RB-I00) funded by MCIN/AEI/10.13039/501100011033. In addition, José Antonio García-Díaz was supported by Banco Santander and the University of Murcia through the Doctorado industrial programme.

Citation

@article{garcia2022psychographic,
  title={Psychographic traits identification based on political ideology: An author analysis study on spanish politicians’ tweets posted in 2020},
  author={Garc{\'\i}a-D{\'\i}az, Jos{\'e} Antonio and Colomo-Palacios, Ricardo and Valencia-Garc{\'\i}a, Rafael},
  journal={Future Generation Computer Systems},
  volume={130},
  pages={59--74},
  year={2022},
  publisher={Elsevier}
}

Or cite the Zenodo record

García-Díaz, J. A. et al. (2026).
Spanish-PoliCorpus-2020: A Spanish Twitter Corpus for Author Profiling and Attribution.
Zenodo. https://doi.org/10.5281/zenodo.18245744
