Repository accompanying our paper titled Leveraging Large Language Models to Measure Gender Representation Bias in Gendered Language Corpora, presented at the 6th Workshop on Gender Bias in Natural Language Processing at ACL 2025. The research and experimental evaluation have been conducted for Spanish and Valencian.
Paper Authors: Erik Derner, Sara Sansalvador de la Fuente, Yoan Gutiérrez, Paloma Moreda, Nuria Oliver
Code and Data Authors: Erik Derner, Sara Sansalvador de la Fuente, Elena Maestre Hernández; includes sampled data from OPUS
Contact: erik@ellisalicante.org
- `code`: Code for dataset analysis, continual pretraining, inference, and validation
  - `bias-quantification`: Gender representation bias quantification in a given dataset
  - `continual-pretraining`: Continual pretraining and model inference to evaluate how gender representation bias in the training data propagates to model inference
  - `validation`: Validation of the gender representation bias quantification method on an annotated dataset
- `data`: Samples of corpora, annotated datasets, prompts, few-shot examples, a word skiplist, and stories for continual pretraining
  - `continual-pretraining`: Biased and balanced story datasets generated for the continual pretraining experiments
  - `corpora-en-es`: Samples of aligned parallel corpora in English and Spanish used in the experiments in the paper
  - `dataset-analysis`: Prompts, few-shot examples, and the skiplist for bias evaluation
  - `validation`: Annotated (ground-truth) data for validating the gender representation bias quantification method
- Python 3.12
- CUDA 12.1 to use GPU
- Clone or download the repository.
- Install the required packages:

  ```
  pip install -r requirements.txt
  ```

- Set the following environment variables if you want to use API inference with either of the providers:
  - `OPENAI_API_KEY` – OpenAI API key
  - `GROQ_API_KEY` – Groq API key
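On Linux or macOS, the variables can be exported in the shell before launching the notebooks (the key values below are placeholders, not real keys):

```shell
# Placeholder values -- replace with your actual API keys
export OPENAI_API_KEY="your-openai-api-key"
export GROQ_API_KEY="your-groq-api-key"
```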
In `code/bias-quantification`, use:
- `dataset-analysis-openai.ipynb` to analyze datasets in gendered languages using the OpenAI API
- `dataset-analysis-groq.ipynb` to analyze datasets in gendered languages using the Groq API
- `dataset-analysis-gp.ipynb` to analyze datasets in English using the Gender Polarity method
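Gender Polarity is, roughly, lexicon-based counting of gendered tokens in English text. A minimal illustrative sketch follows; the word lists and the normalized-difference score here are simplified assumptions, not the exact implementation in the notebook:

```python
import re

# Simplified gendered word lists (illustrative only, not the notebook's lexicon)
MALE_WORDS = {"he", "him", "his", "man", "men", "boy", "father", "brother"}
FEMALE_WORDS = {"she", "her", "hers", "woman", "women", "girl", "mother", "sister"}

def gender_polarity(text: str) -> float:
    """Score in [-1, 1]: positive = more male terms, negative = more female terms."""
    tokens = re.findall(r"[a-z']+", text.lower())
    male = sum(t in MALE_WORDS for t in tokens)
    female = sum(t in FEMALE_WORDS for t in tokens)
    if male + female == 0:
        return 0.0  # no gendered tokens found
    return (male - female) / (male + female)

print(gender_polarity("He said his brother met her sister."))  # (3 - 2) / 5 = 0.2
```

A corpus-level score can then be obtained by averaging the per-document scores.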
To extract a sample subset from a (potentially multilingual, aligned) text corpus, use `dataset-extraction.ipynb` in `code/bias-quantification`.
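Extracting an aligned subset amounts to sampling the same line indices from each language's file so the parallel alignment is preserved. A minimal sketch under that assumption (the helper name and seed are illustrative, not taken from the notebook):

```python
import random

def sample_parallel(en_lines, es_lines, n, seed=42):
    """Sample the same n line indices from two line-aligned sentence lists."""
    assert len(en_lines) == len(es_lines), "corpora must be aligned line by line"
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    idx = sorted(rng.sample(range(len(en_lines)), n))
    return [en_lines[i] for i in idx], [es_lines[i] for i in idx]

# Toy aligned corpus: sentence i in English pairs with sentence i in Spanish
en = [f"sentence {i}" for i in range(100)]
es = [f"oración {i}" for i in range(100)]
en_sample, es_sample = sample_parallel(en, es, 10)
```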
In `code/continual-pretraining`, use:
- `continual-pretraining.ipynb` to continually pretrain a HuggingFace model on a given raw text dataset
- `inference.ipynb` to run inference with a base or continually pretrained model
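Continual pretraining on raw text typically involves packing the corpus into fixed-length token blocks before feeding it to the model. A stdlib-only sketch of that preprocessing step, using whitespace tokenization as a stand-in for the model's tokenizer (the notebook itself handles this via HuggingFace tooling):

```python
def pack_into_blocks(texts, block_size):
    """Concatenate tokenized texts and split into fixed-length blocks,
    dropping the trailing remainder (as is common in causal LM training)."""
    tokens = [tok for text in texts for tok in text.split()]
    n_blocks = len(tokens) // block_size
    return [tokens[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

blocks = pack_into_blocks(["a b c d", "e f g"], block_size=3)
print(blocks)  # [['a', 'b', 'c'], ['d', 'e', 'f']]
```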
To validate the gender representation bias quantification method on an annotated dataset, use the notebooks in `code/validation`:
- `validation-gt-openai.ipynb` to perform the validation using the OpenAI API
- `validation-gt-groq.ipynb` to perform the validation using the Groq API
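Validation against ground-truth annotations boils down to comparing predicted labels with annotated ones. A minimal sketch computing accuracy and a confusion tally (the label values and helper name are illustrative assumptions, not the notebooks' exact metrics):

```python
from collections import Counter

def validate(predicted, ground_truth):
    """Return (accuracy, Counter of (gold, predicted) label pairs)."""
    assert len(predicted) == len(ground_truth)
    confusion = Counter(zip(ground_truth, predicted))
    correct = sum(p == g for p, g in zip(predicted, ground_truth))
    return correct / len(predicted), confusion

# Toy example with hypothetical "M"/"F" labels
gold = ["F", "M", "M", "F"]
pred = ["F", "M", "F", "F"]
acc, confusion = validate(pred, gold)
print(acc)  # 0.75
```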
This project is licensed under the MIT License. See the LICENSE.txt file for details.
If you use this code or data, please cite our paper:
@inproceedings{derner2025leveraging,
author = {Derner, Erik and Sansalvador de la Fuente, Sara and Guti{\'e}rrez, Yoan and Moreda, Paloma and Oliver, Nuria},
title = {Leveraging Large Language Models to Measure Gender Representation Bias in Gendered Language Corpora},
booktitle = {Proceedings of the 6th Workshop on Gender Bias in Natural Language Processing (GeBNLP)},
pages = {468--483},
publisher = {Association for Computational Linguistics},
address = {Vienna, Austria},
year = {2025},
month = {Aug}
}