
PERSONA Benchmark

A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations

Read our paper on arXiv: https://arxiv.org/abs/2505.14106


PERSONA Benchmark (PᴇʀꜱᴏɴᴀCᴏɴᴠBᴇɴᴄʜ) is the first large-scale benchmark that integrates personalization and multi-turn conversation.

[Figure] In-context prompt construction for personalized conversational inference
[Figure] Performance of GPT-4.1 on the PERSONA benchmark


🚀 Recent News

  • 🗓️2025-05-20  Initial code and dataset released
  • 🗓️2025-05-23  Preprint released on arXiv

📖 Table of Contents

  1. Why PERSONA Benchmark?
  2. Key Features
  3. Installation
  4. Quick Start
  5. Dataset
  6. Tested Models
  7. Results
  8. Project Structure
  9. Citation
  10. Contact

Why PERSONA Benchmark?

Traditional personalization datasets ignore dialogue structure, while classic multi-turn benchmarks treat users as anonymous. PᴇʀꜱᴏɴᴀCᴏɴᴠBᴇɴᴄʜ bridges this gap:

  • Personalized and Conversational – Each task conditions on both the evolving thread and the author’s history.
  • Realistic Scale – 19K posts, 111K conversations, 3,800+ users across 10 domains.
  • Three Task Families – Classification, regression, and generation allow holistic evaluation.
  • Strong Baselines – GPT‑4.1, Claude‑3.5, LLaMA‑3, DeepSeek-R1, and more.

Key Features

  • P‑Conv / P‑NonConv / NP‑Conv – isolate the benefit of personalization and conversation structure
  • Domain Diversity – 10 Reddit domains, from worldnews to gaming
  • Unified Prompting Pipeline – few‑shot templates for zero‑shot evaluation
  • Open-Source & Reproducible – MIT-licensed code, Hugging Face mirror, one-command demo
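To make the three settings concrete, the sketch below shows how such prompts could be assembled. The template, section headers, and data layout here are illustrative only, not the repository's actual prompt format.

```python
def build_prompt(setting, user_history, conversation, query):
    """Assemble an in-context prompt for one evaluation setting.

    P-Conv    -> author history + conversation thread (full context)
    P-NonConv -> author history only (personalization without the thread)
    NP-Conv   -> conversation thread only (anonymous author)
    """
    parts = []
    if setting in ("P-Conv", "P-NonConv"):
        parts.append("### Author history\n" + "\n".join(user_history))
    if setting in ("P-Conv", "NP-Conv"):
        parts.append("### Conversation so far\n" + "\n".join(conversation))
    parts.append("### Task\n" + query)
    return "\n\n".join(parts)


history = ["[worldnews] Posted a long analysis of the treaty.",
           "[gaming] Complained about matchmaking."]
thread = ["OP: What did everyone think of the update?",
          "Reply: It grew on me after a few days."]
prompt = build_prompt("P-Conv", history, thread,
                      "Predict the sentiment of the author's next reply.")
print(prompt.splitlines()[0])  # ### Author history
```

Comparing a model's scores across the three settings then isolates how much each context source contributes.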

Installation

1 · Create conda environment

conda create -n persona-bench python=3.10 -y
conda activate persona-bench
conda install -c conda-forge git git-lfs -y

2 · Install dependencies

python -m pip install --upgrade pip
pip install -r requirements.txt

3 · Download the Dataset

You can obtain PᴇʀꜱᴏɴᴀCᴏɴᴠBᴇɴᴄʜ with either of the methods below.
Method A (Google Drive) is recommended because it ships the exact splits used in our experiments.

A. Google Drive — ready‑to‑use release (recommended)

Click the Google Drive link below to download our processed dataset.

# 1) Download the release file from:
#    https://drive.google.com/file/d/1d8Ju2t0Aa_Nqi5r_cybjvHudYamAN9NX/view?usp=share_link

# 2) Create the target directory and move the file there
mkdir -p data/PERSONA_Bench
# Expected location:
#   data/PERSONA_Bench/merged_posts_remapped.json

# Alternative: fetch it from the command line with gdown
python -m pip install --upgrade gdown
mkdir -p data/PERSONA_Bench
gdown --id 1d8Ju2t0Aa_Nqi5r_cybjvHudYamAN9NX -O data/PERSONA_Bench/merged_posts_remapped.json

The release already contains the processed train / dev / test splits, so no further steps are required.
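A quick sanity check confirms the file parses before you run any evaluation. The snippet below only assumes the payload is a JSON array (or an object keyed by id) of record dicts; adjust the path if you stored the file elsewhere.

```python
import json
from pathlib import Path

def summarize(payload):
    """Return (record_count, field_names_of_first_record) for a loaded JSON payload."""
    records = list(payload.values()) if isinstance(payload, dict) else payload
    first = records[0] if records else {}
    fields = sorted(first) if isinstance(first, dict) else []
    return len(records), fields

path = Path("data/PERSONA_Bench/merged_posts_remapped.json")
if path.exists():
    count, fields = summarize(json.loads(path.read_text(encoding="utf-8")))
    print(f"{count} records; fields of first record: {fields}")
```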


B. Hugging Face — raw records (requires preprocessing)

# Install Git LFS for large‑file support
git lfs install

# Clone the official dataset repository
git clone https://huggingface.co/datasets/PERSONABench/PERSONA-Bench data/PERSONA-Bench
# The raw records (e.g. Raw_Data_Postized.json) end up under data/PERSONA-Bench/

Next, convert the raw JSON files to the canonical format with Raw_Data_Process.py. The script writes the processed dataset to data/PERSONA_Bench/, producing the same directory structure as Method A.


Quick Start

1 · Demo-level evaluation (quick sanity check)

# ── Sentiment classification ───────────────────────────────────────────────
python demo/3.1PromptMaker.py    # 👈 first open the script and fill in your API key
python demo/GPT3.1.py            #    also adjust INPUT_JSON_FILE if you moved the data

# ── Regression ────────────────────────────────────────────────────────────
python demo/3.2PromptMaker.py    # same edits as above
python demo/GPT3.2.py

# ── Generation ────────────────────────────────────────────────────────────
python demo/3.3PromptMaker.py
python demo/GPT3.3.py

⚠️ API credentials needed
Each demo/GPT*.py script expects an API key (e.g., OPENAI_API_KEY, ANTHROPIC_API_KEY).
Edit the file or export the key in your shell before running.
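As an alternative to editing each file, the scripts can read the key from an environment variable. The helper below is a sketch; the variable name must match whatever the script actually reads.

```python
import os

def require_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, failing fast with a clear message."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"Set {name} first, e.g.  export {name}=sk-...")
    return key

# Typical use near the top of a demo script (client class is illustrative):
# client = OpenAI(api_key=require_key())
```

Then `export OPENAI_API_KEY=...` in your shell before launching the demo scripts.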


2 · Full-scale evaluation with the official pipeline

Below we use PromptMakers/3.3PromptMaker.py and LLMEvaluators/3.3/GPT4.1.py.
Adopt the same pattern for other tasks / models.

# ① Prepare prompts
python PromptMakers/3.3PromptMaker.py \
    --input_json_file data/splits/generation_test.jsonl      # 👈 change to your path

# ② Run the LLM
python LLMEvaluators/3.3/GPT4.1.py \
    --input_json_file outputs/prompt_3.3_generation.jsonl \  # 👈 result of step ①
    --output_jsonl_file outputs/gpt4.1_generation.jsonl      # 👈 where to save generations

Required edits

  • PromptMakers/3.3PromptMaker.py – set INPUT_JSON_FILE to the dataset split you want
  • LLMEvaluators/3.3/GPT4.1.py – set INPUT_JSON_FILE to the prompt file from step ① and OUTPUT_JSONL_FILE to the save path
  • LLMEvaluators/3.3/GPT4.1.py – fill in the API key / base URL section at the top

3 · Post-hoc metric computation

If you evaluate DeepSeek or Llama models with the log-only workflow, run the log-level evaluator:

python LLMEvaluators/LogEvaluator/LogEval3.3.py \
    --input_jsonl_file PATH_TO_YOUR_JSONL   # 👈 also update inside the script

Open LogEval3.3.py and update INPUT_JSONL_FILE (plus any other paths) before running.
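For intuition about what the log-level evaluator computes: ROUGE-1, one of the generation metrics, is unigram-overlap F1 between a generation and its reference. The toy implementation below uses bare whitespace tokenization; real evaluators refine this with stemming and proper tokenization.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a generated text and a reference text."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("the cat sat", "the cat sat on the mat"), 4))  # 0.6667
```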


Tip: add the project root to PYTHONPATH or run every command from the repo root so that all relative imports resolve correctly.


Dataset

  • Name – PᴇʀꜱᴏɴᴀCᴏɴᴠBᴇɴᴄʜ
  • Size – 19,215 posts · 111,239 conversations · 3,878 users
  • Domains – see the table below
  • License – research-only (subject to the Reddit Terms of Service)

Domain       | Style        | Conversational Engagement | Conversation Purpose | User Interactivity
Worldnews    | Formal       | Debate-Driven             | Education            | Low
Science      | Formal       | Debate-Driven             | Education            | Medium
Politics     | Formal       | Debate-Driven             | Education            | Low
Technology   | Formal       | Information Sharing       | Education            | Medium
Gaming       | Casual       | Community-Based           | Entertainment        | High
Life         | Casual       | Community-Based           | Socializing          | Medium
Movies       | Casual       | Opinion-Based             | Entertainment        | Medium
Books        | Casual       | Opinion-Based             | Entertainment        | High
Entrepreneur | Motivational | Supportive                | Advice/Support       | High
Art          | Creative     | Community-Based           | Entertainment        | Medium

Tested Models

Model                | Year | Type
GPT-4.1              | 2025 | Commercial
GPT-4o-mini          | 2024 | Commercial
GPT-4o               | 2024 | Commercial
GPT-4.1-mini         | 2025 | Commercial
Claude-3.5 Sonnet    | 2024 | Commercial
Gemini 2.5 Pro       | 2025 | Commercial
LLaMA-3 80B-Instruct | 2024 | OSS
DeepSeek-R1          | 2025 | OSS
DeepSeek-V3          | 2025 | OSS
Mistral              | 2024 | OSS
o3                   | 2025 | Commercial
o4-mini              | 2025 | Commercial

Results

Each cell reports P-Conv (Ours) / P-NonConv.

Task                     | Metric   | GPT-4.1         | GPT-4o-mini     | Claude-3.5      | LLaMA-3         | DeepSeek-R1
Sentiment Classification | Accuracy | 0.9122 / 0.7862 | 0.6875 / 0.6562 | 0.9109 / 0.8192 | 0.8458 / 0.7305 | 0.8853 / 0.7092
                         | F1       | 0.9481 / 0.8720 | 0.7895 / 0.7640 | 0.9474 / 0.8908 | 0.8401 / 0.7495 | 0.8848 / 0.7362
                         | MCC      | 0.6770 / 0.2266 | 0.2268 / 0.1870 | 0.6721 / 0.3666 | 0.4420 / 0.2333 | 0.6070 / 0.2586
Impact Forecasting       | RMSE     | 310.09 / 350.29 | 310.23 / 351.50 | 282.48 / 344.75 | 319.83 / 350.43 | 300.03 / 353.80
                         | MAE      | 97.52 / 113.46  | 97.64 / 115.80  | 85.39 / 109.27  | 101.25 / 113.18 | 89.59 / 112.52
Next-Text Generation     | ROUGE-1  | 0.2777 / 0.2248 | 0.2491 / 0.2121 | 0.2161 / 0.1645 | 0.2055 / 0.1540 | 0.1786 / 0.1359
                         | ROUGE-L  | 0.2115 / 0.1565 | 0.1906 / 0.1470 | 0.1719 / 0.1130 | 0.1572 / 0.1009 | 0.1395 / 0.0911
                         | METEOR   | 0.2677 / 0.2316 | 0.2120 / 0.2198 | 0.1913 / 0.1636 | 0.1838 / 0.1659 | 0.1649 / 0.1401
                         | BLEU     | 0.0604 / 0.0206 | 0.0330 / 0.0170 | 0.0549 / 0.0123 | 0.0480 / 0.0089 | 0.0423 / 0.0083
                         | SBERT    | 0.4757 / 0.4322 | 0.4381 / 0.3982 | 0.3942 / 0.3512 | 0.3733 / 0.3339 | 0.3699 / 0.3307
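A note on MCC in the sentiment rows: unlike accuracy, it balances all four confusion-matrix cells, which is why it drops far more sharply than accuracy when conversational context is removed (e.g. GPT-4.1: 0.6770 → 0.2266 versus 0.9122 → 0.7862 accuracy). For a binary task it is computed as below; the counts are made-up illustration values.

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0

print(round(mcc(tp=90, tn=30, fp=10, fn=20), 4))  # 0.533
```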

Project Structure

├─ demo/
├─ LLMEvaluators/
├─ PromptMakers/
├─ data/
├─ config/
├─ requirements.txt
├─ Raw_Data_Process.py
└─ README.md

Citation

Thank you for your interest in our work!
@misc{li2025personalizedconversationalbenchmarksimulating,
  title={A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations},
  author={Li Li and Peilin Cai and Ryan A. Rossi and Franck Dernoncourt and Branislav Kveton and Junda Wu and Tong Yu and Linxin Song and Tiankai Yang and Yuehan Qin and Nesreen K. Ahmed and Samyadeep Basu and Subhojyoti Mukherjee and Ruiyi Zhang and Zhengmian Hu and Bo Ni and Yuxiao Zhou and Zichao Wang and Yue Huang and Yu Wang and Xiangliang Zhang and Philip S. Yu and Xiyang Hu and Yue Zhao},
  year={2025},
  eprint={2505.14106},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.14106}
}

Contact

Feel free to open a GitHub issue or email us.