A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations
PERSONA Benchmark (PersonaConvBench) is the first large-scale benchmark that integrates personalization and multi-turn conversation.
*Figure: In-context prompt construction for personalized conversational inference*

*Figure: Performance of GPT-4.1 on the PERSONA benchmark*
- **2025-05-20** Initial code and dataset released
- **2025-05-23** arXiv preprint released
- Why PERSONA Benchmark?
- Key Features
- Installation
- Quick Start
- Dataset
- Tested Models
- Results
- Project Structure
- Citation
- Contact
Traditional personalization datasets ignore dialogue structure, while classic multi-turn benchmarks treat users as anonymous. PersonaConvBench bridges this gap:
- **Personalized and Conversational** – Each task conditions on both the evolving thread and the author's history.
- **Realistic Scale** – 19k posts, 111k conversations, 3,800+ users across 10 domains.
- **Three Task Families** – Classification, regression, and generation allow holistic evaluation.
- **Strong Baselines** – GPT-4.1, Claude-3.5, LLaMA-3, DeepSeek-R1, and more.
| Feature | Description |
|---|---|
| P-Conv / P-NonConv / NP-Conv | Isolate the benefits of personalization and conversation structure |
| Domain Diversity | 10 Reddit domains, from worldnews to gaming |
| Unified Prompting Pipeline | Shared templates for few-shot and zero-shot evaluation |
| Open-Source & Reproducible | MIT-licensed code, Hugging Face mirror, one-command demo |
```bash
conda create -n persona-bench python=3.10 -y
conda activate persona-bench
conda install -c conda-forge git git-lfs -y
python -m pip install --upgrade pip
pip install -r requirements.txt
```

You can obtain PersonaConvBench with either of the methods below.
Method A (Google Drive) is recommended because it ships the exact splits used in our experiments. Click the Google Drive link below to download the processed dataset directly.
```bash
# 1) Download 'persona_bench_release.zip' from:
#    https://drive.google.com/file/d/1d8Ju2t0Aa_Nqi5r_cybjvHudYamAN9NX/view?usp=share_link
#    and move it into a 'data/PERSONA_Bench/' directory relative to where you
#    will run the subsequent commands, e.g.:
#    [Your Project Root]/data/PERSONA_Bench/merged_posts_remapped.json

# 2) Create the target directory (if you have not already created it)
mkdir -p data/PERSONA_Bench

# 3) Verify that 'merged_posts_remapped.json' is located at
#    data/PERSONA_Bench/merged_posts_remapped.json

# Alternative: download with gdown
python -m pip install --upgrade gdown
mkdir -p data/PERSONA_Bench
gdown --id 1d8Ju2t0Aa_Nqi5r_cybjvHudYamAN9NX -O data/PERSONA_Bench/merged_posts_remapped.json
```

The archive already contains the processed train / dev / test splits, so no further steps are required.
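After downloading, you can sanity-check the file before running any experiments. This is a minimal sketch that only assumes the file is a top-level JSON array of post records; the actual record schema is defined by the release archive and is not shown here:

```python
import json
from pathlib import Path

def load_posts(path: str) -> list:
    """Load the processed dataset; assumes a top-level JSON array of records."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError(f"expected a JSON array, got {type(data).__name__}")
    return data

# Quick demo against a tiny stand-in file; for the real check, point the
# loader at data/PERSONA_Bench/merged_posts_remapped.json instead.
sample = Path("sample_posts.json")
sample.write_text(json.dumps([{"id": 1}, {"id": 2}]), encoding="utf-8")
print(len(load_posts("sample_posts.json")))  # prints 2
```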
```bash
# Install Git LFS for large-file support
git lfs install

# Clone the official dataset repository
git clone https://huggingface.co/datasets/PERSONABench/PERSONA-Bench data/Raw_Data_Postized.json
```

Next, convert the raw JSON files to the canonical format with `Raw_Data_Process.py`. The script writes the processed dataset to `data/PERSONA_Bench/`, resulting in the same directory structure as Method A.
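To give a feel for what such a conversion does, here is a minimal sketch of one typical preprocessing step, anonymizing author names to stable integer ids. The field names and logic are illustrative assumptions; the actual canonical format is defined by `Raw_Data_Process.py`:

```python
def remap_authors(posts: list[dict]) -> list[dict]:
    """Replace raw author names with stable integer ids.

    Illustrative sketch of one preprocessing step; the real conversion
    lives in Raw_Data_Process.py and its schema is not shown here.
    """
    author_ids: dict[str, int] = {}
    remapped = []
    for post in posts:
        author = post["author"]  # assumed field name
        if author not in author_ids:
            author_ids[author] = len(author_ids)
        remapped.append({**post, "author": author_ids[author]})
    return remapped

raw = [{"author": "alice", "text": "hi"},
       {"author": "bob", "text": "yo"},
       {"author": "alice", "text": "again"}]
print(remap_authors(raw))
```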
```bash
# ── Sentiment classification ───────────────────────────────────────────────
python demo/3.1PromptMaker.py   # first open the script and fill in your API key
python demo/GPT3.1.py           # also adjust INPUT_JSON_FILE if you moved the data

# ── Regression ─────────────────────────────────────────────────────────────
python demo/3.2PromptMaker.py   # same edits as above
python demo/GPT3.2.py

# ── Generation ─────────────────────────────────────────────────────────────
python demo/3.3PromptMaker.py
python demo/GPT3.3.py
```
⚠️ **API credentials needed**
Each `demo/GPT*.py` script expects an API key (e.g., `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`).
Edit the file or export the key in your shell before running.
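One way to avoid hard-coding credentials in each script is to read them from the environment and fail fast when the key is missing. This is a minimal sketch, not the demo scripts' actual loading logic; the variable name `PERSONA_DEMO_KEY` is purely illustrative so the example does not touch real credentials:

```python
import os

def get_api_key(var: str = "OPENAI_API_KEY") -> str:
    """Read an API key from the environment, failing fast if it is missing."""
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"{var} is not set; export it before running the demo scripts")
    return key

# Demo with a hypothetical variable name and stand-in value:
os.environ["PERSONA_DEMO_KEY"] = "sk-demo"
print(get_api_key("PERSONA_DEMO_KEY"))  # prints "sk-demo"
```

In practice you would `export OPENAI_API_KEY=...` in your shell and call `get_api_key()` with no arguments.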
Below we use `PromptMakers/3.3PromptMaker.py` and `LLMEvaluators/3.3/GPT4.1.py`.
Adopt the same pattern for other tasks / models.
```bash
# 1) Prepare prompts
python PromptMakers/3.3PromptMaker.py \
    --input_json_file data/splits/generation_test.jsonl   # change to your path

# 2) Run the LLM (input is the prompt file produced in step 1)
python LLMEvaluators/3.3/GPT4.1.py \
    --input_json_file outputs/prompt_3.3_generation.jsonl \
    --output_jsonl_file outputs/gpt4.1_generation.jsonl   # where to save generations
```

**Required edits**

| File | What to change |
|---|---|
| `PromptMakers/3.3PromptMaker.py` | `INPUT_JSON_FILE` – point to the dataset split you want |
| `LLMEvaluators/3.3/GPT4.1.py` | `INPUT_JSON_FILE` / `OUTPUT_JSONL_FILE` – the prompt file from step 1 and the save path |
| `LLMEvaluators/3.3/GPT4.1.py` | API key / base URL section at the top |
If you evaluate DeepSeek or LLaMA models with the log-only workflow, run the log-level evaluator:

```bash
python LLMEvaluators/LogEvaluator/LogEval3.3.py \
    --input_jsonl_file PATH_TO_YOUR_JSONL
```

Open the `LogEvaluator` script and update `INPUT_JSONL_FILE` (plus any other paths) before running.
Tip: add the project root to `PYTHONPATH`, or run every command from the repo root, so that all relative imports resolve correctly.
| Field | Value |
|---|---|
| Name | PersonaConvBench |
| Size | 19,215 posts · 111,239 conversations · 3,878 users |
| Domains | See the table below |
| License | Research-only (Reddit TOS) |
| Domain | Style | Conversational Engagement | Conversation Purpose | User Interactivity |
|---|---|---|---|---|
| Worldnews | Formal | Debate-Driven | Education | Low |
| Science | | Debate-Driven | Education | Medium |
| Politics | | Debate-Driven | Education | Low |
| Technology | | Information Sharing | Education | Medium |
| Gaming | Casual | Community-Based | Entertainment | High |
| Life | | Community-Based | Socializing | Medium |
| Movies | | Opinion-Based | Entertainment | Medium |
| Books | | Opinion-Based | Entertainment | High |
| Entrepreneur | Motivational | Supportive | Advice/Support | High |
| Art | Creative | Community-Based | Entertainment | Medium |
| Model | Year | Type |
|---|---|---|
| GPT-4.1 | 2025 | Commercial |
| GPT-4o-mini | 2024 | Commercial |
| GPT-4o | 2024 | Commercial |
| GPT-4.1-mini | 2025 | Commercial |
| Claude-3.5 Sonnet | 2024 | Commercial |
| Gemini 2.5 Pro | 2025 | Commercial |
| LLaMA-3 70B-Instruct | 2024 | OSS |
| DeepSeek-R1 | 2025 | OSS |
| DeepSeek-V3 | 2025 | OSS |
| Mistral | 2024 | OSS |
| o3 | 2025 | Commercial |
| o4-mini | 2025 | Commercial |
Each cell reports **P-Conv (Ours) / P-NonConv**.

| Task | Metric | GPT-4.1 | GPT-4o-mini | Claude-3.5 | LLaMA-3 | DeepSeek-R1 |
|---|---|---|---|---|---|---|
| Sentiment Classification | Accuracy | 0.9122 / 0.7862 | 0.6875 / 0.6562 | 0.9109 / 0.8192 | 0.8458 / 0.7305 | 0.8853 / 0.7092 |
| | F1 | 0.9481 / 0.8720 | 0.7895 / 0.7640 | 0.9474 / 0.8908 | 0.8401 / 0.7495 | 0.8848 / 0.7362 |
| | MCC | 0.6770 / 0.2266 | 0.2268 / 0.1870 | 0.6721 / 0.3666 | 0.4420 / 0.2333 | 0.6070 / 0.2586 |
| Impact Forecasting | RMSE | 310.09 / 350.29 | 310.23 / 351.50 | 282.48 / 344.75 | 319.83 / 350.43 | 300.03 / 353.80 |
| | MAE | 97.52 / 113.46 | 97.64 / 115.80 | 85.39 / 109.27 | 101.25 / 113.18 | 89.59 / 112.52 |
| Next-Text Generation | ROUGE-1 | 0.2777 / 0.2248 | 0.2491 / 0.2121 | 0.2161 / 0.1645 | 0.2055 / 0.1540 | 0.1786 / 0.1359 |
| | ROUGE-L | 0.2115 / 0.1565 | 0.1906 / 0.1470 | 0.1719 / 0.1130 | 0.1572 / 0.1009 | 0.1395 / 0.0911 |
| | METEOR | 0.2677 / 0.2316 | 0.2120 / 0.2198 | 0.1913 / 0.1636 | 0.1838 / 0.1659 | 0.1649 / 0.1401 |
| | BLEU | 0.0604 / 0.0206 | 0.0330 / 0.0170 | 0.0549 / 0.0123 | 0.0480 / 0.0089 | 0.0423 / 0.0083 |
| | SBERT | 0.4757 / 0.4322 | 0.4381 / 0.3982 | 0.3942 / 0.3512 | 0.3733 / 0.3339 | 0.3699 / 0.3307 |
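As a worked example of how to read the results, the gap between the two conditions can be summarized as a relative improvement. The numbers below are copied from the GPT-4.1 column of the table above (for these three metrics, higher is better):

```python
# Relative improvement of P-Conv over P-NonConv for GPT-4.1,
# using values copied from the results table (P-Conv first).
gpt41 = {
    "Accuracy": (0.9122, 0.7862),
    "MCC": (0.6770, 0.2266),
    "BLEU": (0.0604, 0.0206),
}

gains = {
    metric: round((p_conv - p_nonconv) / p_nonconv * 100, 1)
    for metric, (p_conv, p_nonconv) in gpt41.items()
}
print(gains)  # {'Accuracy': 16.0, 'MCC': 198.8, 'BLEU': 193.2}
```

The pattern holds across models: personalization plus conversation structure helps modestly on accuracy-style metrics and dramatically on harder ones like MCC and BLEU.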
```
├── demo/
├── LLMEvaluators/
├── PromptMakers/
├── data/
├── config/
├── requirements.txt
├── Raw_Data_Process.py
└── README.md
```
@misc{li2025personalizedconversationalbenchmarksimulating,
title={A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations},
author={Li Li and Peilin Cai and Ryan A. Rossi and Franck Dernoncourt and Branislav Kveton and Junda Wu and Tong Yu and Linxin Song and Tiankai Yang and Yuehan Qin and Nesreen K. Ahmed and Samyadeep Basu and Subhojyoti Mukherjee and Ruiyi Zhang and Zhengmian Hu and Bo Ni and Yuxiao Zhou and Zichao Wang and Yue Huang and Yu Wang and Xiangliang Zhang and Philip S. Yu and Xiyang Hu and Yue Zhao},
year={2025},
eprint={2505.14106},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.14106}
}

Feel free to open an issue or send us an email.

