
PERSONA Benchmark

A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations

Read our paper on arXiv: https://arxiv.org/abs/2505.14106


PERSONA Benchmark (PᴇʀꜱᴏɴᴀCᴏɴᴠBᴇɴᴄʜ) is the first large-scale benchmark that integrates personalization and multi-turn conversation.

[Figure] In-context prompt construction for personalized conversational inference
[Figure] Performance of GPT-4.1 on the PERSONA benchmark


🚀 Recent News

  • 🗓️2025-05-20  Initial code and dataset released
  • 🗓️2025-05-23  Preprint released on arXiv

📖 Table of Contents

  1. Why PERSONA Benchmark?
  2. Key Features
  3. Installation
  4. Quick Start
  5. Dataset
  6. Tested Models
  7. Results
  8. Project Structure
  9. Citation
  10. Contact

Why PERSONA Benchmark?

Traditional personalization datasets ignore dialogue structure, while classic multi-turn benchmarks treat users as anonymous. PᴇʀꜱᴏɴᴀCᴏɴᴠBᴇɴᴄʜ bridges this gap:

  • Personalized and Conversational – Each task conditions on both the evolving thread and the author’s history.
  • Realistic Scale – 19K posts, 111K conversations, 3,800+ users across 10 domains.
  • Three Task Families – Classification, regression, and generation allow holistic evaluation.
  • Strong Baselines – GPT‑4.1, Claude‑3.5, LLaMA‑3, DeepSeek-R1, and more.

Key Features

  • P‑Conv / P‑NonConv / NP‑Conv – isolate the benefit of personalization and conversation structure
  • Domain Diversity – 10 Reddit domains, from worldnews to gaming
  • Unified Prompting Pipeline – few‑shot templates for zero‑shot evaluation
  • Open-Source & Reproducible – MIT-licensed code, Hugging Face mirror, one-command demo
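To make the three settings concrete, the sketch below shows how such prompts could be assembled. The template, section headers, and data layout here are illustrative only, not the repository's actual prompt format.

```python
def build_prompt(setting, user_history, conversation, query):
    """Assemble an in-context prompt for one evaluation setting.

    P-Conv    -> author history + conversation thread (full context)
    P-NonConv -> author history only (personalization without the thread)
    NP-Conv   -> conversation thread only (anonymous author)
    """
    parts = []
    if setting in ("P-Conv", "P-NonConv"):
        parts.append("### Author history\n" + "\n".join(user_history))
    if setting in ("P-Conv", "NP-Conv"):
        parts.append("### Conversation so far\n" + "\n".join(conversation))
    parts.append("### Task\n" + query)
    return "\n\n".join(parts)


history = ["[worldnews] Posted a long analysis of the treaty.",
           "[gaming] Complained about matchmaking."]
thread = ["OP: What did everyone think of the update?",
          "Reply: It grew on me after a few days."]
prompt = build_prompt("P-Conv", history, thread,
                      "Predict the sentiment of the author's next reply.")
print(prompt.splitlines()[0])  # ### Author history
```

Comparing a model's scores across the three settings then isolates how much each context source contributes.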

Installation

1 · Create conda environment

conda create -n persona-bench python=3.10 -y
conda activate persona-bench
conda install -c conda-forge git git-lfs -y

2 · Install dependencies

python -m pip install --upgrade pip
pip install -r requirements.txt

3 · Download the Dataset

You can obtain PᴇʀꜱᴏɴᴀCᴏɴᴠBᴇɴᴄʜ with either of the methods below.
Method A (Google Drive) is recommended because it ships the exact splits used in our experiments.

A. Google Drive — ready‑to‑use release (recommended)

Click the Google Drive link below to download our processed dataset.

# 1) Download the release file from:
#    https://drive.google.com/file/d/1d8Ju2t0Aa_Nqi5r_cybjvHudYamAN9NX/view?usp=share_link

# 2) Create the target directory and move the file there
mkdir -p data/PERSONA_Bench
# Expected location:
#   data/PERSONA_Bench/merged_posts_remapped.json

# Alternative: fetch it from the command line with gdown
python -m pip install --upgrade gdown
mkdir -p data/PERSONA_Bench
gdown --id 1d8Ju2t0Aa_Nqi5r_cybjvHudYamAN9NX -O data/PERSONA_Bench/merged_posts_remapped.json

The release already contains the processed train / dev / test splits, so no further steps are required.
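A quick sanity check confirms the file parses before you run any evaluation. The snippet below only assumes the payload is a JSON array (or an object keyed by id) of record dicts; adjust the path if you stored the file elsewhere.

```python
import json
from pathlib import Path

def summarize(payload):
    """Return (record_count, field_names_of_first_record) for a loaded JSON payload."""
    records = list(payload.values()) if isinstance(payload, dict) else payload
    first = records[0] if records else {}
    fields = sorted(first) if isinstance(first, dict) else []
    return len(records), fields

path = Path("data/PERSONA_Bench/merged_posts_remapped.json")
if path.exists():
    count, fields = summarize(json.loads(path.read_text(encoding="utf-8")))
    print(f"{count} records; fields of first record: {fields}")
```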


B. Hugging Face — raw records (requires preprocessing)

# Install Git LFS for large‑file support
git lfs install

# Clone the official dataset repository
git clone https://huggingface.co/datasets/PERSONABench/PERSONA-Bench data/PERSONA-Bench
# The raw records (e.g. Raw_Data_Postized.json) end up under data/PERSONA-Bench/

Next, convert the raw JSON files to the canonical format with Raw_Data_Process.py. The script writes the processed dataset to data/PERSONA_Bench/, producing the same directory structure as Method A.


Quick Start

1 · Demo-level evaluation (quick sanity check)

# ── Sentiment classification ───────────────────────────────────────────────
python demo/3.1PromptMaker.py    # 👈 first open the script and fill in your API key
python demo/GPT3.1.py            #    also adjust INPUT_JSON_FILE if you moved the data

# ── Regression ────────────────────────────────────────────────────────────
python demo/3.2PromptMaker.py    # same edits as above
python demo/GPT3.2.py

# ── Generation ────────────────────────────────────────────────────────────
python demo/3.3PromptMaker.py
python demo/GPT3.3.py

⚠️ API credentials needed
Each demo/GPT*.py script expects an API key (e.g., OPENAI_API_KEY, ANTHROPIC_API_KEY).
Edit the file or export the key in your shell before running.
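As an alternative to editing each file, the scripts can read the key from an environment variable. The helper below is a sketch; the variable name must match whatever the script actually reads.

```python
import os

def require_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, failing fast with a clear message."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"Set {name} first, e.g.  export {name}=sk-...")
    return key

# Typical use near the top of a demo script (client class is illustrative):
# client = OpenAI(api_key=require_key())
```

Then `export OPENAI_API_KEY=...` in your shell before launching the demo scripts.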


2 · Full-scale evaluation with the official pipeline

Below we use PromptMakers/3.3PromptMaker.py and LLMEvaluators/3.3/GPT4.1.py.
Adopt the same pattern for other tasks / models.

# ① Prepare prompts
python PromptMakers/3.3PromptMaker.py \
    --input_json_file data/splits/generation_test.jsonl      # 👈 change to your path

# ② Run the LLM
python LLMEvaluators/3.3/GPT4.1.py \
    --input_json_file outputs/prompt_3.3_generation.jsonl \  # 👈 result of step ①
    --output_jsonl_file outputs/gpt4.1_generation.jsonl      # 👈 where to save generations

Required edits

  • PromptMakers/3.3PromptMaker.py – set INPUT_JSON_FILE to the dataset split you want
  • LLMEvaluators/3.3/GPT4.1.py – set INPUT_JSON_FILE to the prompt file from step ① and OUTPUT_JSONL_FILE to the save path
  • LLMEvaluators/3.3/GPT4.1.py – fill in the API key / base URL section at the top

3 · Post-hoc metric computation

If you evaluate DeepSeek or Llama models with the log-only workflow, run the log-level evaluator:

python LLMEvaluators/LogEvaluator/LogEval3.3.py \
    --input_jsonl_file PATH_TO_YOUR_JSONL   # 👈 also update inside the script

Open LogEval3.3.py and update INPUT_JSONL_FILE (plus any other paths) before running.
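For intuition about what the log-level evaluator computes: ROUGE-1, one of the generation metrics, is unigram-overlap F1 between a generation and its reference. The toy implementation below uses bare whitespace tokenization; real evaluators refine this with stemming and proper tokenization.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a generated text and a reference text."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("the cat sat", "the cat sat on the mat"), 4))  # 0.6667
```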


Tip: add the project root to PYTHONPATH or run every command from the repo root so that all relative imports resolve correctly.


Dataset

  • Name – PᴇʀꜱᴏɴᴀCᴏɴᴠBᴇɴᴄʜ
  • Size – 19,215 posts · 111,239 conversations · 3,878 users
  • Domains – see the table below
  • License – research-only (subject to the Reddit Terms of Service)

Domain       | Style        | Conversational Engagement | Conversation Purpose | User Interactivity
Worldnews    | Formal       | Debate-Driven             | Education            | Low
Science      | Formal       | Debate-Driven             | Education            | Medium
Politics     | Formal       | Debate-Driven             | Education            | Low
Technology   | Formal       | Information Sharing       | Education            | Medium
Gaming       | Casual       | Community-Based           | Entertainment        | High
Life         | Casual       | Community-Based           | Socializing          | Medium
Movies       | Casual       | Opinion-Based             | Entertainment        | Medium
Books        | Casual       | Opinion-Based             | Entertainment        | High
Entrepreneur | Motivational | Supportive                | Advice/Support       | High
Art          | Creative     | Community-Based           | Entertainment        | Medium

Tested Models

Model                | Year | Type
GPT-4.1              | 2025 | Commercial
GPT-4o-mini          | 2024 | Commercial
GPT-4o               | 2024 | Commercial
GPT-4.1-mini         | 2025 | Commercial
Claude-3.5 Sonnet    | 2024 | Commercial
Gemini 2.5 Pro       | 2025 | Commercial
LLaMA-3 80B-Instruct | 2024 | OSS
DeepSeek-R1          | 2025 | OSS
DeepSeek-V3          | 2025 | OSS
Mistral              | 2024 | OSS
o3                   | 2025 | Commercial
o4-mini              | 2025 | Commercial

Results

Each cell reports P-Conv (Ours) / P-NonConv.

Task                     | Metric   | GPT-4.1         | GPT-4o-mini     | Claude-3.5      | LLaMA-3         | DeepSeek-R1
Sentiment Classification | Accuracy | 0.9122 / 0.7862 | 0.6875 / 0.6562 | 0.9109 / 0.8192 | 0.8458 / 0.7305 | 0.8853 / 0.7092
                         | F1       | 0.9481 / 0.8720 | 0.7895 / 0.7640 | 0.9474 / 0.8908 | 0.8401 / 0.7495 | 0.8848 / 0.7362
                         | MCC      | 0.6770 / 0.2266 | 0.2268 / 0.1870 | 0.6721 / 0.3666 | 0.4420 / 0.2333 | 0.6070 / 0.2586
Impact Forecasting       | RMSE     | 310.09 / 350.29 | 310.23 / 351.50 | 282.48 / 344.75 | 319.83 / 350.43 | 300.03 / 353.80
                         | MAE      | 97.52 / 113.46  | 97.64 / 115.80  | 85.39 / 109.27  | 101.25 / 113.18 | 89.59 / 112.52
Next-Text Generation     | ROUGE-1  | 0.2777 / 0.2248 | 0.2491 / 0.2121 | 0.2161 / 0.1645 | 0.2055 / 0.1540 | 0.1786 / 0.1359
                         | ROUGE-L  | 0.2115 / 0.1565 | 0.1906 / 0.1470 | 0.1719 / 0.1130 | 0.1572 / 0.1009 | 0.1395 / 0.0911
                         | METEOR   | 0.2677 / 0.2316 | 0.2120 / 0.2198 | 0.1913 / 0.1636 | 0.1838 / 0.1659 | 0.1649 / 0.1401
                         | BLEU     | 0.0604 / 0.0206 | 0.0330 / 0.0170 | 0.0549 / 0.0123 | 0.0480 / 0.0089 | 0.0423 / 0.0083
                         | SBERT    | 0.4757 / 0.4322 | 0.4381 / 0.3982 | 0.3942 / 0.3512 | 0.3733 / 0.3339 | 0.3699 / 0.3307
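A note on MCC in the sentiment rows: unlike accuracy, it balances all four confusion-matrix cells, which is why it drops far more sharply than accuracy when conversational context is removed (e.g. GPT-4.1: 0.6770 → 0.2266 versus 0.9122 → 0.7862 accuracy). For a binary task it is computed as below; the counts are made-up illustration values.

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0

print(round(mcc(tp=90, tn=30, fp=10, fn=20), 4))  # 0.533
```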

Project Structure

├─ demo/
├─ LLMEvaluators/
├─ PromptMakers/
├─ data/
├─ config/
├─ requirements.txt
├─ Raw_Data_Process.py
└─ README.md

Citation

Thank you for your interest in our work!
@misc{li2025personalizedconversationalbenchmarksimulating,
  title={A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations},
  author={Li Li and Peilin Cai and Ryan A. Rossi and Franck Dernoncourt and Branislav Kveton and Junda Wu and Tong Yu and Linxin Song and Tiankai Yang and Yuehan Qin and Nesreen K. Ahmed and Samyadeep Basu and Subhojyoti Mukherjee and Ruiyi Zhang and Zhengmian Hu and Bo Ni and Yuxiao Zhou and Zichao Wang and Yue Huang and Yu Wang and Xiangliang Zhang and Philip S. Yu and Xiyang Hu and Yue Zhao},
  year={2025},
  eprint={2505.14106},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.14106}
}

Contact

Feel free to open a GitHub issue or email us.