45 changes: 45 additions & 0 deletions eval/chat_benchmarks/arena_hard_auto/BenchBuilder/README.md
@@ -0,0 +1,45 @@
## BenchBuilder
An automatic pipeline for creating high-quality benchmarks. BenchBuilder was used on Chatbot Arena data to curate Arena-Hard-v0.1.

Check out our [paper](https://arxiv.org/abs/2406.11939) for more details.

![BenchBuilder Pipeline](../misc/pipeline_method.png)

BenchBuilder employs a two-stage pipeline.

First, install the BenchBuilder dependencies:
```console
cd BenchBuilder
pip install -r requirements.txt
```

Step 1: annotate the prompts using GPT-3.5-Turbo and filter out prompts that either have a score < 5 or belong to a topic cluster with a mean score < 3. This serves as a cheap first pass to remove low-quality prompts and clusters before further curation.

Step 2: use GPT-4-Turbo to annotate the remaining prompts, then extract prompts with a quality score >= 6 that belong to a topic cluster with a mean quality score >= 6, ensuring only high-quality prompts are selected with minimal false positives.
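
For illustration, the thresholding logic shared by both steps can be sketched as follows (the full implementation is in `filter.py`; the `score` and `cluster` field names are placeholders for this example, not a fixed schema):

```python
from collections import defaultdict
from statistics import mean

def threshold_filter(prompts, prompt_threshold, cluster_threshold):
    """Keep prompts whose own score and whose cluster's mean score clear both thresholds."""
    # Group prompt scores by topic cluster to compute per-cluster means.
    scores_by_cluster = defaultdict(list)
    for p in prompts:
        scores_by_cluster[p["cluster"]].append(p["score"])
    cluster_means = {c: mean(s) for c, s in scores_by_cluster.items()}

    return [
        p for p in prompts
        if p["score"] >= prompt_threshold and cluster_means[p["cluster"]] >= cluster_threshold
    ]

# Step 1 (GPT-3.5-Turbo scores): drop prompts with score < 5 or cluster mean < 3.
# stage1 = threshold_filter(annotated_prompts, prompt_threshold=5, cluster_threshold=3)
# Step 2 (GPT-4-Turbo scores): keep prompts with score >= 6 in clusters with mean >= 6.
# stage2 = threshold_filter(reannotated_prompts, prompt_threshold=6, cluster_threshold=6)
```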

After running BenchBuilder, we stratified-sampled multiple prompts per cluster to create a benchmark. However, you may apply whatever sampling scheme you prefer to the prompts produced by BenchBuilder.
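
As one example of a sampling scheme, here is a minimal sketch of per-cluster stratified sampling (the `cluster` field and `k_per_cluster` parameter are assumptions for illustration):

```python
import random
from collections import defaultdict

def stratified_sample(prompts, k_per_cluster, seed=42):
    """Draw up to k_per_cluster prompts from each topic cluster."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for p in prompts:
        by_cluster[p["cluster"]].append(p)

    sampled = []
    for cluster_prompts in by_cluster.values():
        n = min(k_per_cluster, len(cluster_prompts))
        sampled.extend(rng.sample(cluster_prompts, n))
    return sampled
```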

For Chatbot Arena Category Hard Prompts, which also employed BenchBuilder, we used Llama-3-70B-Instruct as the LLM annotator. Check out our Category Hard Prompt [blogpost](https://lmsys.org/blog/2024-05-17-category-hard/) for more details.

To topic-cluster your dataset:
```console
python topic_clustering.py --conv-file [your json file] --min-topic-size 8
```

To annotate your dataset with key criteria:
```console
python label.py --config config.yaml
```
Make sure to properly configure your `config.yaml` before you begin labeling.
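
For reference, a filled-in `config.yaml` might look like the sketch below; the file paths are placeholders, and the remaining values mirror the defaults shipped in this directory:

```yaml
input_file: data/conversations.json   # prompts to annotate (json)
cache_file: data/label_cache.json     # cached judgments (json)
output_file: data/labels.jsonl        # annotated output (json line)

convert_to_json: True

task_name:
  - criteria_v0.1

model_name: gpt-3.5-turbo-0125
endpoints: null
parallel: 8
temperature: 0.0
max_token: 32
api_type: openai

max_retry: 2
retry_sleep: 10
error_output: $ERROR$
```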

To filter prompts based on scores and cluster thresholds:
```console
python filter.py --conversations_file [your jsonl file] --clusters_file [your json file] --prompt_threshold 6 --cluster_threshold 3
```
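
Each kept prompt is written as one JSON line in the Arena-Hard question format, e.g. (example values taken from the docstring in `filter.py`):

```json
{"question_id": "328c149ed45a41c0b9d6f14659e63599", "category": "arena-hard-v0.1", "cluster": "ABC Sequence Puzzles & Groups", "turns": [{"content": "Use ABC notation to write a melody in the style of a folk tune."}]}
```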

We also apply BenchBuilder to [allenai/WildChat-1M](https://huggingface.co/datasets/allenai/WildChat-1M) to produce 250 high-quality prompts, Wild-Hard-250. We evaluate 10 of the 20 models outlined in the paper on Wild-Hard-250 and on a random sample of 250 prompts from the WildChat dataset, using GPT-4-Turbo as the judge.

| | Wild-Hard-250 | Wild-Chat-Random |
| --- | ---- | ---- |
| Spearman Correlation | 93.6 | 38.2 |
| Kendall Tau | 85.5 | 27.3 |
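
The correlations above presumably measure agreement between the model rankings induced by each prompt set and a reference ranking (e.g. Chatbot Arena, as in the paper). A minimal sketch of computing such numbers with `scipy`, using placeholder rankings:

```python
from scipy.stats import spearmanr, kendalltau

# Hypothetical orderings of the same set of models under two ranking sources.
reference_ranking = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # e.g. Chatbot Arena order
benchmark_ranking = [1, 3, 2, 4, 5, 7, 6, 8, 10, 9]   # order induced by the benchmark

rho, _ = spearmanr(reference_ranking, benchmark_ranking)
tau, _ = kendalltau(reference_ranking, benchmark_ranking)
print(f"Spearman: {rho * 100:.1f}, Kendall tau: {tau * 100:.1f}")
```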
66 changes: 66 additions & 0 deletions eval/chat_benchmarks/arena_hard_auto/BenchBuilder/category.py
@@ -0,0 +1,66 @@
# Tag structure
# - category_tag
#   - criteria_v0.1
#     - specificity
#     - ...
#   - math_v0.1
#     - math
#   - if_v0.1
#     - if
#     - score
import ast
import re


class Category:
    def __init__(self):
        pass

    @staticmethod
    def create_category(name):
        if name == "criteria_v0.1":
            return CategoryHardPrompt()
        raise Exception(f"Category name is incorrect: {name}")

    def post_process(self):
        pass


class CategoryHardPrompt(Category):
    def __init__(self):
        super().__init__()
        self.name_tag = "criteria_v0.1"
        self.pattern = re.compile(r"(\[[1234567](?:\,\s[1234567])*\])")
        self.sys_prompt = "Your task is to evaluate how well the following input prompts can assess the capabilities of advanced AI assistants.\n\nFor the input prompt, please analyze it based on the following 7 criteria.\n1. Specificity: Does the prompt ask for a specific output, such as code, a mathematical solution, a logical simplification, a problem-solving strategy, or a hardware setup recommendation? This specificity allows the AI to demonstrate its ability to understand and generate precise responses.\n2. Domain Knowledge: Does the prompt cover a specific domain, such as programming, mathematics, logic, problem-solving, or hardware setup? Prompts spanning a range of topics test the AI's breadth of knowledge and its ability to apply that knowledge to different domains.\n3. Complexity: Does the prompt vary in complexity, from straightforward tasks to more complex, multi-step problems? This allows evaluators to assess the AI's capability to handle problems of varying difficulty.\n4. Problem-Solving Skills: Does the prompt directly involves the AI to demonstrate active problem-solving skills, such systemically coming up with a solution for a specific setup instead of regurgitating an existing fact? This tests the AI's ability to apply logical reasoning and provide practical solutions.\n5. Creativity: Does the prompt involve a level of creativity in approaching the problem? This criterion tests the AI's ability to provide tailored solutions that take into account the user's specific needs and limitations.\n6. Technical Accuracy: Does the prompt require technical accuracy in the response? This allows evaluators to assess the AI's precision and correctness in technical fields.\n7. Real-world Application: Does the prompt relate to real-world applications, such as setting up a functional system or writing code for a practical use case? This tests the AI's ability to provide practical and actionable information that could be implemented in real-life scenarios.\n\nYou must list the criteria numbers that the prompt satisfies in the format of a Python array. For example, \"[...]\". Do not explain your choice."
        self.tags = {
            1: "specificity",
            2: "domain_knowledge",
            3: "complexity",
            4: "problem_solving",
            5: "creativity",
            6: "technical_accuracy",
            7: "real_world",
        }

    def get_score(self, judgment):
        matches = self.pattern.findall(judgment)
        matches = [m for m in matches if m != ""]
        if len(set(matches)) == 0:
            return ['No Match']
        elif len(set(matches)) == 1:
            try:
                return ast.literal_eval(matches[0])
            except SyntaxError:
                print(matches[0])
                return ['Syntax Error']
        else:
            return ['Multiple Match']

    def pre_process(self, prompt):
        conv = [{"role": "system", "content": self.sys_prompt}]
        conv.append({"role": "user", "content": prompt})
        return conv

    def post_process(self, judgment):
        criteria = self.get_score(judgment=judgment)
        return {name: bool(i in criteria) for i, name in self.tags.items()}
21 changes: 21 additions & 0 deletions eval/chat_benchmarks/arena_hard_auto/BenchBuilder/config.yaml
@@ -0,0 +1,21 @@
# Yaml config file for category classification

input_file: null # json
cache_file: null # json
output_file: null # json line

convert_to_json: True

task_name:
- criteria_v0.1

model_name: gpt-3.5-turbo-0125
endpoints: null
parallel: 8
temperature: 0.0
max_token: 32
api_type: openai

max_retry: 2
retry_sleep: 10
error_output: $ERROR$
26 changes: 26 additions & 0 deletions eval/chat_benchmarks/arena_hard_auto/BenchBuilder/embed.py
@@ -0,0 +1,26 @@
import pandas as pd
import numpy as np
import pickle
import argparse
import torch

from sentence_transformers import SentenceTransformer, util
from tqdm import tqdm


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--file", type=str, required=True)
    args = parser.parse_args()
    print(args)

    transformer = SentenceTransformer("all-MiniLM-L6-v2", device='cuda')

    data = pd.read_json(args.file)
    print(len(data))

    ids = data.question_id
    prompts = data.turns.map(lambda x: x[0]["content"])

    embeddings = transformer.encode(prompts.tolist(), convert_to_tensor=True, batch_size=8192, show_progress_bar=True)
    torch.save(embeddings, 'embeddings.pt')
151 changes: 151 additions & 0 deletions eval/chat_benchmarks/arena_hard_auto/BenchBuilder/filter.py
@@ -0,0 +1,151 @@
"""
Filter prompts based on scores and cluster thresholds. To be run after topic_clustering.py and label.py
"""
import hashlib
import os

import orjson
import json
import argparse
from typing import List, Dict
import numpy as np
import wandb

def load_json(file_path: str) -> List[Dict]:
    with open(file_path, 'rb') as f:
        return orjson.loads(f.read())

def load_jsonl(file_path: str) -> List[Dict]:
    conversations = []
    with open(file_path, 'rb') as f:
        for line in f:
            conversations.append(orjson.loads(line))
    return conversations

def calculate_score(conversation: Dict) -> int:
    criteria = conversation.get('category_tag', {}).get('criteria_v0.1', {})
    return sum(1 for value in criteria.values() if value)

def calculate_cluster_scores(conversations: List[Dict], clusters: List[int]) -> Dict[int, float]:
    cluster_scores = {}
    for conv, cluster in zip(conversations, clusters):
        score = calculate_score(conv)
        if cluster not in cluster_scores:
            cluster_scores[cluster] = []
        cluster_scores[cluster].append(score)

    cluster_to_mean_score = {cluster: np.mean(scores) for cluster, scores in cluster_scores.items()}
    print(f"Cluster to mean score: {cluster_to_mean_score}")
    return cluster_to_mean_score

def filter_prompts(conversations: List[Dict], clusters: List[int], prompt_threshold: int, cluster_threshold: float) -> List[Dict]:
    cluster_scores = calculate_cluster_scores(conversations, clusters)

    filtered_prompts = []
    for conv, cluster in zip(conversations, clusters):
        score = calculate_score(conv)
        if score >= prompt_threshold and cluster_scores[cluster] >= cluster_threshold:
            conv.update({
                "prompt_score": score,
            })
            filtered_prompts.append(conv)

    return filtered_prompts

def to_arena_hard_questions_format(conversations: List[Dict], clusters: List[int], topics_file: str, image_dir: str) -> List[Dict]:
    """
    Convert to a format like this:
    {"question_id":"328c149ed45a41c0b9d6f14659e63599",
    "category":"arena-hard-v0.1",
    "cluster":"ABC Sequence Puzzles & Groups",
    "turns":[{"content":"Use ABC notation to write a melody in the style of a folk tune."}]
    }
    """

    topics_map = load_json(topics_file)
    cluster_number_to_name: Dict[str, str] = {}
    for cluster_number, cluster_obj in topics_map["topic_aspects"]["OpenAI"].items():
        cluster_number_to_name[cluster_number] = cluster_obj[0][0]

    arena_hard_questions = []
    for i, (conv, cluster) in enumerate(zip(conversations, clusters)):
        # Contains image
        if isinstance(conv["conversation_a"][0]["content"], list):
            image_hash = conv["conversation_a"][0]["content"][1][0]
            image_path = os.path.join(image_dir, f"{image_hash}.png")
            is_image_valid = os.path.exists(image_path)
            if not is_image_valid:
                print(f"Image not found: {image_path}, not included in benchmark.")
                continue

        turns_list = []
        turns_list.append({"content": conv["conversation_a"][0]["content"]})

        arena_hard_questions.append({
            "question_id": f"{i}",
            "category": "arena-hard-v0.1",
            "cluster": cluster_number_to_name[str(cluster)],
            "turns": turns_list
        })

    return arena_hard_questions

def to_wandb_table(conversations: List[Dict], image_dir: str) -> wandb.Table:
    data = []
    columns = ["question", "image", "prompt_score"]
    for conv in conversations:
        # conv["conversation_a"][0] is the first turn of the conversation
        # conv["conversation_a"][0]["content"][1][0] is indexing to the first index of the images
        if isinstance(conv["conversation_a"][0]["content"], list):
            question = conv["conversation_a"][0]["content"][0]

            # Take the first image
            image_hash = conv["conversation_a"][0]["content"][1][0]
            image_path = os.path.join(image_dir, f"{image_hash}.png")
            if not os.path.exists(image_path):
                print(f"Image not found: {image_path}, not included in WANDB.")
                continue
            wandb_image = wandb.Image(image_path)
            data.append([question, wandb_image, conv["prompt_score"]])
        elif isinstance(conv["conversation_a"][0]["content"], str):
            question = conv["conversation_a"][0]["content"]
            # Text-only prompts have no image; keep the row width consistent with the columns.
            data.append([question, None, conv["prompt_score"]])

    return wandb.Table(data=data, columns=columns)

def main():
    parser = argparse.ArgumentParser(description='Filter prompts based on scores and cluster thresholds.')
    parser.add_argument('--conversations_file', type=str, help='Path to the JSONL file containing conversations')
    parser.add_argument('--clusters_file', type=str, help='Path to the JSON file containing cluster assignments')
    parser.add_argument("--image_dir", type=str, help="Path to the directory containing images")
    parser.add_argument('--prompt_threshold', type=int, default=5, help='Minimum score threshold for individual prompts')
    parser.add_argument('--cluster_threshold', type=int, default=3, help='Minimum average score threshold for clusters')
    parser.add_argument('--output_file', type=str, default='filtered_prompts.json', help='Path to save the filtered prompts')
    parser.add_argument('--wandb_project', type=str, default='arena-hard-auto', help='Wandb project name')
    parser.add_argument("--topics_file", type=str, default="topics.json", help="Path to the file containing topic cluster numbers to names mapping")

    args = parser.parse_args()

    if args.wandb_project:
        wandb.init(project=args.wandb_project)

    conversations = load_jsonl(args.conversations_file)
    clusters = load_json(args.clusters_file)

    filtered_prompts = filter_prompts(conversations, clusters, args.prompt_threshold, args.cluster_threshold)

    arena_hard_questions = to_arena_hard_questions_format(filtered_prompts, clusters, args.topics_file, args.image_dir)

    with open(args.output_file, "w") as f:
        for question in arena_hard_questions:
            f.write(json.dumps(question) + "\n")

    print(f"Filtered {len(filtered_prompts)} prompts out of {len(conversations)} total.")
    print(f"Results saved to {args.output_file}")

    if args.wandb_project:
        wandb.log({"filtered_prompts": to_wandb_table(filtered_prompts, args.image_dir)})

if __name__ == "__main__":
    main()