45 changes: 45 additions & 0 deletions eval/chat_benchmarks/arena_hard_auto/BenchBuilder/README.md
@@ -0,0 +1,45 @@
## BenchBuilder
An automatic pipeline for creating high-quality benchmarks. BenchBuilder was used on Chatbot Arena data to curate Arena-Hard-v0.1.

Check out our [paper](https://arxiv.org/abs/2406.11939) for more details.

![BenchBuilder Pipeline](../misc/pipeline_method.png)

BenchBuilder employs a two-stage pipeline.

First, install the BenchBuilder dependencies:
```console
cd BenchBuilder
pip install -r requirements.txt
```

Step 1: annotate the prompts using GPT-3.5-Turbo and filter out prompts that either have a score < 5 or belong to a topic cluster with a mean score < 3. This serves as a cheap first pass to remove low-quality prompts and clusters before further curation.

Step 2: use GPT-4-Turbo to annotate the remaining prompts, then extract prompts with a quality score >= 6 that belong to a topic cluster with a mean quality score >= 6, ensuring only high-quality prompts are selected with minimal false positives.
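
For illustration, the thresholding logic shared by both steps can be sketched as follows (the full implementation is in `filter.py`; the `score` and `cluster` field names are placeholders for this example, not a fixed schema):

```python
from collections import defaultdict
from statistics import mean

def threshold_filter(prompts, prompt_threshold, cluster_threshold):
    """Keep prompts whose own score and whose cluster's mean score clear both thresholds."""
    # Group prompt scores by topic cluster to compute per-cluster means.
    scores_by_cluster = defaultdict(list)
    for p in prompts:
        scores_by_cluster[p["cluster"]].append(p["score"])
    cluster_means = {c: mean(s) for c, s in scores_by_cluster.items()}

    return [
        p for p in prompts
        if p["score"] >= prompt_threshold and cluster_means[p["cluster"]] >= cluster_threshold
    ]

# Step 1 (GPT-3.5-Turbo scores): drop prompts with score < 5 or cluster mean < 3.
# stage1 = threshold_filter(annotated_prompts, prompt_threshold=5, cluster_threshold=3)
# Step 2 (GPT-4-Turbo scores): keep prompts with score >= 6 in clusters with mean >= 6.
# stage2 = threshold_filter(reannotated_prompts, prompt_threshold=6, cluster_threshold=6)
```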

After running BenchBuilder, we stratified-sampled multiple prompts per cluster to create a benchmark. However, you may apply whatever sampling scheme you prefer to the prompts produced by BenchBuilder.
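
As one example of a sampling scheme, here is a minimal sketch of per-cluster stratified sampling (the `cluster` field and `k_per_cluster` parameter are assumptions for illustration):

```python
import random
from collections import defaultdict

def stratified_sample(prompts, k_per_cluster, seed=42):
    """Draw up to k_per_cluster prompts from each topic cluster."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for p in prompts:
        by_cluster[p["cluster"]].append(p)

    sampled = []
    for cluster_prompts in by_cluster.values():
        n = min(k_per_cluster, len(cluster_prompts))
        sampled.extend(rng.sample(cluster_prompts, n))
    return sampled
```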

For Chatbot Arena Category Hard Prompts, which also employed BenchBuilder, we used Llama-3-70B-Instruct as the LLM annotator. Check out our Category Hard Prompt [blogpost](https://lmsys.org/blog/2024-05-17-category-hard/) for more details.

To topic-cluster your dataset:
```console
python topic_clustering.py --conv-file [your json file] --min-topic-size 8
```

To annotate your dataset with key criteria:
```console
python label.py --config config.yaml
```
Make sure to properly configure your `config.yaml` before you begin labeling.
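
For reference, a filled-in `config.yaml` might look like the sketch below; the file paths are placeholders, and the remaining values mirror the defaults shipped in this directory:

```yaml
input_file: data/conversations.json   # prompts to annotate (json)
cache_file: data/label_cache.json     # cached judgments (json)
output_file: data/labels.jsonl        # annotated output (json line)

convert_to_json: True

task_name:
  - criteria_v0.1

model_name: gpt-3.5-turbo-0125
endpoints: null
parallel: 8
temperature: 0.0
max_token: 32
api_type: openai

max_retry: 2
retry_sleep: 10
error_output: $ERROR$
```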

To filter prompts based on scores and cluster thresholds:
```console
python filter.py --conversations_file [your jsonl file] --clusters_file [your json file] --prompt_threshold 6 --cluster_threshold 3
```
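
Each kept prompt is written as one JSON line in the Arena-Hard question format, e.g. (example values taken from the docstring in `filter.py`):

```json
{"question_id": "328c149ed45a41c0b9d6f14659e63599", "category": "arena-hard-v0.1", "cluster": "ABC Sequence Puzzles & Groups", "turns": [{"content": "Use ABC notation to write a melody in the style of a folk tune."}]}
```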

We also apply BenchBuilder to [allenai/WildChat-1M](https://huggingface.co/datasets/allenai/WildChat-1M) to produce 250 high-quality prompts, Wild-Hard-250. We evaluate 10 of the 20 models outlined in the paper on Wild-Hard-250 and on a random sample of 250 prompts from the WildChat dataset, using GPT-4-Turbo as the judge.

| | Wild-Hard-250 | Wild-Chat-Random |
| --- | ---- | ---- |
| Spearman Correlation | 93.6 | 38.2 |
| Kendall Tau | 85.5 | 27.3 |
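
The correlations above presumably measure agreement between the model rankings induced by each prompt set and a reference ranking (e.g. Chatbot Arena, as in the paper). A minimal sketch of computing such numbers with `scipy`, using placeholder rankings:

```python
from scipy.stats import spearmanr, kendalltau

# Hypothetical orderings of the same set of models under two ranking sources.
reference_ranking = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # e.g. Chatbot Arena order
benchmark_ranking = [1, 3, 2, 4, 5, 7, 6, 8, 10, 9]   # order induced by the benchmark

rho, _ = spearmanr(reference_ranking, benchmark_ranking)
tau, _ = kendalltau(reference_ranking, benchmark_ranking)
print(f"Spearman: {rho * 100:.1f}, Kendall tau: {tau * 100:.1f}")
```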
66 changes: 66 additions & 0 deletions eval/chat_benchmarks/arena_hard_auto/BenchBuilder/category.py
@@ -0,0 +1,66 @@
# Tag structure
# - category_tag
#   - criteria_v0.1
#     - specificity
#     - ...
#   - math_v0.1
#     - math
#   - if_v0.1
#     - if
#     - score
import ast
import re


class Category:
    def __init__(self):
        pass

    @staticmethod
    def create_category(name):
        if name == "criteria_v0.1":
            return CategoryHardPrompt()
        raise Exception(f"Category name is incorrect: {name}")

    def post_process(self):
        pass


class CategoryHardPrompt(Category):
    def __init__(self):
        super().__init__()
        self.name_tag = "criteria_v0.1"
        self.pattern = re.compile(r"(\[[1234567](?:\,\s[1234567])*\])")
        self.sys_prompt = "Your task is to evaluate how well the following input prompts can assess the capabilities of advanced AI assistants.\n\nFor the input prompt, please analyze it based on the following 7 criteria.\n1. Specificity: Does the prompt ask for a specific output, such as code, a mathematical solution, a logical simplification, a problem-solving strategy, or a hardware setup recommendation? This specificity allows the AI to demonstrate its ability to understand and generate precise responses.\n2. Domain Knowledge: Does the prompt cover a specific domain, such as programming, mathematics, logic, problem-solving, or hardware setup? Prompts spanning a range of topics test the AI's breadth of knowledge and its ability to apply that knowledge to different domains.\n3. Complexity: Does the prompt vary in complexity, from straightforward tasks to more complex, multi-step problems? This allows evaluators to assess the AI's capability to handle problems of varying difficulty.\n4. Problem-Solving Skills: Does the prompt directly involves the AI to demonstrate active problem-solving skills, such systemically coming up with a solution for a specific setup instead of regurgitating an existing fact? This tests the AI's ability to apply logical reasoning and provide practical solutions.\n5. Creativity: Does the prompt involve a level of creativity in approaching the problem? This criterion tests the AI's ability to provide tailored solutions that take into account the user's specific needs and limitations.\n6. Technical Accuracy: Does the prompt require technical accuracy in the response? This allows evaluators to assess the AI's precision and correctness in technical fields.\n7. Real-world Application: Does the prompt relate to real-world applications, such as setting up a functional system or writing code for a practical use case? This tests the AI's ability to provide practical and actionable information that could be implemented in real-life scenarios.\n\nYou must list the criteria numbers that the prompt satisfies in the format of a Python array. For example, \"[...]\". Do not explain your choice."
        self.tags = {
            1: "specificity",
            2: "domain_knowledge",
            3: "complexity",
            4: "problem_solving",
            5: "creativity",
            6: "technical_accuracy",
            7: "real_world",
        }

    def get_score(self, judgment):
        matches = self.pattern.findall(judgment)
        matches = [m for m in matches if m != ""]
        if len(set(matches)) == 0:
            return ['No Match']
        elif len(set(matches)) == 1:
            try:
                return ast.literal_eval(matches[0])
            except SyntaxError:
                print(matches[0])
                return ['Syntax Error']
        else:
            return ['Multiple Match']

    def pre_process(self, prompt):
        conv = [{"role": "system", "content": self.sys_prompt}]
        conv.append({"role": "user", "content": prompt})
        return conv

    def post_process(self, judgment):
        criteria = self.get_score(judgment=judgment)
        return {name: bool(i in criteria) for i, name in self.tags.items()}
21 changes: 21 additions & 0 deletions eval/chat_benchmarks/arena_hard_auto/BenchBuilder/config.yaml
@@ -0,0 +1,21 @@
# Yaml config file for category classification

input_file: null # json
cache_file: null # json
output_file: null # json line

convert_to_json: True

task_name:
- criteria_v0.1

model_name: gpt-3.5-turbo-0125
endpoints: null
parallel: 8
temperature: 0.0
max_token: 32
api_type: openai

max_retry: 2
retry_sleep: 10
error_output: $ERROR$
26 changes: 26 additions & 0 deletions eval/chat_benchmarks/arena_hard_auto/BenchBuilder/embed.py
@@ -0,0 +1,26 @@
import pandas as pd
import numpy as np
import pickle
import argparse
import torch

from sentence_transformers import SentenceTransformer, util
from tqdm import tqdm


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--file", type=str, required=True)
    args = parser.parse_args()
    print(args)

    transformer = SentenceTransformer("all-MiniLM-L6-v2", device='cuda')

    data = pd.read_json(args.file)
    print(len(data))

    ids = data.question_id
    prompts = data.turns.map(lambda x: x[0]["content"])

    embeddings = transformer.encode(prompts.tolist(), convert_to_tensor=True, batch_size=8192, show_progress_bar=True)
    torch.save(embeddings, 'embeddings.pt')
151 changes: 151 additions & 0 deletions eval/chat_benchmarks/arena_hard_auto/BenchBuilder/filter.py
@@ -0,0 +1,151 @@
"""
Filter prompts based on scores and cluster thresholds. To be run after topic_clustering.py and label.py
"""
import hashlib
import os

import orjson
import json
import argparse
from typing import List, Dict
import numpy as np
import wandb

def load_json(file_path: str) -> List[Dict]:
    with open(file_path, 'rb') as f:
        return orjson.loads(f.read())

def load_jsonl(file_path: str) -> List[Dict]:
    conversations = []
    with open(file_path, 'rb') as f:
        for line in f:
            conversations.append(orjson.loads(line))
    return conversations

def calculate_score(conversation: Dict) -> int:
    criteria = conversation.get('category_tag', {}).get('criteria_v0.1', {})
    return sum(1 for value in criteria.values() if value)

def calculate_cluster_scores(conversations: List[Dict], clusters: List[int]) -> Dict[int, float]:
    cluster_scores = {}
    for conv, cluster in zip(conversations, clusters):
        score = calculate_score(conv)
        if cluster not in cluster_scores:
            cluster_scores[cluster] = []
        cluster_scores[cluster].append(score)

    cluster_to_mean_score = {cluster: np.mean(scores) for cluster, scores in cluster_scores.items()}
    print(f"Cluster to mean score: {cluster_to_mean_score}")
    return cluster_to_mean_score

def filter_prompts(conversations: List[Dict], clusters: List[int], prompt_threshold: int, cluster_threshold: float) -> List[Dict]:
    cluster_scores = calculate_cluster_scores(conversations, clusters)

    filtered_prompts = []
    for conv, cluster in zip(conversations, clusters):
        score = calculate_score(conv)
        if score >= prompt_threshold and cluster_scores[cluster] >= cluster_threshold:
            conv.update({
                "prompt_score": score,
            })
            filtered_prompts.append(conv)

    return filtered_prompts

def to_arena_hard_questions_format(conversations: List[Dict], clusters: List[int], topics_file: str, image_dir: str) -> List[Dict]:
    """
    Convert to a format like this:
    {"question_id":"328c149ed45a41c0b9d6f14659e63599",
    "category":"arena-hard-v0.1",
    "cluster":"ABC Sequence Puzzles & Groups",
    "turns":[{"content":"Use ABC notation to write a melody in the style of a folk tune."}]
    }
    """

    topics_map = load_json(topics_file)
    cluster_number_to_name: Dict[str, str] = {}
    for cluster_number, cluster_obj in topics_map["topic_aspects"]["OpenAI"].items():
        cluster_number_to_name[cluster_number] = cluster_obj[0][0]

    arena_hard_questions = []
    for i, (conv, cluster) in enumerate(zip(conversations, clusters)):
        # Contains image
        if isinstance(conv["conversation_a"][0]["content"], list):
            image_hash = conv["conversation_a"][0]["content"][1][0]
            image_path = os.path.join(image_dir, f"{image_hash}.png")
            is_image_valid = os.path.exists(image_path)
            if not is_image_valid:
                print(f"Image not found: {image_path}, not included in benchmark.")
                continue

        turns_list = []
        turns_list.append({"content": conv["conversation_a"][0]["content"]})

        arena_hard_questions.append({
            "question_id": f"{i}",
            "category": "arena-hard-v0.1",
            "cluster": cluster_number_to_name[str(cluster)],
            "turns": turns_list
        })

    return arena_hard_questions

def to_wandb_table(conversations: List[Dict], image_dir: str) -> wandb.Table:
    data = []
    columns = ["question", "image", "prompt_score"]
    for conv in conversations:
        # conv["conversation_a"][0] is the first turn of the conversation
        # conv["conversation_a"][0]["content"][1][0] is indexing to the first index of the images
        if isinstance(conv["conversation_a"][0]["content"], list):
            question = conv["conversation_a"][0]["content"][0]

            # Take the first image
            image_hash = conv["conversation_a"][0]["content"][1][0]
            image_path = os.path.join(image_dir, f"{image_hash}.png")
            if not os.path.exists(image_path):
                print(f"Image not found: {image_path}, not included in WANDB.")
                continue
            wandb_image = wandb.Image(image_path)
            data.append([question, wandb_image, conv["prompt_score"]])
        elif isinstance(conv["conversation_a"][0]["content"], str):
            question = conv["conversation_a"][0]["content"]
            # Text-only prompts have no image; keep the row width consistent with the columns.
            data.append([question, None, conv["prompt_score"]])

    return wandb.Table(data=data, columns=columns)

def main():
    parser = argparse.ArgumentParser(description='Filter prompts based on scores and cluster thresholds.')
    parser.add_argument('--conversations_file', type=str, help='Path to the JSONL file containing conversations')
    parser.add_argument('--clusters_file', type=str, help='Path to the JSON file containing cluster assignments')
    parser.add_argument("--image_dir", type=str, help="Path to the directory containing images")
    parser.add_argument('--prompt_threshold', type=int, default=5, help='Minimum score threshold for individual prompts')
    parser.add_argument('--cluster_threshold', type=int, default=3, help='Minimum average score threshold for clusters')
    parser.add_argument('--output_file', type=str, default='filtered_prompts.json', help='Path to save the filtered prompts')
    parser.add_argument('--wandb_project', type=str, default='arena-hard-auto', help='Wandb project name')
    parser.add_argument("--topics_file", type=str, default="topics.json", help="Path to the file containing topic cluster numbers to names mapping")

    args = parser.parse_args()

    if args.wandb_project:
        wandb.init(project=args.wandb_project)

    conversations = load_jsonl(args.conversations_file)
    clusters = load_json(args.clusters_file)

    filtered_prompts = filter_prompts(conversations, clusters, args.prompt_threshold, args.cluster_threshold)

    arena_hard_questions = to_arena_hard_questions_format(filtered_prompts, clusters, args.topics_file, args.image_dir)

    with open(args.output_file, "w") as f:
        for question in arena_hard_questions:
            f.write(json.dumps(question) + "\n")

    print(f"Filtered {len(filtered_prompts)} prompts out of {len(conversations)} total.")
    print(f"Results saved to {args.output_file}")

    if args.wandb_project:
        wandb.log({"filtered_prompts": to_wandb_table(filtered_prompts, args.image_dir)})

if __name__ == "__main__":
    main()