
Commit 0563daa

[feat] add graphwalks (#3377)
1 parent 5b144bb commit 0563daa


6 files changed: +257 -0 lines changed


lm_eval/tasks/README.md

Lines changed: 1 addition & 0 deletions
@@ -73,6 +73,7 @@ provided to the individual README.md files for each subfolder.
  | [global_piqa](global_piqa/README.md) | Multilingual (non-parallel) commonsense reasoning benchmark covering 116 language varieties with culturally-specific examples from 65 countries | Multiple (116 languages) **Human authored** |
  | [glue](glue/README.md) | General Language Understanding Evaluation benchmark to test broad language abilities. | English |
  | [gpqa](gpqa/README.md) | Tasks designed for general public question answering and knowledge verification. | English |
+ | [graphwalks](graphwalks/README.md) | A multi-hop reasoning long-context benchmark | English |
  | [gsm8k](gsm8k/README.md) | A benchmark of grade school math problems aimed at evaluating reasoning capabilities. | English |
  | [groundcocoa](groundcocoa/README.md) | A benchmark evaluating the conditional and compositional reasoning of language models using a grounding task. | English |
  | [haerae](haerae/README.md) | Tasks focused on assessing detailed factual and historical knowledge. | Korean |

lm_eval/tasks/graphwalks/README.md

Lines changed: 34 additions & 0 deletions

@@ -0,0 +1,34 @@
# GraphWalks: a multi-hop reasoning long-context benchmark

In GraphWalks, the model is given a graph represented by its edge list inside a long prompt and is asked to perform an operation on it (for example, a breadth-first search from a given node or finding a node's parents), returning the resulting list of nodes.

### Dataset

HuggingFace: https://huggingface.co/datasets/openai/graphwalks
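For a quick look at what the task consumes, here is a minimal sketch; the repository, file, and field names are taken from the configs in this commit, and everything else is illustrative:

```python
import datasets

# Same call that utils.load_dataset below makes for graphwalks_128k
ds = datasets.load_dataset(
    "openai/graphwalks",
    data_files="graphwalks_128k_and_shorter.parquet",
    split="train",
)

row = ds[0]
print(row["prompt"][:200])   # long edge-list prompt handed to the model
print(row["answer_nodes"])   # gold list of node IDs
```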
### Groups and Tasks

#### Groups

* `graphwalks`: Run both `graphwalks_128k` and `graphwalks_1M`

#### Tasks

* `graphwalks_128k`: Prompts up to 128k tokens of context
* `graphwalks_1M`: Prompts between 256k and 1M tokens of context
> [!NOTE]
> `max_gen_toks` is set to `16384`, but non-reasoning models typically do not need this many tokens.
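A rough sketch of running the tasks through the harness's Python entry point; the model name is a placeholder rather than a recommendation, and these prompts need a genuinely long context window:

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen2.5-7B-Instruct",  # placeholder long-context model
    tasks=["graphwalks_128k"],  # or ["graphwalks"] to run both subtasks
    batch_size=1,
)
print(results["results"]["graphwalks_128k"])
```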
### Checklist

For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?

If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
Lines changed: 11 additions & 0 deletions

@@ -0,0 +1,11 @@
group: graphwalks
task:
  - graphwalks_128k
  - graphwalks_1M
aggregate_metric_list:
  - metric: f1
    weight_by_size: true
  - metric: flexible_f1
    weight_by_size: true
metadata:
  version: 0.0
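For intuition, `weight_by_size: true` should make the group score a size-weighted mean of the per-task scores rather than a simple average (my reading of the harness's aggregation); a toy illustration with made-up numbers:

```python
def size_weighted_mean(scores, sizes):
    """Weighted mean of per-task scores, weighted by document count."""
    return sum(s * n for s, n in zip(scores, sizes)) / sum(sizes)

# Made-up numbers: f1 of 0.60 on the 128k split, 0.40 on the 1M split
print(size_weighted_mean([0.60, 0.40], [1000, 200]))  # ~0.567
```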
Lines changed: 25 additions & 0 deletions

@@ -0,0 +1,25 @@
task: graphwalks_128k
custom_dataset: !function utils.load_dataset
dataset_kwargs:
  data_file: graphwalks_128k_and_shorter.parquet
output_type: generate_until
test_split: train
doc_to_text: "{{prompt}}"
doc_to_target: "{{answer_nodes}}"
process_results: !function utils.process_results
target_delimiter: ""
generation_kwargs:
  until:
    - "</s>"
    - "<|im_end|>"
    - "<|endoftext|>"
  max_gen_toks: 16384
metric_list:
  - metric: f1
    aggregation: mean
    higher_is_better: true
  - metric: flexible_f1
    aggregation: mean
    higher_is_better: true
metadata:
  version: 0.0
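Read together, this config renders and scores each document roughly as follows; this is a paraphrase, not the harness's actual code, and `model_generate` is a hypothetical stand-in for the model call:

```python
from lm_eval.tasks.graphwalks import utils  # the utils.py added in this commit


def score_document(doc, model_generate):
    # doc_to_text / doc_to_target pull fields straight from the dataset row
    prompt = doc["prompt"]
    # Generation stops at any configured stop string or after 16384 new tokens
    output = model_generate(
        prompt,
        stop=["</s>", "<|im_end|>", "<|endoftext|>"],
        max_new_tokens=16384,
    )
    # Scoring is delegated to utils.process_results rather than exact match
    return utils.process_results(doc, [output])
```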
Lines changed: 25 additions & 0 deletions

@@ -0,0 +1,25 @@
task: graphwalks_1M
custom_dataset: !function utils.load_dataset
dataset_kwargs:
  data_file: graphwalks_256k_to_1mil.parquet
output_type: generate_until
test_split: train
doc_to_text: "{{prompt}}"
doc_to_target: "{{answer_nodes}}"
process_results: !function utils.process_results
target_delimiter: ""
generation_kwargs:
  until:
    - "</s>"
    - "<|im_end|>"
    - "<|endoftext|>"
  max_gen_toks: 16384
metric_list:
  - metric: f1
    aggregation: mean
    higher_is_better: true
  - metric: flexible_f1
    aggregation: mean
    higher_is_better: true
metadata:
  version: 0.0

lm_eval/tasks/graphwalks/utils.py

Lines changed: 161 additions & 0 deletions
@@ -0,0 +1,161 @@
1+
import re
2+
from typing import List, Tuple
3+
4+
import datasets
5+
6+
7+
def load_dataset(**kwargs):
8+
"""
9+
Load the graphwalks dataset with specific data file.
10+
11+
Args:
12+
kwargs: Must contain 'data_file' key specifying which parquet file to load
13+
14+
Returns:
15+
Dictionary with 'train' split containing the dataset
16+
"""
17+
data_file = kwargs.get("data_file")
18+
if not data_file:
19+
raise ValueError("data_file must be specified in dataset_kwargs")
20+
21+
dataset = datasets.load_dataset(
22+
"openai/graphwalks", data_files=data_file, split="train"
23+
)
24+
return {"train": dataset}
25+
26+
27+
def extract_answer_list(response: str) -> Tuple[List[str], bool]:
28+
"""
29+
Extract the answer list from a model response.
30+
31+
Args:
32+
response: The model's generated response
33+
34+
Returns:
35+
Tuple of (list of nodes, is_error)
36+
- list of nodes: extracted node IDs
37+
- is_error: True if parsing failed, False otherwise
38+
"""
39+
# Get the very last line of the response (strip trailing newlines first)
40+
line = response.rstrip("\n").split("\n")[-1]
41+
42+
# Check if formatted correctly
43+
if "Final Answer:" not in line:
44+
return [], True
45+
46+
# Extract the list part using regex with capturing group
47+
match = re.search(r"Final Answer:\s*\[(.*)\]", line)
48+
if match:
49+
# Extract content between brackets using group(1)
50+
bracket_content = match.group(1)
51+
# Handle empty list case
52+
if not bracket_content.strip():
53+
return [], False
54+
# Split by comma and clean up whitespace and quotes
55+
result_list = [
56+
item.strip().strip("'\"")
57+
for item in bracket_content.split(",")
58+
if item.strip()
59+
]
60+
return result_list, False
61+
else:
62+
return [], True
63+
64+
65+
def extract_answer_list_flexible(response: str) -> Tuple[List[str], bool]:
66+
"""
67+
Extract the answer list from a model response (flexible version).
68+
Searches backwards through all lines to find "Final Answer:" pattern.
69+
More lenient than extract_answer_list which only checks the last line.
70+
71+
Args:
72+
response: The model's generated response
73+
74+
Returns:
75+
Tuple of (list of nodes, is_error)
76+
- list of nodes: extracted node IDs
77+
- is_error: True if parsing failed, False otherwise
78+
"""
79+
lines = response.rstrip("\n").split("\n")
80+
for line in reversed(lines):
81+
match = re.search(r"Final Answer:\s*\[(.*)\]", line)
82+
if match:
83+
# Extract content between brackets using group(1)
84+
bracket_content = match.group(1)
85+
# Handle empty list case
86+
if not bracket_content.strip():
87+
return [], False
88+
# Split by comma and clean up whitespace and quotes
89+
result_list = [
90+
item.strip().strip("'\"")
91+
for item in bracket_content.split(",")
92+
if item.strip()
93+
]
94+
return result_list, False
95+
96+
# No "Final Answer:" found anywhere
97+
return [], True
98+
99+
100+
def process_results(doc, results):
101+
"""
102+
Process results and compute set-based F1 scores.
103+
Returns both strict F1 (last line only) and flexible F1 (search all lines).
104+
105+
Args:
106+
doc: Document containing ground truth answer_nodes
107+
results: List containing model generation
108+
109+
Returns:
110+
Dictionary with f1 and flexible_f1 scores
111+
"""
112+
# Extract model response (first element of results)
113+
response = results[0]
114+
115+
# Get ground truth nodes
116+
gold_nodes = doc["answer_nodes"]
117+
118+
# Parse the response using strict extraction
119+
predicted_nodes_strict, _ = extract_answer_list(response)
120+
sampled_set_strict = set(predicted_nodes_strict)
121+
truth_set = set(gold_nodes)
122+
123+
# Calculate strict F1
124+
n_overlap_strict = len(sampled_set_strict & truth_set)
125+
n_sampled_strict = len(sampled_set_strict)
126+
n_golden = len(truth_set)
127+
128+
recall_strict = n_overlap_strict / n_golden if n_golden > 0 else 0.0
129+
precision_strict = (
130+
n_overlap_strict / n_sampled_strict if n_sampled_strict > 0 else 0.0
131+
)
132+
f1_strict = (
133+
2 * (recall_strict * precision_strict) / (recall_strict + precision_strict)
134+
if (recall_strict + precision_strict) > 0
135+
else 0.0
136+
)
137+
138+
# Parse the response using flexible extraction
139+
predicted_nodes_flexible, _ = extract_answer_list_flexible(response)
140+
sampled_set_flexible = set(predicted_nodes_flexible)
141+
142+
# Calculate flexible F1
143+
n_overlap_flexible = len(sampled_set_flexible & truth_set)
144+
n_sampled_flexible = len(sampled_set_flexible)
145+
146+
recall_flexible = n_overlap_flexible / n_golden if n_golden > 0 else 0.0
147+
precision_flexible = (
148+
n_overlap_flexible / n_sampled_flexible if n_sampled_flexible > 0 else 0.0
149+
)
150+
f1_flexible = (
151+
2
152+
* (recall_flexible * precision_flexible)
153+
/ (recall_flexible + precision_flexible)
154+
if (recall_flexible + precision_flexible) > 0
155+
else 0.0
156+
)
157+
158+
return {
159+
"f1": f1_strict,
160+
"flexible_f1": f1_flexible,
161+
}
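A small self-contained check of the parsing and scoring above; the response text and node names are fabricated, and the import path assumes the task directory is importable as a module:

```python
from lm_eval.tasks.graphwalks.utils import extract_answer_list, process_results

response = "Let me trace the graph step by step...\nFinal Answer: [node_a, node_b, node_d]"
doc = {"answer_nodes": ["node_a", "node_b", "node_c"]}

print(extract_answer_list(response))     # (['node_a', 'node_b', 'node_d'], False)
print(process_results(doc, [response]))  # f1 and flexible_f1 both ~0.667 here
```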
