Conversation

@jaideepr97 (Member) commented May 2, 2025

Adds the ability to run the RULER benchmark via lm-eval against an arbitrary model using a local OpenAI endpoint.
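For context, driving RULER through lm-eval against an OpenAI-compatible endpoint looks roughly like the sketch below; the local-completions backend, endpoint URL, model name, and task names are illustrative assumptions, not the exact invocation in this PR.

from lm_eval import simple_evaluate

# Point lm-eval-harness at an OpenAI-compatible completions endpoint
# (e.g. a locally served model) and run a subset of RULER tasks.
results = simple_evaluate(
    model="local-completions",
    model_args={
        "base_url": "http://localhost:8000/v1/completions",  # assumed local server
        "model": "my-model",      # placeholder model name
        "tokenizer": "my-model",  # placeholder tokenizer
    },
    tasks=["niah_single_1", "niah_single_2"],  # illustrative RULER task names
)
print(results["results"])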

@jaideepr97 jaideepr97 closed this May 2, 2025
@jaideepr97 jaideepr97 reopened this May 2, 2025
@jaideepr97 jaideepr97 marked this pull request as draft May 2, 2025 18:38
@jaideepr97 jaideepr97 changed the title test run feat: add RULER long context evaluation May 2, 2025
@jaideepr97 jaideepr97 changed the title feat: add RULER long context evaluation [WIP] feat: add RULER long context evaluation May 2, 2025
@jaideepr97 jaideepr97 force-pushed the add-ruler branch 2 times, most recently from f1ae970 to 006b4cc Compare May 6, 2025 15:32
@jaideepr97 jaideepr97 force-pushed the add-ruler branch 4 times, most recently from 1efb8f2 to a8e2f58 Compare May 6, 2025 15:46
@jaideepr97 jaideepr97 marked this pull request as ready for review May 6, 2025 15:46
@jaideepr97 jaideepr97 changed the title [WIP] feat: add RULER long context evaluation feat: add RULER long context evaluation May 6, 2025
@RobotSail (Member) left a comment:

Thank you for the PR @jaideepr97! I left some feedback, but overall this looks like it's in pretty good shape.

Comment on lines 42 to 45
vllm_config: Optional[Dict[str, Any]] = None,
hf_config: Optional[Dict[str, Any]] = None,
Member:

I don't think you need these since you're only calling this through a local OpenAI endpoint

model_path: Optional[str] = None,
output_file: Optional[str] = None,
tasks: list[str] = RULER_TASKS,
num_gpus: Optional[int] = None,
Member:

You can drop this too

eval_config: Optional[Dict[str, Any]] = None,
vllm_config: Optional[Dict[str, Any]] = None,
hf_config: Optional[Dict[str, Any]] = None,
openai_config: Optional[Dict[str, Any]] = None,
Member:

I see that you've declared this as an argument and are reading it further down, but I don't see it actually being passed into lm-eval-harness. You'll want to make sure we're doing this by following how I did it here:

final_openai_config = {**self.openai_config, **(openai_config or {})}
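i.e., something along these lines (a sketch; forwarding the merged dict as model_args and the local-completions backend are assumptions, not the code in this PR):

# Per-call overrides win over the instance-level defaults.
final_openai_config = {**self.openai_config, **(openai_config or {})}

# Forward the merged settings so lm-eval-harness actually receives them.
lm_eval_results = simple_evaluate(
    model="local-completions",
    model_args=final_openai_config,
    tasks=tasks,
)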

Member (Author):

Got rid of these for now, since they weren't being used.


self.api_endpoint = api_endpoint or None
self.num_gpus = num_gpus
self.max_length = max_length or 4096
Member:

4096 here seems arbitrary; I would move it up to be a top-level constant and include a comment explaining why you're picking this number as a default (it's the lm-eval-harness default, I'm pretty sure).
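For example (a sketch; the constant name and the surrounding class are illustrative):

from typing import Optional

# Default max context length when the caller does not pass one.
# 4096 mirrors the lm-eval-harness default rather than any model-specific limit.
DEFAULT_MAX_LENGTH = 4096

class RulerEvaluator:  # class name assumed for illustration
    def __init__(self, max_length: Optional[int] = None):
        self.max_length = max_length or DEFAULT_MAX_LENGTH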

Member (Author):

updated

Member:

I see the constant defined above but it doesn't seem like you're using it

model_path = self.model_path if model_path is None else model_path
tasks = self.tasks if not tasks else tasks
output_file = self.output_file if not output_file else output_file

Member:

We probably want to add a validation step here before we run through simple_evaluate to guarantee that the model_path, tasks and any other variables we've defined above are for sure not None (and any other validations that we would want to check, like ensuring output_file can be written to).
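A minimal sketch of that kind of pre-flight check (the function name validate_run_args is hypothetical; argument names come from the snippet above):

import os

def validate_run_args(model_path, tasks, output_file):
    # Fail fast before anything is handed to simple_evaluate.
    if not model_path:
        raise ValueError("model_path must be provided")
    if not tasks:
        raise ValueError("at least one RULER task must be specified")
    if output_file:
        out_dir = os.path.dirname(output_file) or "."
        if not os.access(out_dir, os.W_OK):
            raise ValueError(f"output_file directory is not writable: {out_dir}")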

Member (Author):

fixed, just raising errors for now

"""
unqiue_metrics_dict: dict[str, Any] = {}

def extract_metrics(results: dict, unqiue_metrics_dict: dict = {}):
Member:

You may want to add a comment here explaining why we need to do this so anyone who's unfamiliar with lm-eval-harness or even this particular benchmark would have an idea of what this solves. It would be good even to provide an example of the schema that we're dealing with just as a code comment.
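For example, a comment along these lines could spell out the nested shape being flattened (task and metric names here are illustrative, not the exact output of a RULER run):

# lm-eval-harness returns metrics nested per task, roughly:
#
# {
#     "results": {
#         "niah_single_1": {"acc,none": 0.95, "acc_stderr,none": 0.01},
#         "niah_single_2": {"acc,none": 0.91, ...},
#     },
#     "groups": {...},
# }
#
# extract_metrics walks this structure recursively and collects every metric
# under a unique key so the values can later be averaged per context length.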

Member:

Nice, this is perfect. Thank you @jaideepr97!

unique_float_metrics = {}
# if value is list of floats, average the list
for k, v in unqiue_metrics_dict.items():
if isinstance(v, list) and all(isinstance(i, float) for i in v):
Member:

Nice, I like this.


# result format
# {'8192': 0.90, '32768': 0.82, '65536': 0.77, '131072': 0.71, 'avg': 0.80}
self.results = unique_float_metrics
Member:

@jaideepr97 Would it make sense for this method to return this value instead of storing it? And then the caller can decide to store it on the object if they want? I.e. this can just be made into a static class method I feel like.
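Roughly what that refactor might look like (a sketch; the class name and the exact flattening logic are assumptions based on the snippets above):

from typing import Any

class RulerEvaluator:  # assumed class name
    @staticmethod
    def process_lm_eval_results(unique_metrics: dict[str, Any]) -> dict[str, float]:
        """Average list-of-float metrics and return the flattened result."""
        processed: dict[str, float] = {}
        for k, v in unique_metrics.items():
            if isinstance(v, list) and all(isinstance(i, float) for i in v):
                processed[k] = sum(v) / len(v)
            elif isinstance(v, float):
                processed[k] = v
        return processed

# The caller then decides whether to keep the value on the object:
# evaluator.results = RulerEvaluator.process_lm_eval_results(metrics)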

tasks=tasks,
)

self.process_lm_eval_results(
Member:

Here I would just get a value returned from this method, store it on the object and then write it to disk.

"max_length": max_length,
}

lm_eval_results = simple_evaluate(
Member:

Might be worthwhile to store the raw result from lm-eval-harness somewhere. While there are cleaner solutions, I think it would suffice to store it on a ._result field.
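For instance (the attribute name _result comes from the suggestion; the run method name and backend are assumed):

def run(self, tasks: list[str]) -> dict:
    lm_eval_results = simple_evaluate(
        model="local-completions",  # assumed backend
        tasks=tasks,
    )
    # Keep the untouched harness output around for debugging / re-processing.
    self._result = lm_eval_results
    self.results = self.process_lm_eval_results(lm_eval_results)
    return self.results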

@RobotSail (Member) left a comment:

LGTM!

self,
model_path: Optional[str] = None,
output_file: Optional[str] = None,
tasks: list[str] = RULER_TASKS,
Member:

In Python, you want to avoid using complex objects like lists, dicts, sets, etc. as defaults when declaring functions. The reason is that you are assigning the default as a pointer to the instantiated object (in this case RULER_TASKS), so every call shares that same object. The workaround we use is to set the function up like:

def foo(my_list: list = None):
  if not my_list:
    my_list = []

In the case of this __init__ method, since RULER_TASKS is a list which already contains data, we would want to create a copy of it for this default, which we can do with the slice notation ([:]):

def __init__(self, tasks: list[str] = None):
  if not tasks:
    # create copies of the strings in RULER_TASKS
    self.tasks = RULER_TASKS[:]
  else:
    self.tasks = tasks
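As a quick illustration of why the shared default matters (a standalone snippet, not code from this PR):

def append_task(task, tasks=[]):  # the default list is created once, at definition time
    tasks.append(task)
    return tasks

print(append_task("niah_single_1"))  # ['niah_single_1']
print(append_task("niah_single_2"))  # ['niah_single_1', 'niah_single_2']  <- state leaks across calls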


@RobotSail (Member) commented:
@mergify rebase

mergify bot (Contributor) commented Jun 2, 2025:

rebase

✅ Branch has been successfully rebased

@RobotSail RobotSail merged commit 19e2b99 into instructlab:main Jun 2, 2025
16 checks passed