Conversation

@jaideepr97 (Member) commented May 2, 2025

Adds the ability to run the RULER benchmark via lm-eval against an arbitrary model using a local OpenAI endpoint.
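For context, driving RULER through lm-eval against an OpenAI-compatible endpoint looks roughly like the sketch below; the local-completions backend, endpoint URL, model name, and task names are illustrative assumptions, not the exact invocation in this PR.

from lm_eval import simple_evaluate

# Point lm-eval-harness at an OpenAI-compatible completions endpoint
# (e.g. a locally served model) and run a subset of RULER tasks.
results = simple_evaluate(
    model="local-completions",
    model_args={
        "base_url": "http://localhost:8000/v1/completions",  # assumed local server
        "model": "my-model",      # placeholder model name
        "tokenizer": "my-model",  # placeholder tokenizer
    },
    tasks=["niah_single_1", "niah_single_2"],  # illustrative RULER task names
)
print(results["results"])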

@jaideepr97 jaideepr97 closed this May 2, 2025
@jaideepr97 jaideepr97 reopened this May 2, 2025
@jaideepr97 jaideepr97 marked this pull request as draft May 2, 2025 18:38
@jaideepr97 jaideepr97 changed the title test run feat: add RULER long context evaluation May 2, 2025
@jaideepr97 jaideepr97 changed the title feat: add RULER long context evaluation [WIP] feat: add RULER long context evaluation May 2, 2025
@jaideepr97 jaideepr97 force-pushed the add-ruler branch 2 times, most recently from f1ae970 to 006b4cc Compare May 6, 2025 15:32
@jaideepr97 jaideepr97 force-pushed the add-ruler branch 4 times, most recently from 1efb8f2 to a8e2f58 Compare May 6, 2025 15:46
@jaideepr97 jaideepr97 marked this pull request as ready for review May 6, 2025 15:46
@jaideepr97 jaideepr97 changed the title [WIP] feat: add RULER long context evaluation feat: add RULER long context evaluation May 6, 2025
@RobotSail (Member) left a comment:

Thank you for the PR @jaideepr97! I left some feedback, but overall this looks like it's in pretty good shape.

Comment on lines 42 to 45
vllm_config: Optional[Dict[str, Any]] = None,
hf_config: Optional[Dict[str, Any]] = None,
Member:

I don't think you need these since you're only calling this through a local OpenAI endpoint

model_path: Optional[str] = None,
output_file: Optional[str] = None,
tasks: list[str] = RULER_TASKS,
num_gpus: Optional[int] = None,
Member:

You can drop this too

eval_config: Optional[Dict[str, Any]] = None,
vllm_config: Optional[Dict[str, Any]] = None,
hf_config: Optional[Dict[str, Any]] = None,
openai_config: Optional[Dict[str, Any]] = None,
Member:

I see that you've declared this as an argument and are reading it further down, but I don't see it actually being passed into lm-eval-harness. You'll want to make sure we're doing this by following how I did it here:

final_openai_config = {**self.openai_config, **(openai_config or {})}
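i.e., something along these lines (a sketch; forwarding the merged dict as model_args and the local-completions backend are assumptions, not the code in this PR):

# Per-call overrides win over the instance-level defaults.
final_openai_config = {**self.openai_config, **(openai_config or {})}

# Forward the merged settings so lm-eval-harness actually receives them.
lm_eval_results = simple_evaluate(
    model="local-completions",
    model_args=final_openai_config,
    tasks=tasks,
)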

Member (Author):

Got rid of these for now, since they weren't being used.


self.api_endpoint = api_endpoint or None
self.num_gpus = num_gpus
self.max_length = max_length or 4096
Member:

4096 here seems arbitrary; I would move it up to be a top-level constant and include a comment explaining why you're picking this number as a default (it's the lm-eval-harness default, I'm pretty sure).
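For example (a sketch; the constant name and the surrounding class are illustrative):

from typing import Optional

# Default max context length when the caller does not pass one.
# 4096 mirrors the lm-eval-harness default rather than any model-specific limit.
DEFAULT_MAX_LENGTH = 4096

class RulerEvaluator:  # class name assumed for illustration
    def __init__(self, max_length: Optional[int] = None):
        self.max_length = max_length or DEFAULT_MAX_LENGTH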

Member (Author):

updated

Member:

I see the constant defined above but it doesn't seem like you're using it

model_path = self.model_path if model_path is None else model_path
tasks = self.tasks if not tasks else tasks
output_file = self.output_file if not output_file else output_file

Member:

We probably want to add a validation step here before we run through simple_evaluate to guarantee that the model_path, tasks and any other variables we've defined above are for sure not None (and any other validations that we would want to check, like ensuring output_file can be written to).
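A minimal sketch of that kind of pre-flight check (the function name validate_run_args is hypothetical; argument names come from the snippet above):

import os

def validate_run_args(model_path, tasks, output_file):
    # Fail fast before anything is handed to simple_evaluate.
    if not model_path:
        raise ValueError("model_path must be provided")
    if not tasks:
        raise ValueError("at least one RULER task must be specified")
    if output_file:
        out_dir = os.path.dirname(output_file) or "."
        if not os.access(out_dir, os.W_OK):
            raise ValueError(f"output_file directory is not writable: {out_dir}")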

Member (Author):

fixed, just raising errors for now

"""
unqiue_metrics_dict: dict[str, Any] = {}

def extract_metrics(results: dict, unqiue_metrics_dict: dict = {}):
Member:

You may want to add a comment here explaining why we need to do this so anyone who's unfamiliar with lm-eval-harness or even this particular benchmark would have an idea of what this solves. It would be good even to provide an example of the schema that we're dealing with just as a code comment.
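For example, a comment along these lines could spell out the nested shape being flattened (task and metric names here are illustrative, not the exact output of a RULER run):

# lm-eval-harness returns metrics nested per task, roughly:
#
# {
#     "results": {
#         "niah_single_1": {"acc,none": 0.95, "acc_stderr,none": 0.01},
#         "niah_single_2": {"acc,none": 0.91, ...},
#     },
#     "groups": {...},
# }
#
# extract_metrics walks this structure recursively and collects every metric
# under a unique key so the values can later be averaged per context length.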

Member:

Nice, this is perfect. Thank you @jaideepr97!

unique_float_metrics = {}
# if value is list of floats, average the list
for k, v in unqiue_metrics_dict.items():
if isinstance(v, list) and all(isinstance(i, float) for i in v):
Member:

Nice, I like this.


# result format
# {'8192': 0.90, '32768': 0.82, '65536': 0.77, '131072': 0.71, 'avg': 0.80}
self.results = unique_float_metrics
Member:

@jaideepr97 Would it make sense for this method to return this value instead of storing it? And then the caller can decide to store it on the object if they want? I.e. this can just be made into a static class method I feel like.
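Roughly what that refactor might look like (a sketch; the class name and the exact flattening logic are assumptions based on the snippets above):

from typing import Any

class RulerEvaluator:  # assumed class name
    @staticmethod
    def process_lm_eval_results(unique_metrics: dict[str, Any]) -> dict[str, float]:
        """Average list-of-float metrics and return the flattened result."""
        processed: dict[str, float] = {}
        for k, v in unique_metrics.items():
            if isinstance(v, list) and all(isinstance(i, float) for i in v):
                processed[k] = sum(v) / len(v)
            elif isinstance(v, float):
                processed[k] = v
        return processed

# The caller then decides whether to keep the value on the object:
# evaluator.results = RulerEvaluator.process_lm_eval_results(metrics)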

tasks=tasks,
)

self.process_lm_eval_results(
Member:

Here I would just get a value returned from this method, store it on the object and then write it to disk.

"max_length": max_length,
}

lm_eval_results = simple_evaluate(
Member:

Might be worthwhile to store the raw result from lm-eval-harness somewhere. While there are cleaner solutions, I think it would suffice to store it on a ._result field.
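For instance (the attribute name _result comes from the suggestion; the run method name and backend are assumed):

def run(self, tasks: list[str]) -> dict:
    lm_eval_results = simple_evaluate(
        model="local-completions",  # assumed backend
        tasks=tasks,
    )
    # Keep the untouched harness output around for debugging / re-processing.
    self._result = lm_eval_results
    self.results = self.process_lm_eval_results(lm_eval_results)
    return self.results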

@RobotSail (Member) left a comment:

LGTM!

self,
model_path: Optional[str] = None,
output_file: Optional[str] = None,
tasks: list[str] = RULER_TASKS,
Member:

In Python, you want to avoid using complex objects like lists, dicts, sets, etc. as defaults when declaring functions. The reason is that you are assigning the default as a pointer to the instantiated object (in this case RULER_TASKS), so every call shares that same object. The workaround we use is to set the function up like:

def foo(my_list: list = None):
  if not my_list:
    my_list = []

In the case of this __init__ method, since RULER_TASKS is a list which already contains data, we would want to create a copy of it for this default, which we can do with the slice notation ([:]):

def __init__(self, tasks: list[str] = None):
  if not tasks:
    # create copies of the strings in RULER_TASKS
    self.tasks = RULER_TASKS[:]
  else:
    self.tasks = tasks
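As a quick illustration of why the shared default matters (a standalone snippet, not code from this PR):

def append_task(task, tasks=[]):  # the default list is created once, at definition time
    tasks.append(task)
    return tasks

print(append_task("niah_single_1"))  # ['niah_single_1']
print(append_task("niah_single_2"))  # ['niah_single_1', 'niah_single_2']  <- state leaks across calls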


@RobotSail (Member) commented:
@mergify rebase

mergify bot (Contributor) commented Jun 2, 2025:

rebase

✅ Branch has been successfully rebased

@RobotSail RobotSail merged commit 19e2b99 into instructlab:main Jun 2, 2025
16 checks passed