feat: add RULER long context evaluation #250
Conversation
RobotSail left a comment
Thank you for the PR @jaideepr97! I left some feedback, but overall this looks like it's in pretty good shape.
src/instructlab/eval/ruler.py (Outdated)

    vllm_config: Optional[Dict[str, Any]] = None,
    hf_config: Optional[Dict[str, Any]] = None,
I don't think you need these since you're only calling this through a local OpenAI endpoint
src/instructlab/eval/ruler.py (Outdated)

    model_path: Optional[str] = None,
    output_file: Optional[str] = None,
    tasks: list[str] = RULER_TASKS,
    num_gpus: Optional[int] = None,
You can drop this too
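For illustration, a trimmed-down constructor along the lines of these two suggestions might look like the sketch below. The class name and the surviving parameters are assumptions pieced together from the quoted diffs, not the PR's actual code.

```python
from typing import Any, Dict, Optional


class RulerEvaluator:
    """Hypothetical sketch: backend-specific knobs (vllm_config, hf_config,
    num_gpus) are dropped, keeping only what a local OpenAI-compatible
    endpoint needs."""

    def __init__(
        self,
        model_path: Optional[str] = None,
        api_endpoint: Optional[str] = None,
        output_file: Optional[str] = None,
        tasks: Optional[list[str]] = None,
        max_length: Optional[int] = None,
        openai_config: Optional[Dict[str, Any]] = None,
    ) -> None:
        self.model_path = model_path
        self.api_endpoint = api_endpoint
        self.output_file = output_file
        self.tasks = list(tasks) if tasks else []
        self.max_length = max_length
        self.openai_config = openai_config or {}
```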
src/instructlab/eval/ruler.py (Outdated)

    eval_config: Optional[Dict[str, Any]] = None,
    vllm_config: Optional[Dict[str, Any]] = None,
    hf_config: Optional[Dict[str, Any]] = None,
    openai_config: Optional[Dict[str, Any]] = None,
I see that you've declared this as an argument and are reading it further down, but I don't see it actually being passed into lm-eval-harness. You'll want to make sure we're doing this by following how I did it here:
eval/src/instructlab/eval/leaderboard.py, line 745 in f48c6a1:

    final_openai_config = {**self.openai_config, **(openai_config or {})}
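A minimal sketch of that pattern applied here, assuming the evaluator keeps a self.openai_config dict and calls lm-eval-harness's simple_evaluate with the local-completions model type; names other than final_openai_config are assumptions, not the PR's code.

```python
from lm_eval import simple_evaluate  # lm-eval-harness

# Merge instance-level config with per-call overrides (per-call values win).
final_openai_config = {**self.openai_config, **(openai_config or {})}

lm_eval_results = simple_evaluate(
    model="local-completions",  # OpenAI-compatible completions endpoint
    model_args={
        "model": model_path,
        "base_url": self.api_endpoint,
        **final_openai_config,  # forwarded so the config actually reaches lm-eval
    },
    tasks=tasks,
)
```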
got rid of these for now, since they weren't being used
    self.api_endpoint = api_endpoint or None
    self.num_gpus = num_gpus
    self.max_length = max_length or 4096
4096 here seems arbitrary; I would move it up to be a top-level constant and include a comment explaining why you're picking this number as a default (it's the lm-eval-harness default, I'm pretty sure).
updated
I see the constant defined above but it doesn't seem like you're using it
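A hedged sketch of what that might look like; the constant name and the surrounding class are assumptions for illustration.

```python
from typing import Optional

# Default max context length; 4096 appears to be the lm-eval-harness default,
# so we mirror it here rather than hard-coding a magic number inline.
DEFAULT_MAX_LENGTH = 4096


class RulerEvaluator:
    def __init__(self, max_length: Optional[int] = None) -> None:
        # Fall back to the module-level constant instead of an inline literal.
        self.max_length = max_length or DEFAULT_MAX_LENGTH
```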
    model_path = self.model_path if model_path is None else model_path
    tasks = self.tasks if not tasks else tasks
    output_file = self.output_file if not output_file else output_file
We probably want to add a validation step here before we run through simple_evaluate to guarantee that model_path, tasks, and any other variables we've defined above are for sure not None (and any other validations that we would want to check, like ensuring output_file can be written to).
fixed, just raising errors for now
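A minimal sketch of the kind of checks being suggested, raising errors before calling simple_evaluate; variable names mirror the quoted diff, but this is illustrative rather than the PR's exact code.

```python
import os

if model_path is None:
    raise ValueError("model_path must be provided to run the RULER benchmark")
if not tasks:
    raise ValueError("at least one RULER task must be specified")
if output_file:
    # Make sure we will actually be able to write the results out later.
    out_dir = os.path.dirname(output_file) or "."
    if not os.access(out_dir, os.W_OK):
        raise ValueError(f"cannot write results: {out_dir} is not writable")
```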
| """ | ||
| unqiue_metrics_dict: dict[str, Any] = {} | ||
|
|
||
| def extract_metrics(results: dict, unqiue_metrics_dict: dict = {}): |
You may want to add a comment here explaining why we need to do this so anyone who's unfamiliar with lm-eval-harness or even this particular benchmark would have an idea of what this solves. It would be good even to provide an example of the schema that we're dealing with just as a code comment.
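For instance, something along these lines: the nested shape in the comment is an illustrative assumption about lm-eval-harness's RULER output, not the exact schema, and the helper is a sketch rather than the PR's implementation.

```python
from typing import Optional

# lm-eval-harness returns metrics nested per task/group, roughly like:
#
#   {
#       "results": {
#           "niah_single_1": {"4096": 0.98, "8192": 0.95},
#           "niah_multikey_2": {"4096": 0.91, "8192": 0.84},
#       },
#       ...
#   }
#
# (illustrative shape only). To report one score per context length we walk the
# nesting and collect every float under its key, accumulating duplicates into
# lists so they can be averaged afterwards.
def extract_metrics(results: dict, unique_metrics: Optional[dict] = None) -> dict:
    if unique_metrics is None:  # avoid a shared mutable default argument
        unique_metrics = {}
    for key, value in results.items():
        if isinstance(value, dict):
            extract_metrics(value, unique_metrics)
        elif isinstance(value, float):
            unique_metrics.setdefault(key, []).append(value)
    return unique_metrics
```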
Nice, this is perfect. Thank you @jaideepr97 !
    unique_float_metrics = {}
    # if value is list of floats, average the list
    for k, v in unqiue_metrics_dict.items():
        if isinstance(v, list) and all(isinstance(i, float) for i in v):
Nice I like this
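For reference, a hypothetical completion of the quoted loop (names copied from the quoted diff; this may not match the PR's final code exactly):

```python
unique_float_metrics = {}
for k, v in unqiue_metrics_dict.items():
    # if value is a list of floats, average the list
    if isinstance(v, list) and all(isinstance(i, float) for i in v):
        unique_float_metrics[k] = sum(v) / len(v)
```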
src/instructlab/eval/ruler.py (Outdated)

    # result format
    # {'8192': 0.90, '32768': 0.82, '65536': 0.77, '131072': 0.71, 'avg': 0.80}
    self.results = unique_float_metrics
@jaideepr97 Would it make sense for this method to return this value instead of storing it? Then the caller can decide whether to store it on the object, i.e. I feel like this could just be made into a static method.
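In other words, something like this sketch (class, method, and variable names are assumptions, not the PR's code):

```python
class RulerEvaluator:
    @staticmethod
    def process_lm_eval_results(lm_eval_results: dict) -> dict:
        """Flatten lm-eval-harness output into a per-context-length summary."""
        summary: dict = {}
        # ... walk lm_eval_results and average the per-context-length metrics ...
        return summary
```

The caller can then decide whether to keep the returned dict on the instance, as discussed in the next comment.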
src/instructlab/eval/ruler.py (Outdated)

        tasks=tasks,
    )

    self.process_lm_eval_results(
Here I would just get a value returned from this method, store it on the object and then write it to disk.
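Roughly this flow in the caller, building on the static-method sketch above; the JSON output format and the names are assumptions.

```python
import json

# Inside run(): keep the processed summary on the object, then persist it.
self.results = RulerEvaluator.process_lm_eval_results(lm_eval_results)
if output_file:
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(self.results, f, indent=2)
```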
src/instructlab/eval/ruler.py (Outdated)

        "max_length": max_length,
    }

    lm_eval_results = simple_evaluate(
Might be worthwhile to store the raw result from lm-eval-harness somewhere. While there are cleaner solutions, I think it would suffice to store it on a ._result field.
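For example, a two-line fragment along these lines (attribute names are assumptions):

```python
# Keep the raw harness output for debugging alongside the processed summary.
self._result = lm_eval_results
self.results = self.process_lm_eval_results(lm_eval_results)
```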
RobotSail left a comment
LGTM!
    self,
    model_path: Optional[str] = None,
    output_file: Optional[str] = None,
    tasks: list[str] = RULER_TASKS,
In Python, you want to avoid using complex objects like lists, dicts, sets, etc. as defaults when declaring functions. The reason is that you are assigning the default as a pointer to the instantiated object (in this case RULER_TASKS). So the workaround we do is to set the function up like:

    def foo(my_list: list = None):
        if not my_list:
            my_list = []

In the case of this __init__ method, since RULER_TASKS is a list which already contains data, we would want to create a copy of it for this default, which we can do with the slice notation ([:]):

    def __init__(self, tasks: list[str] = None):
        if not tasks:
            # create copies of the strings in RULER_TASKS
            self.tasks = RULER_TASKS[:]
        else:
            self.tasks = tasks
| """ | ||
| unqiue_metrics_dict: dict[str, Any] = {} | ||
|
|
||
| def extract_metrics(results: dict, unqiue_metrics_dict: dict = {}): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, this is perfect. Thank you @jaideepr97 !
@mergify rebase
Signed-off-by: Jaideep Rao <[email protected]>
✅ Branch has been successfully rebased
Adds the ability to run the RULER benchmark via lm-eval against an arbitrary model using a local OpenAI-compatible endpoint.