Adjust trait listing to be evaluation-type agnostic #958
base: master
Conversation
Summary of Changes

Hello @dittops, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request refactors how evaluation traits and datasets are listed and how evaluation tasks are configured and executed. The primary goal is to decouple trait visibility from specific evaluation types and to introduce more flexible and dynamic handling of evaluation task types across different datasets and models. This improves the system's adaptability to various evaluation modes beyond just 'gen' (generation).
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request successfully makes the trait listing evaluation-type agnostic and simplifies the API for querying traits. The changes are primarily within the eval_ops services and routes. My review has identified a high-severity bug in the list_traits query that could lead to incorrect results and pagination. I have also provided several medium-severity suggestions to improve code structure and maintainability, such as moving helper functions and business logic out of route handlers and into more appropriate service layers. Addressing the bug is critical, and the other suggestions should improve the overall quality of the code.
```python
q = (
    self.session.query(TraitModel)
    .join(PivotModel, TraitModel.id == PivotModel.trait_id)
    .join(DatasetModel, PivotModel.dataset_id == DatasetModel.id)
    .filter(DatasetModel.eval_types.op("?")("gen"))  # Filter datasets with 'gen' key in eval_types
    .distinct()
)
```
The removal of .distinct() from this query will cause the function to return duplicate traits if a trait is associated with multiple datasets. Furthermore, the total_count calculated later will be incorrect, as it will count the total number of trait-dataset associations rather than the number of unique traits, leading to incorrect pagination. Please add .distinct() back to the query to ensure unique traits are returned and the count is accurate.
Suggested change:

```python
q = (
    self.session.query(TraitModel)
    .join(PivotModel, TraitModel.id == PivotModel.trait_id)
    .join(DatasetModel, PivotModel.dataset_id == DatasetModel.id).distinct()
)
```
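For pagination to stay correct alongside the restored `.distinct()`, the total count should also be computed over unique traits rather than joined rows. A minimal sketch, assuming the same session and models as the query above; the `offset`/`limit` names are hypothetical pagination parameters, not the repository's actual ones:

```python
from sqlalchemy import distinct, func

# Count unique traits, not trait-dataset join rows, so the page math is accurate.
total_count = (
    self.session.query(func.count(distinct(TraitModel.id)))
    .select_from(TraitModel)
    .join(PivotModel, TraitModel.id == PivotModel.trait_id)
    .scalar()
)

# Page over the de-duplicated query itself.
traits = q.offset(offset).limit(limit).all()
```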
```diff
     trait_ids=trait_id_list,
-    has_gen_eval_type=has_gen_eval_type,
+    eval_type=eval_type,
+    has_gen_eval_type=has_gen_eval_type if eval_type is None else None,
```
The logic to make eval_type and has_gen_eval_type mutually exclusive is currently in the route handler. It's generally better to keep route handlers thin and move such business logic into the service layer. This improves separation of concerns and makes the code easier to test and maintain. The service layer's list_datasets method already handles this logic correctly, so you can simplify this call.
Suggested change:

```diff
-    has_gen_eval_type=has_gen_eval_type if eval_type is None else None,
+    has_gen_eval_type=has_gen_eval_type,
```
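For context, the precedence rule the service layer applies could look roughly like the sketch below. This is not the repository's actual `list_datasets` implementation; the JSONB `op("?")` filter and model names are taken from the diff above, and the trait join is an assumption:

```python
def list_datasets(self, trait_ids=None, eval_type=None, has_gen_eval_type=None):
    # An explicit eval_type supersedes the legacy 'gen' flag, so callers
    # may pass both without producing contradictory filters.
    if eval_type is not None:
        has_gen_eval_type = None

    q = self.session.query(DatasetModel)
    if trait_ids:
        q = q.join(PivotModel, DatasetModel.id == PivotModel.dataset_id).filter(
            PivotModel.trait_id.in_(trait_ids)
        )
    if eval_type is not None:
        q = q.filter(DatasetModel.eval_types.op("?")(eval_type))
    elif has_gen_eval_type:
        q = q.filter(DatasetModel.eval_types.op("?")("gen"))
    return q.distinct().all()
```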
```python
def _resolve_requested_task_type(config: dict | None) -> str:
    if not config:
        return "gen"

    for key in ["task_type", "eval_type", "evaluation_type"]:
        value = config.get(key)
        if isinstance(value, dict) and value.get("mode"):
            return str(value.get("mode")).lower()
        if isinstance(value, str):
            return value.lower()
    return "gen"


def _pick_dataset_eval(dataset, requested_type: str) -> tuple[str | None, str | None]:
    eval_types = dataset.eval_types if isinstance(dataset.eval_types, dict) else {}
    if requested_type and requested_type in eval_types:
        return requested_type, eval_types[requested_type]

    if "gen" in eval_types:
        logger.debug(
            "Defaulting to 'gen' eval_type for dataset %s because requested type '%s' is unavailable",
            dataset.name,
            requested_type,
        )
        return "gen", eval_types["gen"]

    if eval_types:
        fallback_type, fallback_config = next(iter(eval_types.items()))
        logger.warning(
            "Dataset %s does not support requested eval_type '%s'; falling back to '%s'",
            dataset.name,
            requested_type,
            fallback_type,
        )
        return str(fallback_type), fallback_config

    return None, None
```
The helper functions _resolve_requested_task_type and _pick_dataset_eval are defined within the _trigger_evaluations_for_experiment_and_get_response method. Nesting functions like this can make the code harder to read, test, and reuse. Since these helpers don't rely on the outer method's local scope (other than logger), they would be better as static methods of the class or as module-level functions.
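One way to apply this, sketched under the assumption that the helpers only depend on a module-level `logger`; the class and method signatures here are illustrative, not the repository's exact ones:

```python
import logging

logger = logging.getLogger(__name__)


def _resolve_requested_task_type(config: dict | None) -> str:
    ...  # body unchanged from the diff above


def _pick_dataset_eval(dataset, requested_type: str) -> tuple[str | None, str | None]:
    ...  # body unchanged from the diff above


class ExperimentService:
    def _trigger_evaluations_for_experiment_and_get_response(self, dataset, run_config):
        # The method now calls plain module-level functions, which can be
        # unit-tested without instantiating the service.
        requested_type = _resolve_requested_task_type(run_config)
        eval_type, eval_config = _pick_dataset_eval(dataset, requested_type)
        return eval_type, eval_config
```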
```python
# Add datasets from each run (avoiding duplicates)
```
The comment on this line, # Add datasets from each run (avoiding duplicates), is now misleading. The previous logic for avoiding duplicates has been removed, and the current implementation appends a dataset for every run. If evaluating each run is the intended behavior, please update the comment to reflect this by removing the (avoiding duplicates) part to prevent confusion for future developers.
Suggested change:

```diff
-# Add datasets from each run (avoiding duplicates)
+# Add datasets from each run
```
```python
def _coerce_bool_flag(value, default: bool = False) -> bool:
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        return value.strip().lower() in {"1", "true", "yes", "y", "on"}
    return default
```
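As a quick illustration (not part of the PR), the coercion behaves as follows:

```python
assert _coerce_bool_flag(True) is True
assert _coerce_bool_flag(" Yes ") is True       # whitespace and case are normalized
assert _coerce_bool_flag("off") is False        # strings outside the truthy set coerce to False
assert _coerce_bool_flag(1) is False            # non-bool, non-str values fall back to the default
assert _coerce_bool_flag(None, default=True) is True
```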