
Conversation

@dittops
Member

@dittops dittops commented Dec 1, 2025

Summary

  • stop filtering trait listings by evaluation type so traits like reasoning, logic, and factuality remain visible regardless of eval mode
  • keep trait queries constrained to datasets they are linked to while simplifying the API surface

Testing

  • python -m compileall services/budapp/budapp/eval_ops/eval_routes.py services/budapp/budapp/eval_ops/services.py

Codex Task

@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@gemini-code-assist
Contributor

Summary of Changes

Hello @dittops, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors how evaluation traits and datasets are listed and how evaluation tasks are configured and executed. The primary goal is to decouple trait visibility from specific evaluation types and to introduce more flexible and dynamic handling of evaluation task types across different datasets and models. This improves the system's adaptability to various evaluation modes beyond just 'gen' (generation).

Highlights

  • Trait Listing Flexibility: The API for listing traits (list_traits) no longer filters by a specific evaluation type ('gen'), ensuring all traits linked to any dataset are visible regardless of their associated evaluation modes.
  • Dynamic Dataset Filtering: The list_datasets API now supports filtering by a specific eval_type (e.g., 'ppl', 'judge') via a new query parameter, providing more granular control over dataset discovery (see the sketch after this list).
  • Enhanced Evaluation Task Type Resolution: The evaluation triggering mechanism has been improved to dynamically determine the appropriate evaluation task type for each dataset, including robust fallback logic and support for handling multiple task types within a single experiment.
  • OpenCompass Integration for Task Types: The OpenCompass transformer now correctly interprets and applies the dynamically resolved evaluation task types, including conditional handling of logprobs for PPL (Perplexity) evaluations based on model capabilities.
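
To make the new dataset filter concrete, here is a minimal sketch of how the eval_type parameter could drive the query. Only DatasetModel, its eval_types JSONB column, and the 'gen'/'ppl'/'judge' keys come from this PR; the helper name and query shape below are assumptions.

from typing import Optional

# Hypothetical helper: parameterize the previously hard-coded 'gen'
# filter on the requested eval_type. DatasetModel.eval_types is the
# JSONB column used elsewhere in this diff.
def filter_by_eval_type(query, eval_type: Optional[str]):
    if eval_type:
        # JSONB "?" operator: keep datasets whose eval_types object
        # contains the requested key, e.g. 'ppl' or 'judge'.
        query = query.filter(DatasetModel.eval_types.op("?")(eval_type))
    return query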


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request successfully makes the trait listing evaluation-type agnostic and simplifies the API for querying traits. The changes are primarily within the eval_ops services and routes. My review has identified a high-severity bug in the list_traits query that could lead to incorrect results and pagination. I have also provided several medium-severity suggestions to improve code structure and maintainability, such as moving helper functions and business logic out of route handlers and into more appropriate service layers. Addressing the bug is critical, and the other suggestions should improve the overall quality of the code.

Comment on lines 1045 to 1049
q = (
    self.session.query(TraitModel)
    .join(PivotModel, TraitModel.id == PivotModel.trait_id)
    .join(DatasetModel, PivotModel.dataset_id == DatasetModel.id)
    .filter(DatasetModel.eval_types.op("?")("gen"))  # Filter datasets with 'gen' key in eval_types
    .distinct()
)

high

The removal of .distinct() from this query will cause the function to return duplicate traits if a trait is associated with multiple datasets. Furthermore, the total_count calculated later will be incorrect, as it will count the total number of trait-dataset associations rather than the number of unique traits, leading to incorrect pagination. Please add .distinct() back to the query to ensure unique traits are returned and the count is accurate.

Suggested change
-q = (
-    self.session.query(TraitModel)
-    .join(PivotModel, TraitModel.id == PivotModel.trait_id)
-    .join(DatasetModel, PivotModel.dataset_id == DatasetModel.id)
-    .filter(DatasetModel.eval_types.op("?")("gen"))  # Filter datasets with 'gen' key in eval_types
-    .distinct()
-)
+q = (
+    self.session.query(TraitModel)
+    .join(PivotModel, TraitModel.id == PivotModel.trait_id)
+    .join(DatasetModel, PivotModel.dataset_id == DatasetModel.id)
+    .distinct()
+)
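
To see why .distinct() matters for the count, here is a minimal sketch of the pagination path; only q comes from the diff, and the page/limit variables are assumed.

# With .distinct() restored, SQLAlchemy's Query.count() wraps the query
# in a subquery, so total_count reflects unique traits rather than
# trait-dataset join rows.
total_count = q.count()
traits = q.offset((page - 1) * limit).limit(limit).all()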

 trait_ids=trait_id_list,
-has_gen_eval_type=has_gen_eval_type,
+eval_type=eval_type,
+has_gen_eval_type=has_gen_eval_type if eval_type is None else None,

medium

The logic to make eval_type and has_gen_eval_type mutually exclusive is currently in the route handler. It's generally better to keep route handlers thin and move such business logic into the service layer. This improves separation of concerns and makes the code easier to test and maintain. The service layer's list_datasets method already handles this logic correctly, so you can simplify this call.

Suggested change
-has_gen_eval_type=has_gen_eval_type if eval_type is None else None,
+has_gen_eval_type=has_gen_eval_type,
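
A minimal sketch of what the service-side rule could look like; only the parameter names come from the diff, and the method shape is assumed.

# Hypothetical service-layer method: the route passes both filters
# through unchanged, and the precedence rule lives here.
def list_datasets(self, *, eval_type=None, has_gen_eval_type=None, **filters):
    if eval_type is not None:
        # An explicit eval_type supersedes the legacy 'gen' flag.
        has_gen_eval_type = None
    ...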

Comment on lines +4391 to +4426
def _resolve_requested_task_type(config: dict | None) -> str:
    if not config:
        return "gen"

    for key in ["task_type", "eval_type", "evaluation_type"]:
        value = config.get(key)
        if isinstance(value, dict) and value.get("mode"):
            return str(value.get("mode")).lower()
        if isinstance(value, str):
            return value.lower()
    return "gen"

def _pick_dataset_eval(dataset, requested_type: str) -> tuple[str | None, str | None]:
    eval_types = dataset.eval_types if isinstance(dataset.eval_types, dict) else {}
    if requested_type and requested_type in eval_types:
        return requested_type, eval_types[requested_type]

    if "gen" in eval_types:
        logger.debug(
            "Defaulting to 'gen' eval_type for dataset %s because requested type '%s' is unavailable",
            dataset.name,
            requested_type,
        )
        return "gen", eval_types["gen"]

    if eval_types:
        fallback_type, fallback_config = next(iter(eval_types.items()))
        logger.warning(
            "Dataset %s does not support requested eval_type '%s'; falling back to '%s'",
            dataset.name,
            requested_type,
            fallback_type,
        )
        return str(fallback_type), fallback_config

    return None, None

medium

The helper functions _resolve_requested_task_type and _pick_dataset_eval are defined within the _trigger_evaluations_for_experiment_and_get_response method. Nesting functions like this can make the code harder to read, test, and reuse. Since these helpers don't rely on the outer method's local scope (other than logger), they would be better as static methods of the class or as module-level functions.
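
For example, the first helper could be lifted onto the class unchanged; the class name below is illustrative, while the body is copied from the diff.

class EvaluationService:  # illustrative name
    @staticmethod
    def _resolve_requested_task_type(config: dict | None) -> str:
        if not config:
            return "gen"

        for key in ["task_type", "eval_type", "evaluation_type"]:
            value = config.get(key)
            if isinstance(value, dict) and value.get("mode"):
                return str(value.get("mode")).lower()
            if isinstance(value, str):
                return value.lower()
        return "gen"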


return None, None

# Add datasets from each run (avoiding duplicates)

medium

The comment on this line, # Add datasets from each run (avoiding duplicates), is now misleading. The previous logic for avoiding duplicates has been removed, and the current implementation appends a dataset for every run. If evaluating each run is the intended behavior, please update the comment to reflect this by removing the (avoiding duplicates) part to prevent confusion for future developers.

Suggested change
-# Add datasets from each run (avoiding duplicates)
+# Add datasets from each run

Comment on lines +20 to +25
def _coerce_bool_flag(value, default: bool = False) -> bool:
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        return value.strip().lower() in {"1", "true", "yes", "y", "on"}
    return default

medium

The helper function _coerce_bool_flag is a general utility that is not specific to the transform method. Defining it inside the method limits its reusability and makes the transform method longer. It would be better to define it at the module level so it can be reused elsewhere.
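
A sketch of the module-level placement; the helper body is copied from the diff, and the usage checks below are illustrative.

# Module level: reusable by transform() and any other call site.
def _coerce_bool_flag(value, default: bool = False) -> bool:
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        return value.strip().lower() in {"1", "true", "yes", "y", "on"}
    return default

# Illustrative usage:
assert _coerce_bool_flag("Yes ") is True       # string forms are normalized
assert _coerce_bool_flag(None, default=True)   # non-bool, non-str falls back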

