
Conversation

@dittops
Member

@dittops dittops commented Dec 1, 2025

Summary

  • stop filtering trait listings by evaluation type so traits like reasoning, logic, and factuality remain visible regardless of eval mode
  • keep trait queries constrained to datasets they are linked to while simplifying the API surface

Testing

  • python -m compileall services/budapp/budapp/eval_ops/eval_routes.py services/budapp/budapp/eval_ops/services.py

Codex Task

@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@gemini-code-assist
Contributor

Summary of Changes

Hello @dittops, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors how evaluation traits and datasets are listed and how evaluation tasks are configured and executed. The primary goal is to decouple trait visibility from specific evaluation types and to introduce more flexible and dynamic handling of evaluation task types across different datasets and models. This improves the system's adaptability to various evaluation modes beyond just 'gen' (generation).

Highlights

  • Trait Listing Flexibility: The API for listing traits (list_traits) no longer filters by a specific evaluation type ('gen'), ensuring all traits linked to any dataset are visible regardless of their associated evaluation modes.
  • Dynamic Dataset Filtering: The list_datasets API now supports filtering by a specific eval_type (e.g., 'ppl', 'judge') via a new query parameter, providing more granular control over dataset discovery (see the sketch after this list).
  • Enhanced Evaluation Task Type Resolution: The evaluation triggering mechanism has been improved to dynamically determine the appropriate evaluation task type for each dataset, including robust fallback logic and support for handling multiple task types within a single experiment.
  • OpenCompass Integration for Task Types: The OpenCompass transformer now correctly interprets and applies the dynamically resolved evaluation task types, including conditional handling of logprobs for PPL (Perplexity) evaluations based on model capabilities.
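
To make the new dataset filter concrete, here is a minimal sketch of how the eval_type parameter could drive the query. Only DatasetModel, its eval_types JSONB column, and the 'gen'/'ppl'/'judge' keys come from this PR; the helper name and query shape below are assumptions.

from typing import Optional

# Hypothetical helper: parameterize the previously hard-coded 'gen'
# filter on the requested eval_type. DatasetModel.eval_types is the
# JSONB column used elsewhere in this diff.
def filter_by_eval_type(query, eval_type: Optional[str]):
    if eval_type:
        # JSONB "?" operator: keep datasets whose eval_types object
        # contains the requested key, e.g. 'ppl' or 'judge'.
        query = query.filter(DatasetModel.eval_types.op("?")(eval_type))
    return query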


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request successfully makes the trait listing evaluation-type agnostic and simplifies the API for querying traits. The changes are primarily within the eval_ops services and routes. My review has identified a high-severity bug in the list_traits query that could lead to incorrect results and pagination. I have also provided several medium-severity suggestions to improve code structure and maintainability, such as moving helper functions and business logic out of route handlers and into more appropriate service layers. Addressing the bug is critical, and the other suggestions should improve the overall quality of the code.

Comment on lines 1045 to 1049
q = (
    self.session.query(TraitModel)
    .join(PivotModel, TraitModel.id == PivotModel.trait_id)
    .join(DatasetModel, PivotModel.dataset_id == DatasetModel.id)
    .filter(DatasetModel.eval_types.op("?")("gen"))  # Filter datasets with 'gen' key in eval_types
    .distinct()
)

high

The removal of .distinct() from this query will cause the function to return duplicate traits if a trait is associated with multiple datasets. Furthermore, the total_count calculated later will be incorrect, as it will count the total number of trait-dataset associations rather than the number of unique traits, leading to incorrect pagination. Please add .distinct() back to the query to ensure unique traits are returned and the count is accurate.

Suggested change
-q = (
-    self.session.query(TraitModel)
-    .join(PivotModel, TraitModel.id == PivotModel.trait_id)
-    .join(DatasetModel, PivotModel.dataset_id == DatasetModel.id)
-    .filter(DatasetModel.eval_types.op("?")("gen"))  # Filter datasets with 'gen' key in eval_types
-    .distinct()
-)
+q = (
+    self.session.query(TraitModel)
+    .join(PivotModel, TraitModel.id == PivotModel.trait_id)
+    .join(DatasetModel, PivotModel.dataset_id == DatasetModel.id)
+    .distinct()
+)
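
To see why .distinct() matters for the count, here is a minimal sketch of the pagination path; only q comes from the diff, and the page/limit variables are assumed.

# With .distinct() restored, SQLAlchemy's Query.count() wraps the query
# in a subquery, so total_count reflects unique traits rather than
# trait-dataset join rows.
total_count = q.count()
traits = q.offset((page - 1) * limit).limit(limit).all()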

 trait_ids=trait_id_list,
-has_gen_eval_type=has_gen_eval_type,
+eval_type=eval_type,
+has_gen_eval_type=has_gen_eval_type if eval_type is None else None,

medium

The logic to make eval_type and has_gen_eval_type mutually exclusive is currently in the route handler. It's generally better to keep route handlers thin and move such business logic into the service layer. This improves separation of concerns and makes the code easier to test and maintain. The service layer's list_datasets method already handles this logic correctly, so you can simplify this call.

Suggested change
-has_gen_eval_type=has_gen_eval_type if eval_type is None else None,
+has_gen_eval_type=has_gen_eval_type,
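
A minimal sketch of what the service-side rule could look like; only the parameter names come from the diff, and the method shape is assumed.

# Hypothetical service-layer method: the route passes both filters
# through unchanged, and the precedence rule lives here.
def list_datasets(self, *, eval_type=None, has_gen_eval_type=None, **filters):
    if eval_type is not None:
        # An explicit eval_type supersedes the legacy 'gen' flag.
        has_gen_eval_type = None
    ...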

Comment on lines +4391 to +4426
def _resolve_requested_task_type(config: dict | None) -> str:
    if not config:
        return "gen"

    for key in ["task_type", "eval_type", "evaluation_type"]:
        value = config.get(key)
        if isinstance(value, dict) and value.get("mode"):
            return str(value.get("mode")).lower()
        if isinstance(value, str):
            return value.lower()
    return "gen"

def _pick_dataset_eval(dataset, requested_type: str) -> tuple[str | None, str | None]:
    eval_types = dataset.eval_types if isinstance(dataset.eval_types, dict) else {}
    if requested_type and requested_type in eval_types:
        return requested_type, eval_types[requested_type]

    if "gen" in eval_types:
        logger.debug(
            "Defaulting to 'gen' eval_type for dataset %s because requested type '%s' is unavailable",
            dataset.name,
            requested_type,
        )
        return "gen", eval_types["gen"]

    if eval_types:
        fallback_type, fallback_config = next(iter(eval_types.items()))
        logger.warning(
            "Dataset %s does not support requested eval_type '%s'; falling back to '%s'",
            dataset.name,
            requested_type,
            fallback_type,
        )
        return str(fallback_type), fallback_config

    return None, None

medium

The helper functions _resolve_requested_task_type and _pick_dataset_eval are defined within the _trigger_evaluations_for_experiment_and_get_response method. Nesting functions like this can make the code harder to read, test, and reuse. Since these helpers don't rely on the outer method's local scope (other than logger), they would be better as static methods of the class or as module-level functions.
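
For example, the first helper could be lifted onto the class unchanged; the class name below is illustrative, while the body is copied from the diff.

class EvaluationService:  # illustrative name
    @staticmethod
    def _resolve_requested_task_type(config: dict | None) -> str:
        if not config:
            return "gen"

        for key in ["task_type", "eval_type", "evaluation_type"]:
            value = config.get(key)
            if isinstance(value, dict) and value.get("mode"):
                return str(value.get("mode")).lower()
            if isinstance(value, str):
                return value.lower()
        return "gen"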


return None, None

# Add datasets from each run (avoiding duplicates)

medium

The comment on this line, # Add datasets from each run (avoiding duplicates), is now misleading. The previous logic for avoiding duplicates has been removed, and the current implementation appends a dataset for every run. If evaluating each run is the intended behavior, please update the comment to reflect this by removing the (avoiding duplicates) part to prevent confusion for future developers.

Suggested change
-# Add datasets from each run (avoiding duplicates)
+# Add datasets from each run

Comment on lines +20 to +25
def _coerce_bool_flag(value, default: bool = False) -> bool:
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        return value.strip().lower() in {"1", "true", "yes", "y", "on"}
    return default

medium

The helper function _coerce_bool_flag is a general utility that is not specific to the transform method. Defining it inside the method limits its reusability and makes the transform method longer. It would be better to define it at the module level so it can be reused elsewhere.
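
A sketch of the module-level placement; the helper body is copied from the diff, and the usage checks below are illustrative.

# Module level: reusable by transform() and any other call site.
def _coerce_bool_flag(value, default: bool = False) -> bool:
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        return value.strip().lower() in {"1", "true", "yes", "y", "on"}
    return default

# Illustrative usage:
assert _coerce_bool_flag("Yes ") is True       # string forms are normalized
assert _coerce_bool_flag(None, default=True)   # non-bool, non-str falls back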

