remove metrics input #348

Draft
nihaase wants to merge 1 commit into main from feat/remove-metrics-input

Conversation

nihaase (Collaborator) commented Mar 10, 2026

No description provided.

Copilot AI left a comment
Pull request overview

Removes the user-configurable metrics input from study configuration and runtime context, shifting metric selection toward target_metric and (in some modules) automatic derivation via Metrics.get_by_type().

Changes:

  • Remove metrics from OctoStudy/subclasses and from StudyContext, and drop it from context construction.
  • Update Octo/AutoGluon modules to derive metric sets via Metrics.get_by_type(study_context.ml_type) and adjust scoring calls accordingly.
  • Update tests and examples to stop passing metrics=... and remove tests that validated metrics configuration.
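`Metrics.get_by_type()` is referenced throughout this PR but its implementation is not shown in the diff. As a minimal sketch of how such a type-keyed metric registry could look (the class body, ML-type keys, and metric lists below are all illustrative assumptions, not the actual Octopus implementation):

```python
class Metrics:
    """Hypothetical registry mapping an ML type to its registered metric names."""

    # Illustrative registry contents; the real metric sets live in the octopus package.
    _registry: dict[str, list[str]] = {
        "binary": ["roc_auc", "log_loss", "f1"],
        "regression": ["rmse", "mae", "r2"],
    }

    @classmethod
    def get_by_type(cls, ml_type: str) -> list[str]:
        """Return a copy of every metric registered for the given ML type."""
        return list(cls._registry.get(ml_type, []))
```

Under this shape, modules no longer need a user-supplied `metrics` list: they look up the full set for `study_context.ml_type` at runtime.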

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `octopus/study/core.py` | Removes the `metrics` API (abstract property plus attrs fields) and stops passing it into `StudyContext`. |
| `octopus/study/context.py` | Drops `metrics` from the frozen runtime context. |
| `octopus/modules/octo/core.py` | Changes Octo module result scoring to concatenate scores for all metrics derived from the ML type. |
| `octopus/modules/octo/bag.py` | Adds an optional `metric` parameter to `get_performance()` and threads it through `get_performance_df()`. |
| `octopus/modules/autogluon/core.py` | Switches metric list derivation to `Metrics.get_by_type(study_context.ml_type)`. |
| `tests/workflows/test_roc_octo_roc_workflow.py` | Removes `metrics=[...]` from the workflow execution test study instantiation. |
| `tests/workflows/test_octo_t2e.py` | Removes `metrics=[...]` from the time-to-event workflow execution test. |
| `tests/workflows/test_octo_regression.py` | Removes `metrics=[...]` from the regression workflow execution test. |
| `tests/workflows/test_octo_multiclass.py` | Removes the multiclass metrics test and stops passing `metrics=[...]` in other multiclass tests. |
| `tests/workflows/test_octo_classification.py` | Removes `metrics=[...]` from the classification workflow execution test. |
| `tests/workflows/test_ag_workflows.py` | Removes `metrics=[...]` from the AutoGluon workflow tests. |
| `tests/study/test_core.py` | Removes the parametrized test that asserted default/custom metrics behaviour. |
| `tests/predict/test_predict.py` | Removes `metrics=[...]` from the study creation helper. |
| `tests/manager/test_core.py` | Removes `metrics=[...]` from the `StudyContext` fixture. |
| `tests/infrastructure/test_fsspec.py` | Removes `metrics=[...]` from the fsspec-backed experiment test. |
| `examples/wf_roc_octo.py` | Removes `metrics=[...]` from the example study construction. |
| `examples/wf_octo_autogluon.py` | Removes `metrics=[...]` from the example study construction. |
| `examples/wf_multiclass_wine.py` | Removes `metrics=[...]` from the example study construction. |


Comment on lines +94 to +98

```python
all_metrics = Metrics.get_by_type(study_context.ml_type)
best_scores = pd.concat(
    [best_bag.get_performance_df(metric=m) for m in all_metrics],
    ignore_index=True,
)
```

Copilot AI commented Mar 10, 2026
OctoModuleTemplate.fit() now concatenates score DataFrames for all metrics of the ML type. Several downstream consumers (e.g. StudyLoader.build_performance_summary() / show_target_metric_performance) currently assume scores.parquet effectively contains a single metric and build keys that omit the metric column, which will silently overwrite values when multiple metrics are present. Either keep scores limited to study_context.target_metric, or update the downstream aggregation to include the metric name in its keys / grouping so all metrics are preserved.
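The silent-overwrite hazard described here can be shown with a minimal pandas sketch (the column names `model`, `metric`, and `value` are illustrative, not the actual `scores.parquet` schema):

```python
import pandas as pd

# Illustrative multi-metric scores frame, mimicking the concatenated output.
scores = pd.DataFrame({
    "model": ["m1", "m1", "m2", "m2"],
    "metric": ["auc", "logloss", "auc", "logloss"],
    "value": [0.91, 0.35, 0.89, 0.33],
})

# Aggregation keyed only by model: rows for different metrics collapse under
# the same key, so whichever metric comes last silently wins.
collapsed = scores.drop_duplicates(subset=["model"], keep="last").set_index("model")["value"]

# Including the metric name in the key preserves every metric.
preserved = scores.set_index(["model", "metric"])["value"]
```

With the metric omitted, `collapsed` holds one value per model (the logloss rows here), while `preserved` retains all four scores: exactly the grouping change the comment suggests for the downstream aggregation.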

Comment on lines +94 to +98

```python
all_metrics = Metrics.get_by_type(study_context.ml_type)
best_scores = pd.concat(
    [best_bag.get_performance_df(metric=m) for m in all_metrics],
    ignore_index=True,
)
```

Copilot AI commented Mar 10, 2026
Computing best_scores by calling best_bag.get_performance_df() once per metric likely recomputes predictions/performance repeatedly (each call goes through get_predictions() / get_performance_from_predictions()). If you want to persist multiple metrics, consider computing predictions once and evaluating multiple metrics from that cached structure (or extending the bag API to return a multi-metric scores DataFrame in one pass) to avoid an avoidable O(#metrics) overhead per outer split.
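The compute-once pattern suggested here can be sketched as follows (the function name `score_all_metrics`, the prediction-frame columns, and the metric callables are all hypothetical stand-ins for the bag API, not Octopus code):

```python
import pandas as pd

def score_all_metrics(predictions: pd.DataFrame, metric_fns: dict) -> pd.DataFrame:
    """Evaluate every metric against the same cached predictions frame.

    Predictions are computed once by the caller; each metric function only
    reads the cached columns, avoiding one prediction pass per metric.
    """
    rows = [
        {"metric": name, "value": fn(predictions["y_true"], predictions["y_pred"])}
        for name, fn in metric_fns.items()
    ]
    return pd.DataFrame(rows)

# One (illustrative) prediction pass, then several metrics from the same frame.
preds = pd.DataFrame({"y_true": [0, 1, 1, 0], "y_pred": [0, 1, 0, 0]})
metric_fns = {
    "accuracy": lambda t, p: (t == p).mean(),
    "error_rate": lambda t, p: (t != p).mean(),
}
scores = score_all_metrics(preds, metric_fns)
```

This turns the O(#metrics) prediction overhead into a single pass plus cheap per-metric evaluations, which is the shape the comment proposes for a multi-metric `get_performance_df()`.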

Comment on lines +390 to 398

```python
def get_performance(self, metric: str | None = None):
    """Get performance using get_performance_from_predictions utility.

    This is a simpler alternative to get_performance() that:
    1. Gets predictions from bag.get_predictions()
    2. Calculates performance using get_performance_from_predictions()
    3. Restructures output to match expected format
    Args:
        metric: The metric to evaluate. Defaults to self.target_metric when None.

    Returns:
        dict: Dictionary with performance values in the same format as get_performance()
    """
```

Copilot AI commented Mar 10, 2026
The updated docstring/return description is self-referential: it says the returned dict matches the format of get_performance(), but this method is get_performance(). Clarify what shape the returned dict has (keys like train_avg, dev_avg, etc.) and/or what legacy method it is intended to be compatible with.

Comment on lines 356 to +358

```diff
 # Test scores using Octopus metrics for comparison
 assert study_context.target_metric is not None, "target_metric should be set during fit()"
-all_metrics = list(dict.fromkeys([*study_context.metrics, study_context.target_metric]))
+all_metrics = Metrics.get_by_type(study_context.ml_type)
```

Copilot AI commented Mar 10, 2026
Switching from the explicit [...study_context.metrics, study_context.target_metric] list to Metrics.get_by_type(study_context.ml_type) will evaluate (and persist) every registered metric for the ML type. This can materially increase runtime and the size/noise of scores.parquet/performance_results.json, and it also hits the same downstream aggregation issue where multiple metrics in scores.parquet are not preserved. Consider restricting this comparison set to {study_context.target_metric} (or a small curated set) unless/until the rest of the pipeline is updated to handle multi-metric score artifacts.
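The suggested restriction could look like the sketch below, reusing the ordered-deduplication idiom from the removed line (the helper name `comparison_metrics` is hypothetical; only `dict.fromkeys` deduplication appears in the PR):

```python
def comparison_metrics(target_metric: str, extra_metrics: tuple[str, ...] = ()) -> list[str]:
    """Build a small comparison set: the target metric plus any curated extras.

    dict.fromkeys preserves insertion order while dropping duplicates, so the
    target metric always comes first and is never repeated.
    """
    return list(dict.fromkeys([target_metric, *extra_metrics]))
```

Unlike `Metrics.get_by_type(...)`, this keeps `scores.parquet` limited to a deliberate set until the downstream aggregation handles multi-metric artifacts.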



Development

Successfully merging this pull request may close these issues.

Distinguish between target_metric and result_metrics

2 participants