Pull request overview
Removes the user-configurable metrics input from study configuration and runtime context, shifting metric selection toward target_metric and (in some modules) automatic derivation via Metrics.get_by_type().
Changes:
- Remove `metrics` from `OctoStudy`/subclasses and from `StudyContext`, and drop it from context construction.
- Update Octo/AutoGluon modules to derive metric sets via `Metrics.get_by_type(study_context.ml_type)` and adjust scoring calls accordingly.
- Update tests and examples to stop passing `metrics=...` and remove tests that validated metrics configuration.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `octopus/study/core.py` | Removes the metrics API (abstract property + attrs fields) and stops passing it into `StudyContext`. |
| `octopus/study/context.py` | Drops `metrics` from the frozen runtime context. |
| `octopus/modules/octo/core.py` | Changes Octo module result scoring to concatenate scores for all metrics derived from the ML type. |
| `octopus/modules/octo/bag.py` | Adds an optional `metric` parameter to `get_performance()` and threads it through `get_performance_df()`. |
| `octopus/modules/autogluon/core.py` | Switches metric list derivation to `Metrics.get_by_type(study_context.ml_type)`. |
| `tests/workflows/test_roc_octo_roc_workflow.py` | Removes `metrics=[...]` from the workflow execution test study instantiation. |
| `tests/workflows/test_octo_t2e.py` | Removes `metrics=[...]` from the time-to-event workflow execution test. |
| `tests/workflows/test_octo_regression.py` | Removes `metrics=[...]` from the regression workflow execution test. |
| `tests/workflows/test_octo_multiclass.py` | Removes the multiclass metrics test and stops passing `metrics=[...]` in other multiclass tests. |
| `tests/workflows/test_octo_classification.py` | Removes `metrics=[...]` from the classification workflow execution test. |
| `tests/workflows/test_ag_workflows.py` | Removes `metrics=[...]` from the AutoGluon workflow tests. |
| `tests/study/test_core.py` | Removes the parametrized test that asserted default/custom metrics behaviour. |
| `tests/predict/test_predict.py` | Removes `metrics=[...]` from the study creation helper. |
| `tests/manager/test_core.py` | Removes `metrics=[...]` from the `StudyContext` fixture. |
| `tests/infrastructure/test_fsspec.py` | Removes `metrics=[...]` from the fsspec-backed experiment test. |
| `examples/wf_roc_octo.py` | Removes `metrics=[...]` from example study construction. |
| `examples/wf_octo_autogluon.py` | Removes `metrics=[...]` from example study construction. |
| `examples/wf_multiclass_wine.py` | Removes `metrics=[...]` from example study construction. |
```python
all_metrics = Metrics.get_by_type(study_context.ml_type)
best_scores = pd.concat(
    [best_bag.get_performance_df(metric=m) for m in all_metrics],
    ignore_index=True,
)
```
OctoModuleTemplate.fit() now concatenates score DataFrames for all metrics of the ML type. Several downstream consumers (e.g. StudyLoader.build_performance_summary() / show_target_metric_performance) currently assume scores.parquet effectively contains a single metric and build keys that omit the metric column, which will silently overwrite values when multiple metrics are present. Either keep scores limited to study_context.target_metric, or update the downstream aggregation to include the metric name in its keys / grouping so all metrics are preserved.
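A minimal sketch of the silent-overwrite risk and the second fix option, assuming the summary aggregates a long-format scores frame with `metric`, `split`, and `score` columns (illustrative names, not the actual `StudyLoader` schema):

```python
import pandas as pd

# Illustrative multi-metric scores frame: two metrics share each split.
scores = pd.DataFrame(
    {
        "metric": ["auc", "auc", "logloss", "logloss"],
        "split": ["train", "dev", "train", "dev"],
        "score": [0.91, 0.88, 0.35, 0.42],
    }
)

# Keying on split alone keeps one row per split (last metric wins),
# silently dropping the other metric's values.
collapsed = scores.drop_duplicates(subset=["split"], keep="last")

# Including the metric name in the grouping keys preserves every metric.
preserved = scores.groupby(["metric", "split"], as_index=False)["score"].mean()

assert len(collapsed) == 2   # only the last-written metric survives
assert len(preserved) == 4   # all metric/split pairs are kept
```

The same principle applies to whatever keys `build_performance_summary()` actually builds: any key that omits the metric column collapses multi-metric scores.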
```python
all_metrics = Metrics.get_by_type(study_context.ml_type)
best_scores = pd.concat(
    [best_bag.get_performance_df(metric=m) for m in all_metrics],
    ignore_index=True,
)
```
Computing best_scores by calling best_bag.get_performance_df() once per metric likely recomputes predictions/performance repeatedly (each call goes through get_predictions() / get_performance_from_predictions()). If you want to persist multiple metrics, consider computing predictions once and evaluating multiple metrics from that cached structure (or extending the bag API to return a multi-metric scores DataFrame in one pass) to avoid an avoidable O(#metrics) overhead per outer split.
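One possible shape for the suggested single-pass evaluation, where predictions are computed once and each metric only re-scores the cached frame (the helper and metric functions here are hypothetical stand-ins, not the actual bag API):

```python
import pandas as pd
from typing import Callable

def multi_metric_performance_df(
    predictions: pd.DataFrame,
    metric_fns: dict[str, Callable[[pd.DataFrame], float]],
) -> pd.DataFrame:
    """Evaluate several metrics against one cached predictions frame.

    The caller computes predictions once; each metric function only
    re-scores it, avoiding the O(#metrics) prediction cost.
    """
    rows = [
        {"metric": name, "score": fn(predictions)}
        for name, fn in metric_fns.items()
    ]
    return pd.DataFrame(rows)

# Usage with toy predictions and two toy "metrics":
preds = pd.DataFrame({"y_true": [0, 1, 1], "y_pred": [0, 1, 0]})
metrics = {
    "accuracy": lambda df: (df.y_true == df.y_pred).mean(),
    "error_rate": lambda df: (df.y_true != df.y_pred).mean(),
}
result = multi_metric_performance_df(preds, metrics)
```

Extending `get_performance_df()` with an optional list of metrics would achieve the same effect without changing call sites that want a single metric.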
```python
def get_performance(self, metric: str | None = None):
    """Get performance using get_performance_from_predictions utility.

    This is a simpler alternative to get_performance() that:
    1. Gets predictions from bag.get_predictions()
    2. Calculates performance using get_performance_from_predictions()
    3. Restructures output to match expected format

    Args:
        metric: The metric to evaluate. Defaults to self.target_metric when None.

    Returns:
        dict: Dictionary with performance values in the same format as get_performance()
    """
```
The updated docstring/return description is self-referential: it says the returned dict matches the format of get_performance(), but this method is get_performance(). Clarify what shape the returned dict has (keys like train_avg, dev_avg, etc.) and/or what legacy method it is intended to be compatible with.
```diff
  # Test scores using Octopus metrics for comparison
  assert study_context.target_metric is not None, "target_metric should be set during fit()"
- all_metrics = list(dict.fromkeys([*study_context.metrics, study_context.target_metric]))
+ all_metrics = Metrics.get_by_type(study_context.ml_type)
```
Switching from the explicit [...study_context.metrics, study_context.target_metric] list to Metrics.get_by_type(study_context.ml_type) will evaluate (and persist) every registered metric for the ML type. This can materially increase runtime and the size/noise of scores.parquet/performance_results.json, and it also hits the same downstream aggregation issue where multiple metrics in scores.parquet are not preserved. Consider restricting this comparison set to {study_context.target_metric} (or a small curated set) unless/until the rest of the pipeline is updated to handle multi-metric score artifacts.
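A sketch of the suggested restriction, keeping the comparison set to the target metric plus an optionally curated extra list, with order-preserving deduplication (variable names stand in for the `study_context` attributes):

```python
# Stand-ins for study_context.target_metric and a small curated set;
# the real values come from the study context at fit() time.
target_metric = "auc"
curated_extras: list[str] = []

# dict.fromkeys deduplicates while preserving insertion order, so the
# target metric always comes first even if it reappears in the extras.
all_metrics = list(dict.fromkeys([target_metric, *curated_extras]))
assert all_metrics == ["auc"]
```

This mirrors the pre-change `dict.fromkeys` pattern while dropping the dependency on the removed `study_context.metrics` field.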