Conversation

@emersodb (Collaborator)

PR Type

Feature

Short Description

Clickup Ticket(s): N/A

This PR adds a metric that measures classification and regression model performance differences between models trained on real data and models trained on synthetic data, across a list of target columns. A user can provide several target columns along with each column's type (numerical or categorical).

For numerical columns, regression models are trained to predict each column individually from the remaining provided columns. For categorical columns, classification models are trained and their F1 scores are measured.

NOTE: The columns are not predicted at the same time. A model is trained to predict each column INDIVIDUALLY, and the results are averaged over all of the target columns.
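
A minimal conceptual sketch of the per-column procedure is shown below. The model choices (random forests) and scores (R², macro F1) are stand-ins, and the function name is purely illustrative; the toolkit's actual regressors, classifiers, and metrics are governed by its configuration.

    # Conceptual sketch only: for each target column, fit one model on real data and one on
    # synthetic data, evaluate both on a shared holdout set, and report the score gap.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
    from sklearn.metrics import f1_score, r2_score


    def score_gap_for_column(
        real: pd.DataFrame,
        synthetic: pd.DataFrame,
        holdout: pd.DataFrame,
        target: str,
        is_categorical: bool,
    ) -> float:
        features = [column for column in real.columns if column != target]
        model_class = RandomForestClassifier if is_categorical else RandomForestRegressor
        real_model = model_class().fit(real[features], real[target])
        synthetic_model = model_class().fit(synthetic[features], synthetic[target])
        real_predictions = real_model.predict(holdout[features])
        synthetic_predictions = synthetic_model.predict(holdout[features])
        if is_categorical:
            real_score = f1_score(holdout[target], real_predictions, average="macro")
            synthetic_score = f1_score(holdout[target], synthetic_predictions, average="macro")
        else:
            real_score = r2_score(holdout[target], real_predictions)
            synthetic_score = r2_score(holdout[target], synthetic_predictions)
        # The reported metric is then the mean of these per-column gaps over all target columns.
        return real_score - synthetic_score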

Tests Added

Added unit tests to ensure that the multi-target metric works as expected.

@coderabbitai bot commented Nov 26, 2025

📝 Walkthrough


This PR modifies the evaluation quality metrics module across multiple components. Changes include:

  • A minor docstring clarification in MeanF1ScoreDifference.
  • A refactoring of MeanRegressionDifference to accept either a file path or a direct configuration object for regressors, along with runtime validation of label column dtypes.
  • Introduction of a new MultiTargetModelingDifference class that orchestrates regression and classification evaluations across multiple target columns with per-target configuration support.
  • Addition of test configuration and comprehensive unit tests for the new class.
  • Corresponding test updates to reflect parameter naming changes.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • MeanRegressionDifference parameter refactoring: Dual-mode configuration loading logic (file path vs. direct object) requires careful validation to ensure both pathways work correctly and maintain backward compatibility semantics
  • MultiTargetModelingDifference orchestration logic: The compute method aggregates metrics across multiple targets (regression and classification); verify the aggregation formulas, handling of mixed target types, and per-regressor averaging logic
  • Validation and dtype checking: New private helper for label column dtype validation in MeanRegressionDifference; verify it catches intended errors without false positives
  • Test numeric assertions: Comprehensive unit tests with precise expected values across multiple scenarios; cross-check that numeric outputs match algorithmic intent
  • Configuration file handling: Verify _get_regressors_specifications properly loads per-target overrides from JSON and falls back to defaults correctly

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 43.48%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title check — ✅ Passed: The title accurately summarizes the main change: adding a new MultiTargetModelingDifference metric class for evaluating model performance differences across multiple target columns.
  • Description check — ✅ Passed: The description follows the required template with PR Type, Short Description, and Tests Added sections, providing clear context about the new metric's functionality and behavior.

@coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (4)
tests/unit/evaluation/quality/test_multi_target_modeling_difference.py (1)

218-241: Consider adding cleanup and documenting expected exceptions.

The exception tests are good for validation coverage. Consider adding unset_all_random_seeds() at the end for consistency, even though these tests don't set seeds explicitly.

src/midst_toolkit/evaluation/quality/multi_target_modeling_difference.py (3)

149-166: Consider adding validation for required "regressors" key and explicit encoding.

The config loading assumes the JSON always contains a "regressors" key (used as fallback in line 121), but doesn't validate this. Also, specifying encoding is a best practice for file operations.

Apply this diff:

     def _get_regressors_specifications(self) -> dict[str, list[dict[str, Any]]]:
         """
         Load the configurations file into a JSON structure. This can take two forms. The first is a set of regressors
         that will be applied for every classification task. These are specified at the top level of the config under
         the key "regressors." However, if a special set of regressors is desired for a particular column, the
         configuration can also include a key matching the target column with the same structure to include special
         settings for that specific column.

         NOTE: The configuration must always include a default set of configurations under the "regressors" key

         Returns:
             A dictionary with each entry being a list containing individual regression model configurations, including
             their sets of hyper-parameters to explore. The default set of regressors is under the "regressors" key. If
             any custom regressors were specified for individual columns, these are keyed by the column name to which
             they are to be applied.
         """
-        with open(self.regressors_config_path, "r") as f:
-            return json.load(f)
+        with open(self.regressors_config_path, "r", encoding="utf-8") as f:
+            config = json.load(f)
+        assert "regressors" in config, (
+            f"Configuration file {self.regressors_config_path} must contain a 'regressors' key"
+        )
+        return config

116-122: Consider handling missing default "regressors" key more gracefully.

If a config file lacks the "regressors" key and the label column isn't explicitly configured, this will raise a KeyError. While _get_regressors_specifications validation (if added per previous comment) would catch this earlier, an explicit check here would provide a clearer error message specific to the label column context.
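
A hypothetical sketch of what a clearer fallback could look like (the function and variable names here are assumptions, not the actual implementation):

    def select_regressor_specifications(
        specifications: dict[str, list[dict]], label_column: str
    ) -> list[dict]:
        # Prefer a column-specific entry; otherwise fall back to the default "regressors" key,
        # and fail with a message that names the offending column.
        specs = specifications.get(label_column, specifications.get("regressors"))
        if specs is None:
            raise KeyError(
                f"No regressor configuration found for column '{label_column}' "
                "and no default 'regressors' key is present in the configuration."
            )
        return specs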


240-256: Assertion message could be more accurate.

The assertion at line 240 says "Regression analysis must have a holdout dataset", but this class handles both regression AND classification. Consider updating the message.

Apply this diff:

-        assert holdout_data is not None, "Regression analysis must have a holdout dataset"
+        assert holdout_data is not None, "Multi-target analysis must have a holdout dataset"
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 02e7610 and 7dbff46.

📒 Files selected for processing (6)
  • src/midst_toolkit/evaluation/quality/mean_f1_score_difference.py (1 hunks)
  • src/midst_toolkit/evaluation/quality/mean_regression_difference.py (6 hunks)
  • src/midst_toolkit/evaluation/quality/multi_target_modeling_difference.py (1 hunks)
  • tests/assets/regression_config_2.json (1 hunks)
  • tests/unit/evaluation/quality/test_mean_regression_difference.py (5 hunks)
  • tests/unit/evaluation/quality/test_multi_target_modeling_difference.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/midst_toolkit/evaluation/quality/multi_target_modeling_difference.py (4)
src/midst_toolkit/common/enumerations.py (1)
  • ColumnType (37-41)
src/midst_toolkit/evaluation/metrics_base.py (1)
  • SynthEvalMetric (33-103)
src/midst_toolkit/evaluation/quality/mean_f1_score_difference.py (2)
  • MeanF1ScoreDifference (88-213)
  • compute (143-213)
src/midst_toolkit/evaluation/quality/mean_regression_difference.py (2)
  • MeanRegressionDifference (67-636)
  • compute (537-636)
src/midst_toolkit/evaluation/quality/mean_regression_difference.py (1)
src/midst_toolkit/evaluation/quality/multi_target_modeling_difference.py (1)
  • compute (193-276)
🪛 Ruff (0.14.6)
tests/unit/evaluation/quality/test_multi_target_modeling_difference.py

  • Lines 18, 28, 40, 57, 59, 66, 68 — S311: Standard pseudo-random generators are not suitable for cryptographic purposes.

src/midst_toolkit/evaluation/quality/multi_target_modeling_difference.py

  • Lines 143, 191, 256 — TRY003: Avoid specifying long messages outside the exception class.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: integration-tests
  • GitHub Check: unit-tests
  • GitHub Check: build
  • GitHub Check: run-code-check
🔇 Additional comments (15)
src/midst_toolkit/evaluation/quality/mean_regression_difference.py (4)

75-77: LGTM! Clean API extension for dual-mode configuration.

The parameter type union Path | list[dict[str, Any]] enables both file-based and in-memory configuration, which is useful for programmatic usage and the new MultiTargetModelingDifference class.


155-158: LGTM! Dual-mode loading is implemented correctly.

The isinstance check properly distinguishes between Path and list configurations. When a Path is provided, it loads from JSON; otherwise, it returns the config directly.


531-535: Good extraction of validation logic into a helper.

This consolidates the dtype assertion that was previously repeated. The error message is informative.


587-589: LGTM! Validation applied consistently to all datasets.

Validating dtype for real, synthetic, and holdout data ensures early failure with a clear message if the label column isn't properly typed.
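
A hypothetical sketch of what such a dtype check might look like (the real helper's name and message may differ):

    import pandas as pd


    def validate_label_column_dtype(data: pd.DataFrame, label_column: str) -> None:
        # Regression targets must be numeric; fail early with a clear message otherwise.
        assert pd.api.types.is_numeric_dtype(data[label_column]), (
            f"Label column '{label_column}' must have a numeric dtype for regression, "
            f"but found dtype {data[label_column].dtype}."
        )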

tests/assets/regression_config_2.json (1)

12-17: LGTM! Per-target regressor configuration added.

The column_b key enables testing the custom per-target regressor feature in MultiTargetModelingDifference. The structure matches the expected format used in _get_regressors_specifications().
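
For orientation, the configuration shape described above is roughly the following, shown here as a Python dictionary. The field names inside each regressor entry are assumptions for illustration, not the asset's actual schema.

    # Default regressors live under "regressors"; a key matching a target column
    # (here column_b) overrides them for that column only.
    regressors_config = {
        "regressors": [
            {"model": "random_forest", "hyperparameters": {"n_estimators": [50, 100]}},
        ],
        "column_b": [
            {"model": "linear_regression", "hyperparameters": {}},
        ],
    }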

src/midst_toolkit/evaluation/quality/mean_f1_score_difference.py (1)

122-123: LGTM! Minor docstring wording improvement.

Documentation-only change for consistency.

tests/unit/evaluation/quality/test_mean_regression_difference.py (1)

64-64: LGTM! Parameter name updated to match new API.

Tests correctly updated to use regressors_config instead of regressors_config_path, maintaining compatibility with the Path-based configuration loading.

tests/unit/evaluation/quality/test_multi_target_modeling_difference.py (2)

74-101: Good test structure with seeded randomness and cleanup.

This test properly sets seeds, verifies holdout requirement via exception, and cleans up. Good coverage of single regression target behavior.


192-215: Good test for custom per-target regressor configuration.

This test verifies that column-specific regressor configs from the JSON file are correctly applied, and that default regressors are not used when a custom config exists.

src/midst_toolkit/evaluation/quality/multi_target_modeling_difference.py (6)

14-22: LGTM! Clean type alias and metric filter definition.

The ModelBasedMetric type alias improves readability, and METRIC_FILTER centralizes the list of averaged metric names for filtering.


99-106: Order of operations looks correct.

_get_regressors_specifications() is called before label_columns_and_type is assigned, but the method only reads self.regressors_config_path which is set at line 103. The order is safe.


168-191: LGTM! Validation logic is clear and well-documented.

The function properly validates column type consistency and filters out the label column from feature columns. The assertions provide clear error messages.


258-271: LGTM! Metric aggregation logic is correct.

The aggregation properly:

  1. Computes mean regression differences per metric across numerical targets
  2. Computes mean F1 difference across categorical targets
  3. Combines both for the joint metrics

The comprehension at lines 266-270 correctly combines regression and F1 differences.
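
As a worked illustration of that aggregation with made-up numbers (the actual output metric names, and the exact combination formula, may differ):

    import numpy as np

    # Per-target score differences (real-trained minus synthetic-trained), illustrative values.
    regression_differences = [0.05, 0.02]  # one per numerical target column
    f1_differences = [0.03, 0.01]          # one per categorical target column

    mean_regression_difference = float(np.mean(regression_differences))  # 0.035
    mean_f1_difference = float(np.mean(f1_differences))                  # 0.02
    # One plausible reading of the joint metric, per the PR description's note that results
    # are averaged over all target columns: mix both task types into a single mean.
    joint_difference = float(np.mean(regression_differences + f1_differences))  # 0.0275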


273-276: LGTM! Filtering logic works correctly.

Using startswith(METRIC_FILTER) with the tuple of metric prefixes correctly filters to only the averaged metrics when include_regressor_specific_averages is False.
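
A minimal illustration of that pattern (the metric names and METRIC_FILTER contents here are assumptions):

    # str.startswith accepts a tuple of prefixes, so one call filters to the averaged metrics.
    METRIC_FILTER = ("mean_r2_difference", "mean_f1_difference")
    results = {
        "mean_r2_difference": 0.04,
        "mean_f1_difference": 0.02,
        "random_forest_r2_difference": 0.05,  # regressor-specific average, dropped by the filter
    }
    averaged_only = {name: value for name, value in results.items() if name.startswith(METRIC_FILTER)}
    # averaged_only == {"mean_r2_difference": 0.04, "mean_f1_difference": 0.02}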


25-98: Well-documented class with comprehensive docstring.

The docstring clearly explains the purpose, behavior, and all parameters. The distinction between regression and classification targets is well-articulated.

@bzamanlooy (Collaborator) left a comment

LGTM. Just one comment about an additional test.

The only thing I think might be an improvement later down the road is some sort of parallelization, since with the current setup, running a regression for each column takes 1-3 hours depending on the data size.

@emersodb (Collaborator, Author)

LGTM. Just one comment about an additional test.

The only thing I think might be an improvement later down the road is some sort of parallelization, since with the current setup, running a regression for each column takes 1-3 hours depending on the data size.

That's a good point. In the implementation you've been working from, did you do any parallelization? If so, I can just borrow that. If not, it should certainly be doable, I just need to take some time to do it right.
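
For illustration, one way the per-column runs could later be parallelized (a sketch only, with a hypothetical evaluate_target stand-in; not part of this PR):

    # Sketch: fan the per-column evaluations out over processes and aggregate afterwards.
    from concurrent.futures import ProcessPoolExecutor


    def evaluate_target(target_column: str) -> dict[str, float]:
        # Stand-in for the existing per-column train/evaluate logic in compute().
        ...


    if __name__ == "__main__":
        target_columns = ["column_a", "column_b", "column_c"]
        with ProcessPoolExecutor() as executor:
            per_target_results = list(executor.map(evaluate_target, target_columns))
        # per_target_results would then be aggregated exactly as the serial path does today.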

…cal columns as suggested in PR review. Also removing stuff we don't need in the examples
@emersodb emersodb merged commit 414c704 into main Nov 28, 2025
6 checks passed
@emersodb emersodb deleted the dbe/compound_mle_evals branch November 28, 2025 18:44