Implementation of Multi-Target Modeling Difference Metric #101
Conversation
📝 Walkthrough

This PR modifies the evaluation quality metrics module across multiple components. Changes include: a minor docstring clarification in MeanF1ScoreDifference; a refactoring of MeanRegressionDifference to accept either a file path or direct configuration object for regressors, along with runtime validation of label column dtypes; introduction of a new MultiTargetModelingDifference class that orchestrates regression and classification evaluations across multiple target columns with per-target configuration support; addition of test configuration and comprehensive unit tests for the new class; and corresponding test updates to reflect parameter naming changes.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 2
🧹 Nitpick comments (4)
tests/unit/evaluation/quality/test_multi_target_modeling_difference.py (1)
218-241: Consider adding cleanup and documenting expected exceptions.

The exception tests are good for validation coverage. Consider adding `unset_all_random_seeds()` at the end for consistency, even though these tests don't set seeds explicitly.

src/midst_toolkit/evaluation/quality/multi_target_modeling_difference.py (3)
149-166: Consider adding validation for the required "regressors" key and explicit encoding.

The config loading assumes the JSON always contains a "regressors" key (used as a fallback on line 121), but doesn't validate this. Also, specifying the encoding is a best practice for file operations.

Apply this diff:

```diff
 def _get_regressors_specifications(self) -> dict[str, list[dict[str, Any]]]:
     """
     Load the configurations file into a JSON structure. This can take two forms. The first is a set of
     regressors that will be applied for every classification task. These are specified at the top level of the
     config under the key "regressors." However, if a special set of regressors is desired for a particular
     column, the configuration can also include a key matching the target column with the same structure to
     include special settings for that specific column.

     NOTE: The configuration must always include a default set of configurations under the "regressors" key

     Returns:
         A dictionary with each entry being a list containing individual regression model configurations,
         including their sets of hyper-parameters to explore. The default set of regressors is under the
         "regressors" key. If any custom regressors were specified for individual columns, these are keyed by
         the column name to which they are to be applied.
     """
-    with open(self.regressors_config_path, "r") as f:
-        return json.load(f)
+    with open(self.regressors_config_path, "r", encoding="utf-8") as f:
+        config = json.load(f)
+        assert "regressors" in config, (
+            f"Configuration file {self.regressors_config_path} must contain a 'regressors' key"
+        )
+        return config
```
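For reference, the loaded configuration shape described in this docstring might look roughly like the sketch below; the inner fields of each regressor entry ("model", "hyperparameters") are illustrative assumptions, not taken from the repository.

```python
# Sketch of the loaded regressors configuration: "regressors" holds the required
# defaults, and any other key (here "column_b") overrides them for that target
# column. Field names inside each entry are illustrative only.
example_specifications = {
    "regressors": [
        {"model": "random_forest", "hyperparameters": {"n_estimators": [50, 100]}},
    ],
    "column_b": [
        {"model": "ridge", "hyperparameters": {"alpha": [0.1, 1.0]}},
    ],
}
```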
116-122: Consider handling a missing default "regressors" key more gracefully.

If a config file lacks the "regressors" key and the label column isn't explicitly configured, this will raise a `KeyError`. While `_get_regressors_specifications` validation (if added per the previous comment) would catch this earlier, an explicit check here would provide a clearer error message specific to the label column context.
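A minimal sketch of such an explicit check, assuming the specifications dictionary returned by `_get_regressors_specifications()`; the helper name and message wording are hypothetical.

```python
from typing import Any


def regressors_for_column(
    specifications: dict[str, list[dict[str, Any]]], label_column: str
) -> list[dict[str, Any]]:
    # Prefer a column-specific entry, fall back to the "regressors" default,
    # and fail with a message that names the label column being processed.
    if label_column in specifications:
        return specifications[label_column]
    if "regressors" not in specifications:
        raise KeyError(
            f"No regressor configuration found for column '{label_column}' "
            "and no default 'regressors' key is present in the configuration."
        )
    return specifications["regressors"]
```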
240-256: Assertion message could be more accurate.

The assertion at line 240 says "Regression analysis must have a holdout dataset", but this class handles both regression AND classification. Consider updating the message.

Apply this diff:

```diff
-assert holdout_data is not None, "Regression analysis must have a holdout dataset"
+assert holdout_data is not None, "Multi-target analysis must have a holdout dataset"
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- src/midst_toolkit/evaluation/quality/mean_f1_score_difference.py (1 hunks)
- src/midst_toolkit/evaluation/quality/mean_regression_difference.py (6 hunks)
- src/midst_toolkit/evaluation/quality/multi_target_modeling_difference.py (1 hunks)
- tests/assets/regression_config_2.json (1 hunks)
- tests/unit/evaluation/quality/test_mean_regression_difference.py (5 hunks)
- tests/unit/evaluation/quality/test_multi_target_modeling_difference.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/midst_toolkit/evaluation/quality/multi_target_modeling_difference.py (4)
src/midst_toolkit/common/enumerations.py (1)
- ColumnType (37-41)

src/midst_toolkit/evaluation/metrics_base.py (1)
- SynthEvalMetric (33-103)

src/midst_toolkit/evaluation/quality/mean_f1_score_difference.py (2)
- MeanF1ScoreDifference (88-213)
- compute (143-213)

src/midst_toolkit/evaluation/quality/mean_regression_difference.py (2)
- MeanRegressionDifference (67-636)
- compute (537-636)
src/midst_toolkit/evaluation/quality/mean_regression_difference.py (1)
src/midst_toolkit/evaluation/quality/multi_target_modeling_difference.py (1)
- compute (193-276)
🪛 Ruff (0.14.6)
tests/unit/evaluation/quality/test_multi_target_modeling_difference.py
18-18: Standard pseudo-random generators are not suitable for cryptographic purposes (S311)
28-28: Standard pseudo-random generators are not suitable for cryptographic purposes (S311)
40-40: Standard pseudo-random generators are not suitable for cryptographic purposes (S311)
57-57: Standard pseudo-random generators are not suitable for cryptographic purposes (S311)
59-59: Standard pseudo-random generators are not suitable for cryptographic purposes (S311)
66-66: Standard pseudo-random generators are not suitable for cryptographic purposes (S311)
68-68: Standard pseudo-random generators are not suitable for cryptographic purposes (S311)
src/midst_toolkit/evaluation/quality/multi_target_modeling_difference.py
143-143: Avoid specifying long messages outside the exception class (TRY003)
191-191: Avoid specifying long messages outside the exception class (TRY003)
256-256: Avoid specifying long messages outside the exception class (TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: integration-tests
- GitHub Check: unit-tests
- GitHub Check: build
- GitHub Check: run-code-check
🔇 Additional comments (15)
src/midst_toolkit/evaluation/quality/mean_regression_difference.py (4)
75-77: LGTM! Clean API extension for dual-mode configuration.

The parameter type union `Path | list[dict[str, Any]]` enables both file-based and in-memory configuration, which is useful for programmatic usage and the new `MultiTargetModelingDifference` class.
155-158: LGTM! Dual-mode loading is implemented correctly.

The `isinstance` check properly distinguishes between Path and list configurations. When a Path is provided, it loads from JSON; otherwise, it returns the config directly.
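For readers skimming the review, the dual-mode behavior described here amounts to roughly the following standalone sketch; the function name and return handling are assumptions, not the toolkit's actual method.

```python
import json
from pathlib import Path
from typing import Any


def load_regressors_config(regressors_config: Path | list[dict[str, Any]]) -> list[dict[str, Any]]:
    # A Path is read and parsed as JSON; an in-memory list of regressor
    # configurations is returned unchanged.
    if isinstance(regressors_config, Path):
        with open(regressors_config, "r", encoding="utf-8") as f:
            return json.load(f)
    return regressors_config
```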
531-535: Good extraction of validation logic into a helper.

This consolidates the dtype assertion that was previously repeated. The error message is informative.
587-589: LGTM! Validation applied consistently to all datasets.

Validating dtype for real, synthetic, and holdout data ensures early failure with a clear message if the label column isn't properly typed.
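As a point of reference, a dtype guard of this kind might look roughly like the sketch below; the helper name and exact message are assumptions rather than the toolkit's real implementation.

```python
import pandas as pd


def check_label_column_dtype(data: pd.DataFrame, label_column: str) -> None:
    # Fail early with a clear message if the regression target is not numeric.
    assert pd.api.types.is_numeric_dtype(data[label_column]), (
        f"Label column '{label_column}' must be numeric for regression, "
        f"but has dtype {data[label_column].dtype}."
    )
```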
tests/assets/regression_config_2.json (1)
12-17: LGTM! Per-target regressor configuration added.

The `column_b` key enables testing the custom per-target regressor feature in `MultiTargetModelingDifference`. The structure matches the expected format used in `_get_regressors_specifications()`.

src/midst_toolkit/evaluation/quality/mean_f1_score_difference.py (1)
122-123: LGTM! Minor docstring wording improvement.

Documentation-only change for consistency.
tests/unit/evaluation/quality/test_mean_regression_difference.py (1)
64-64: LGTM! Parameter name updated to match new API.

Tests correctly updated to use `regressors_config` instead of `regressors_config_path`, maintaining compatibility with the Path-based configuration loading.

tests/unit/evaluation/quality/test_multi_target_modeling_difference.py (2)
74-101: Good test structure with seeded randomness and cleanup.

This test properly sets seeds, verifies the holdout requirement via an exception, and cleans up. Good coverage of single regression target behavior.
192-215: Good test for custom per-target regressor configuration.

This test verifies that column-specific regressor configs from the JSON file are correctly applied, and that default regressors are not used when a custom config exists.
src/midst_toolkit/evaluation/quality/multi_target_modeling_difference.py (6)
14-22: LGTM! Clean type alias and metric filter definition.

The `ModelBasedMetric` type alias improves readability, and `METRIC_FILTER` centralizes the list of averaged metric names for filtering.
99-106: Order of operations looks correct.

`_get_regressors_specifications()` is called before `label_columns_and_type` is assigned, but the method only reads `self.regressors_config_path`, which is set at line 103. The order is safe.
168-191: LGTM! Validation logic is clear and well-documented.

The function properly validates column type consistency and filters out the label column from the feature columns. The assertions provide clear error messages.
258-271: LGTM! Metric aggregation logic is correct.

The aggregation properly:
- Computes mean regression differences per metric across numerical targets
- Computes mean F1 difference across categorical targets
- Combines both for the joint metrics
The comprehension at lines 266-270 correctly combines regression and F1 differences.
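As a toy illustration of this aggregation pattern (the metric names and numbers below are placeholders, not the toolkit's actual keys or results):

```python
from statistics import mean

# Per-target regression differences keyed by metric name, plus one F1
# difference per categorical target. All values are made up.
regression_differences = {"r2_difference": [0.02, 0.05], "mae_difference": [0.10, 0.30]}
f1_differences = [0.04, 0.01]

# Average each regression metric across numerical targets, then add the mean
# F1 difference across categorical targets to form the joint result.
aggregated = {metric: mean(values) for metric, values in regression_differences.items()}
aggregated["f1_score_difference"] = mean(f1_differences)
print(aggregated)
```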
273-276: LGTM! Filtering logic works correctly.

Using `startswith(METRIC_FILTER)` with the tuple of metric prefixes correctly filters to only the averaged metrics when `include_regressor_specific_averages` is False.
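A small standalone illustration of the tuple-prefix filtering pattern; the prefixes and result keys here are placeholders, not the real `METRIC_FILTER` contents.

```python
# str.startswith accepts a tuple of prefixes, so a single comprehension can
# keep only the averaged metrics and drop regressor-specific entries.
METRIC_PREFIXES = ("mean_", "median_")
results = {"mean_r2_difference": 0.03, "ridge_r2_difference": 0.07}
averaged_only = {name: value for name, value in results.items() if name.startswith(METRIC_PREFIXES)}
print(averaged_only)  # {'mean_r2_difference': 0.03}
```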
25-98: Well-documented class with comprehensive docstring.

The docstring clearly explains the purpose, behavior, and all parameters. The distinction between regression and classification targets is well-articulated.
bzamanlooy left a comment
LGTM. Just one comment about an additional test.
The only thing I think would be an improvement later down the road is some sort of parallelization, since with the current setup, running a regression for each column takes 1-3 hours depending on the data size.
That's a good point. In the implementation that you've been working from, did you do any parallelization? If so, I can just borrow that. If not, it should certainly be doable, it just needs some time to do it right.
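One possible direction for that future work, sketched here under the assumption of a hypothetical `evaluate_single_target(column)` callable that trains and scores the models for one target column; this is not part of the current implementation.

```python
from concurrent.futures import ProcessPoolExecutor


def evaluate_all_targets(target_columns, evaluate_single_target, max_workers=4):
    # Run the per-target evaluations in separate processes and collect the
    # results keyed by target column name.
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        per_target_results = list(executor.map(evaluate_single_target, target_columns))
    return dict(zip(target_columns, per_target_results))
```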
…cal columns as suggested in PR review. Also removing stuff we don't need in the examples
PR Type
Feature
Short Description
Clickup Ticket(s): N/A
This PR adds a metric that allows one to measure classification and regression model performance differences for models trained on real and synthetic data across a list of target columns. A user can provide several target columns and their column type (numerical or categorical).
For numerical columns, regression models are trained to individually predict these columns from the remaining columns provided. For categorical columns, classification models are trained and their F1 scores measured.
NOTE: The columns are not predicted at the same time. A model is trained to predict each column INDIVIDUALLY and the results are averaged over all of the target columns.
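A rough usage sketch is shown below, using the names visible in the review above (`label_columns_and_type`, `regressors_config_path`, `compute`); the exact constructor signature, the `ColumnType` member names, and the `compute()` argument order are assumptions.

```python
from pathlib import Path

from midst_toolkit.common.enumerations import ColumnType
from midst_toolkit.evaluation.quality.multi_target_modeling_difference import (
    MultiTargetModelingDifference,
)

# real_data, synthetic_data, and holdout_data are assumed to be pandas
# DataFrames prepared elsewhere; a holdout set is required by this metric.
# ColumnType member names (NUMERICAL, CATEGORICAL) are assumed here.
metric = MultiTargetModelingDifference(
    label_columns_and_type={"column_a": ColumnType.NUMERICAL, "column_b": ColumnType.CATEGORICAL},
    regressors_config_path=Path("tests/assets/regression_config_2.json"),
)
results = metric.compute(real_data, synthetic_data, holdout_data)
```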
Tests Added
Added unit tests to ensure that the multi-target metric works as expected.