Parallelizing the computation of the multi-target model difference metric #102
Conversation
assert pytest.approx(-0.034541865204139155, abs=1e-8) == score["avg_r2_difference"]
assert pytest.approx(0.9958604761438379, abs=1e-8) == score["avg_mean_squared_error_difference"]
assert pytest.approx(-0.03548026558881601, abs=1e-8) == score["avg_r2_difference"]
Some of these metrics change a bit based on the way I chose to address the issues with determinism in multiprocess systems.
        The set of computed regression or classification metrics (depending on the column type) that were computed.
    """
    label_column_type, metric, seed = column_type_metric_seed
    set_all_random_seeds(seed)
I don't love this, but multiprocessing and random seeds do not play nice together. Without this, I can't seem to get determinism, due to the way seeds are funnelled into the various processes.
If anyone has a better idea of how to do this, I'm very open to it 🙂
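For reference, the seed-funnelling pattern being described looks roughly like the sketch below. It is illustrative only: the worker name, the task tuple layout, and the exact `set_all_random_seeds` signature are assumptions rather than code from this PR.

```python
import random
from functools import partial
from multiprocessing import Pool

import pandas as pd

from midst_toolkit.common.random import set_all_random_seeds


def _single_target_difference(
    real_data: pd.DataFrame,
    synthetic_data: pd.DataFrame,
    column_type_metric_seed: tuple[str, str, int],
) -> float:
    # Hypothetical module-level worker (Pool needs a picklable, top-level callable).
    # Each task carries its own seed and re-seeds inside the worker process, because
    # RNG state set in the parent is not reliably inherited by pool workers.
    label_column, metric, seed = column_type_metric_seed
    set_all_random_seeds(seed)
    # ... fit models on real/synthetic data for `label_column` and compute `metric` ...
    return 0.0  # placeholder result


def run_tasks(
    real_data: pd.DataFrame,
    synthetic_data: pd.DataFrame,
    targets: list[tuple[str, str]],
    n_jobs: int,
    base_seed: int,
) -> list[float]:
    # Derive one deterministic seed per task from a single base seed, so results
    # are reproducible regardless of which worker picks up which task.
    rng = random.Random(base_seed)
    tasks = [(column, metric, rng.randint(0, 2**32 - 1)) for column, metric in targets]
    worker = partial(_single_target_difference, real_data, synthetic_data)
    if n_jobs == 1:
        return [worker(task) for task in tasks]  # skip Pool overhead entirely
    with Pool(n_jobs) as pool:
        return pool.map(worker, tasks)  # map preserves task ordering
```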
I don't know the solution, but I looked around a bit and one of the main suggestions is to use mocks, which I guess defeats the purpose of what you want to do. I also found this paper: Multtestlib: A Python package for performing unit tests using multiprocessing, which tries to address this, but I'm not sure if it is worth it for us. Perhaps @lotif has an idea.
I think it's worth us looking into the Multtestlib library at some point, but at first glance it seems more geared towards speeding up testing by distributing its compute? No doubt they must have some tooling to help with determinism, though, so we might want to integrate it if we keep running into this issue.
Mocks would definitely help to some extent, but I really do want to fully exercise the MP components. I'm worried mocking will shortcut some of this and make us think we have a thorough test when it's not quite doing everything.
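For example (hypothetical test code, not something in this PR), a mock along these lines would make `Pool.map` run serially in-process, so a test could pass without ever exercising the real worker-process path:

```python
from unittest import mock


def test_compute_with_mocked_pool():
    # Replace the pool's map with a plain serial map. Pickling of the worker and its
    # arguments, per-process seeding, and cross-process result ordering are never
    # actually exercised, which is exactly the concern with mocking here.
    fake_pool = mock.MagicMock()
    fake_pool.__enter__.return_value.map.side_effect = lambda fn, tasks: list(map(fn, tasks))
    with mock.patch(
        "midst_toolkit.evaluation.quality.multi_target_modeling_difference.Pool",
        return_value=fake_pool,
    ):
        ...  # call compute() here; the multiprocessing path is effectively bypassed
```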
📝 Walkthrough
This pull request adds multiprocessing support to the multi-target modeling difference evaluation module. A new runtime dependency (`multiprocess`) is added.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 2
🧹 Nitpick comments (2)
src/midst_toolkit/evaluation/quality/multi_target_modeling_difference.py (2)
84-84: Consider validating `n_jobs` to prevent unexpected behavior.
The `n_jobs` parameter lacks validation. If a user passes `n_jobs=0` or a negative value, `Pool(0)` or `Pool(-1)` may behave unexpectedly (negative values typically mean "use all CPUs" in some libraries like scikit-learn, but Python's `Pool` doesn't follow this convention).

  super().__init__(categorical_columns, numerical_columns, do_preprocess)
- self.n_jobs = n_jobs
+ if n_jobs < 1:
+     raise ValueError("n_jobs must be at least 1")
+ self.n_jobs = n_jobs

Also applies to: 151-152
303-321: Parallelization approach is sound, but be aware of serialization overhead.
The implementation correctly:
- Generates unique seeds per task for reproducibility
- Uses `pool.map`, which preserves result ordering
- Avoids Pool overhead when `n_jobs=1`

However, the DataFrames (`real_data`, `synthetic_data`, `holdout_data`) are serialized and sent to each worker. For very large datasets, this serialization overhead may reduce the benefits of parallelization. If performance becomes an issue with large datasets, consider shared memory approaches.

Also, since you've added `multiprocess>=0.70.18` as a dependency (which uses `dill` for better serialization), you may want to import from `multiprocess` instead of the standard library `multiprocessing` for consistency and to leverage `dill`'s enhanced pickling capabilities.

-from multiprocessing import Pool
+from multiprocess import Pool
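As a side note on why `dill` matters here, a minimal illustration (assumes the `dill` package is installed; not code from this PR): the standard library pickler rejects lambdas and other locally defined callables, which `dill`, and therefore `multiprocess.Pool`, can handle.

```python
import pickle

import dill

square = lambda x: x * x  # noqa: E731

# The standard library pickler cannot serialize a lambda...
try:
    pickle.dumps(square)
except (pickle.PicklingError, AttributeError) as err:
    print(f"stdlib pickle failed: {err}")

# ...while dill (the serializer used by `multiprocess`) can, which is why
# `multiprocess.Pool` accepts callables that `multiprocessing.Pool` would reject.
payload = dill.dumps(square)
print(dill.loads(payload)(3))  # prints 9
```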
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (3)
- `pyproject.toml` (1 hunks)
- `src/midst_toolkit/evaluation/quality/multi_target_modeling_difference.py` (5 hunks)
- `tests/unit/evaluation/quality/test_multi_target_modeling_difference.py` (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/midst_toolkit/evaluation/quality/multi_target_modeling_difference.py (2)
src/midst_toolkit/common/enumerations.py (1)
- `ColumnType` (37-41)
src/midst_toolkit/common/random.py (1)
- `set_all_random_seeds` (11-55)
tests/unit/evaluation/quality/test_multi_target_modeling_difference.py (4)
src/midst_toolkit/common/random.py (2)
- `set_all_random_seeds` (11-55)
- `unset_all_random_seeds` (58-67)
src/midst_toolkit/evaluation/quality/multi_target_modeling_difference.py (2)
- `MultiTargetModelingDifference` (71-349)
- `compute` (251-349)
src/midst_toolkit/common/enumerations.py (1)
- `ColumnType` (37-41)
src/midst_toolkit/evaluation/quality/mean_regression_difference.py (1)
- `compute` (537-636)
🪛 Ruff (0.14.6)
src/midst_toolkit/evaluation/quality/multi_target_modeling_difference.py
68-68: Avoid specifying long messages outside the exception class (TRY003)
305-305: Standard pseudo-random generators are not suitable for cryptographic purposes (S311)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: unit-tests
- GitHub Check: run-code-check
- GitHub Check: integration-tests
🔇 Additional comments (4)
src/midst_toolkit/evaluation/quality/multi_target_modeling_difference.py (3)
2-5: LGTM! The imports are appropriate for the parallelization feature. The `partial` function is well-suited for creating the callable with fixed dataframe arguments for the Pool.
29-68: Well-structured module-level function for multiprocessing compatibility. The function is correctly defined at module level (required for pickle serialization) and properly seeds randomness per-process. The defensive `.copy()` calls on DataFrames ensure isolation between parallel tasks.
Regarding the static analysis hint (S311): using `random.randbytes` for seeding ML computations is appropriate; it's not used for cryptographic purposes.
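For context on the S311 hint, a tiny example of the kind of non-cryptographic seed derivation being discussed (illustrative only; the actual seed handling in the module may differ):

```python
import random

# random.randbytes (Python 3.9+) provides non-cryptographic entropy, which is fine
# for seeding ML experiments; converting it to an int yields a usable RNG seed.
seed = int.from_bytes(random.randbytes(4), "little")
random.seed(seed)
```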
323-329: LGTM! The post-processing correctly separates F1 differences (categorical targets) from regression metrics (numerical targets) and accumulates them for averaging.
tests/unit/evaluation/quality/test_multi_target_modeling_difference.py (1)
95-100: Verify that updated expected values are correct after refactoring. The expected metric values have changed. This is likely due to the new per-label seeding approach in the refactored code. Please confirm these new values are the expected behavior and not a regression introduced by the parallelization changes.
bzamanlooy
left a comment
It looks good to me. If we don't want to further investigate testing for multiprocessing, the PR is good to go. I added some comments about what I found while looking around, but I'm not sure if they are good ideas.
…P and I can't reproduce locally
force-pushed from f92cbf4 to 6da7378
lotif
left a comment
A couple of things to make it better.
lotif
left a comment
Thanks for addressing the comments :)
Of course! This one was a bit weird with the randomness stuff. I'm surprised it's not a solved issue by now.
PR Type
Feature
Short Description
Clickup Ticket(s): https://app.clickup.com/t/868ghd33j
The multi-target model difference metric can be quite heavy to compute serially, especially when there is a lot of data and many target columns to iterate over. This PR adds optional CPU parallelization to the computation of the metric.
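A rough usage sketch, with the caveat that the constructor and `compute()` signatures shown here are inferred from the diff excerpts above and may not match the actual API exactly:

```python
import pandas as pd

from midst_toolkit.evaluation.quality.multi_target_modeling_difference import (
    MultiTargetModelingDifference,
)

# Placeholder frames standing in for the real, synthetic, and holdout datasets.
real_data = pd.DataFrame({"num_a": [0.1, 0.2, 0.3, 0.4], "cat_a": ["x", "y", "x", "y"]})
synthetic_data = real_data.copy()
holdout_data = real_data.copy()

metric = MultiTargetModelingDifference(
    categorical_columns=["cat_a"],
    numerical_columns=["num_a"],
    do_preprocess=True,
    n_jobs=4,  # new in this PR: number of worker processes; 1 keeps the serial path
)
score = metric.compute(real_data, synthetic_data, holdout_data)
print(score["avg_r2_difference"], score["avg_mean_squared_error_difference"])
```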
Tests Added
Added a test applying parallelization and ensuring that it produces the same result over multiple runs.
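The shape of such a test might look roughly like the sketch below (names, data, and fixtures are placeholders; the real test lives in `tests/unit/evaluation/quality/test_multi_target_modeling_difference.py`):

```python
import pandas as pd
import pytest

from midst_toolkit.common.random import set_all_random_seeds, unset_all_random_seeds
from midst_toolkit.evaluation.quality.multi_target_modeling_difference import (
    MultiTargetModelingDifference,
)


def test_parallel_compute_is_deterministic():
    # Placeholder data; the real test uses the module's existing fixtures.
    real_data = pd.DataFrame({"num_a": [0.1, 0.2, 0.3, 0.4], "cat_a": ["x", "y", "x", "y"]})
    synthetic_data = real_data.copy()
    holdout_data = real_data.copy()

    metric = MultiTargetModelingDifference(
        categorical_columns=["cat_a"], numerical_columns=["num_a"], do_preprocess=True, n_jobs=2
    )

    # Re-seeding before each call should yield identical scores across runs.
    set_all_random_seeds(42)
    first = metric.compute(real_data, synthetic_data, holdout_data)
    set_all_random_seeds(42)
    second = metric.compute(real_data, synthetic_data, holdout_data)
    unset_all_random_seeds()

    assert first == pytest.approx(second)
```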