
Conversation

@emersodb (Collaborator)

PR Type

Feature

Short Description

Clickup Ticket(s): https://app.clickup.com/t/868ghd33j

The multi-target model difference metric can be quite heavy to compute in serial, especially when you have a lot of data and many target columns to iterate over. This PR adds optional CPU parallelization to the computation of the metric.
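Conceptually, the dispatch looks something like the sketch below. This is not the code from the PR: `evaluate_all_labels` and the body of `compute_for_single_label` are illustrative stand-ins; only the `n_jobs` switch, the use of `functools.partial` to bind the dataframes, and `pool.map` over per-label tasks reflect the actual approach.

```python
from functools import partial
from multiprocessing import Pool


def compute_for_single_label(task, real_data, synthetic_data, holdout_data):
    # Stand-in for the real module-level worker in the PR, which fits a model
    # for one target column and returns that column's difference metrics.
    label, seed = task
    return {label: seed}


def evaluate_all_labels(tasks, real_data, synthetic_data, holdout_data, n_jobs=1):
    # Bind the (potentially large) dataframes once; each task then only carries
    # the small per-label pieces (target column, metric type, seed, ...).
    worker = partial(
        compute_for_single_label,
        real_data=real_data,
        synthetic_data=synthetic_data,
        holdout_data=holdout_data,
    )
    if n_jobs == 1:
        # Serial path: no Pool is created, so there is no multiprocessing overhead.
        return [worker(task) for task in tasks]
    with Pool(n_jobs) as pool:
        # pool.map preserves input ordering, so results line up with the tasks.
        return pool.map(worker, tasks)


if __name__ == "__main__":
    tasks = [("column_a", 11), ("column_b", 22)]
    print(evaluate_all_labels(tasks, None, None, None, n_jobs=2))
```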

Tests Added

Added a test applying parallelization and ensuring that it produces the same result over multiple runs.


assert pytest.approx(-0.034541865204139155, abs=1e-8) == score["avg_r2_difference"]
assert pytest.approx(0.9958604761438379, abs=1e-8) == score["avg_mean_squared_error_difference"]
assert pytest.approx(-0.03548026558881601, abs=1e-8) == score["avg_r2_difference"]
@emersodb (Collaborator Author):

Some of these metrics change a bit because of the way I chose to address the determinism issues in multiprocess systems.

The set of regression or classification metrics (depending on the column type) that were computed.
"""
label_column_type, metric, seed = column_type_metric_seed
set_all_random_seeds(seed)
@emersodb (Collaborator Author):

I don't love this, but multiprocessing and random seeds do not play nice together. Without this, I can't seem to get determinism, because of the way seeds are funnelled into the various processes, etc.

If anyone has a better way to do this, I'm very open to it 🙂
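For context, the pattern being discussed is roughly the one below: the parent process draws one seed per target column, packs it into the task tuple, and the worker re-seeds everything before doing any work. This is a simplified, self-contained sketch; `set_all_random_seeds` here is a stand-in for the toolkit helper in `midst_toolkit.common.random`, and the PR itself draws the per-task seeds with `random.randbytes` rather than `randint`.

```python
import random

import numpy as np


def set_all_random_seeds(seed: int) -> None:
    # Stand-in for midst_toolkit.common.random.set_all_random_seeds.
    random.seed(seed)
    np.random.seed(seed)


def compute_for_single_label(column_type_metric_seed):
    # Each task carries its own seed, so a given label produces the same result
    # regardless of which worker process runs it or in what order.
    label_column_type, metric, seed = column_type_metric_seed
    set_all_random_seeds(seed)
    # ... fit the per-label model and compute its difference metric here ...
    return label_column_type, metric, np.random.rand()


# Parent process: seed once, then derive an independent seed for every task so
# that repeated runs of the whole evaluation are reproducible.
set_all_random_seeds(42)
tasks = [
    (column_type, metric, random.randint(0, 2**32 - 1))
    for column_type, metric in [("numerical", "r2"), ("categorical", "f1")]
]
print([compute_for_single_label(task) for task in tasks])
```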

@bzamanlooy (Collaborator):

I don't know the solution, but I looked around a bit, and one of the main suggestions is to use mocks, which I guess defeats the purpose of what you want to do. I also found this paper, Multtestlib: A Python package for performing unit tests using multiprocessing, which tries to address this, but I'm not sure it's worth it for us. Perhaps @lotif has an idea.

@emersodb (Collaborator Author):

I think it's worth us looking into the Multtestlib library at some point, but at first glance it seems more geared towards speeding up testing by distributing the compute? No doubt they must have some tooling to help with determinism, though, so it's possible we might want to integrate it if we keep running into this issue.

Mocks would definitely sort of help, but I really do want to fully exercise the MP components. So I'm worried mocking will shortcut some of this and make us think we have a thorough test when it's not quite doing everything.

@coderabbitai bot commented Nov 28, 2025

📝 Walkthrough

This pull request adds multiprocessing support to the multi-target modeling difference evaluation module. A new runtime dependency multiprocess>=0.70.18 is introduced to pyproject.toml. The MultiTargetModelingDifference class is enhanced with an n_jobs parameter enabling parallel evaluation of target columns via a new compute_for_single_label function. The implementation maintains sequential processing when n_jobs=1 and switches to parallel Pool-based execution when n_jobs>1, with per-label seeding for deterministic results. Tests are updated with revised metric values and a new test validates parallel execution behavior.
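The determinism check in the new test boils down to the pattern below. Everything here is a hypothetical stand-in (`run_metric` only mimics the shape of the returned score dictionary); the real test calls `MultiTargetModelingDifference.compute` multiple times with the same seeds and `n_jobs > 1` and compares the scores with `pytest.approx`.

```python
import random

import pytest


def run_metric(seed: int, n_jobs: int) -> dict:
    # Hypothetical stand-in for seeding and calling the metric's compute();
    # it only reproduces the shape of the score dictionary.
    rng = random.Random(seed)  # n_jobs is ignored in this stub
    return {
        "avg_r2_difference": rng.uniform(-0.1, 0.0),
        "avg_mean_squared_error_difference": rng.uniform(0.9, 1.1),
    }


def test_parallel_compute_is_deterministic():
    # Same seed and same n_jobs should give scores that agree to a tight
    # tolerance across runs, which is what the new test asserts for the metric.
    first = run_metric(seed=42, n_jobs=4)
    second = run_metric(seed=42, n_jobs=4)
    assert pytest.approx(first["avg_r2_difference"], abs=1e-8) == second["avg_r2_difference"]
    assert (
        pytest.approx(first["avg_mean_squared_error_difference"], abs=1e-8)
        == second["avg_mean_squared_error_difference"]
    )
```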

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Multiprocessing implementation: Verify proper Pool management, process lifecycle, and resource cleanup in compute_for_single_label and MultiTargetModelingDifference.evaluate
  • Seeding and determinism: Confirm per-label seed distribution enables reproducible results across parallel processes
  • Metric aggregation logic: Review changes to how categorical (f1_difference) and numerical metrics are separated and combined
  • Test value updates: Validate that new expected metric values are correct and not inadvertently masked by test updates
  • Parallel vs. sequential parity: Ensure behavioral equivalence between n_jobs=1 and n_jobs>1 execution paths

Pre-merge checks

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 42.86%, below the required threshold of 80.00%. | Run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately and concisely describes the main change: adding parallelization to the multi-target model difference metric computation. |
| Description check | ✅ Passed | The PR description follows the template, with PR Type, Short Description (including the ClickUp ticket), and Tests Added sections all properly completed. |


@coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
src/midst_toolkit/evaluation/quality/multi_target_modeling_difference.py (2)

84-84: Consider validating n_jobs to prevent unexpected behavior.

The n_jobs parameter lacks validation. If a user passes n_jobs=0 or a negative value, Pool(0) or Pool(-1) may behave unexpectedly (negative values typically mean "use all CPUs" in some libraries like scikit-learn, but Python's Pool doesn't follow this convention).

         super().__init__(categorical_columns, numerical_columns, do_preprocess)
 
-        self.n_jobs = n_jobs
+        if n_jobs < 1:
+            raise ValueError("n_jobs must be at least 1")
+        self.n_jobs = n_jobs

Also applies to: 151-152


303-321: Parallelization approach is sound, but be aware of serialization overhead.

The implementation correctly:

  • Generates unique seeds per task for reproducibility
  • Uses pool.map which preserves result ordering
  • Avoids Pool overhead when n_jobs=1

However, the DataFrames (real_data, synthetic_data, holdout_data) are serialized and sent to each worker. For very large datasets, this serialization overhead may reduce the benefits of parallelization. If performance becomes an issue with large datasets, consider shared memory approaches.

Also, since you've added multiprocess>=0.70.18 as a dependency (which uses dill for better serialization), you may want to import from multiprocess instead of the standard library multiprocessing for consistency and to leverage dill's enhanced pickling capabilities.

-from multiprocessing import Pool
+from multiprocess import Pool
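For reference, multiprocess is designed as a drop-in replacement for the standard library module, and because it serializes with dill it can also ship objects the stdlib Pool cannot, such as lambdas. A small illustrative example (not code from the PR):

```python
from multiprocess import Pool  # dill-based drop-in for multiprocessing

if __name__ == "__main__":
    with Pool(4) as pool:
        # A lambda fails to pickle with the stdlib multiprocessing.Pool,
        # but dill serializes it without issue here.
        squares = pool.map(lambda x: x * x, range(10))
    print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```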
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 414c704 and fc10d30.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • pyproject.toml (1 hunks)
  • src/midst_toolkit/evaluation/quality/multi_target_modeling_difference.py (5 hunks)
  • tests/unit/evaluation/quality/test_multi_target_modeling_difference.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/midst_toolkit/evaluation/quality/multi_target_modeling_difference.py (2)
src/midst_toolkit/common/enumerations.py (1)
  • ColumnType (37-41)
src/midst_toolkit/common/random.py (1)
  • set_all_random_seeds (11-55)
tests/unit/evaluation/quality/test_multi_target_modeling_difference.py (4)
src/midst_toolkit/common/random.py (2)
  • set_all_random_seeds (11-55)
  • unset_all_random_seeds (58-67)
src/midst_toolkit/evaluation/quality/multi_target_modeling_difference.py (2)
  • MultiTargetModelingDifference (71-349)
  • compute (251-349)
src/midst_toolkit/common/enumerations.py (1)
  • ColumnType (37-41)
src/midst_toolkit/evaluation/quality/mean_regression_difference.py (1)
  • compute (537-636)
🪛 Ruff (0.14.6)
src/midst_toolkit/evaluation/quality/multi_target_modeling_difference.py

68-68: Avoid specifying long messages outside the exception class

(TRY003)


305-305: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: unit-tests
  • GitHub Check: run-code-check
  • GitHub Check: integration-tests
🔇 Additional comments (4)
src/midst_toolkit/evaluation/quality/multi_target_modeling_difference.py (3)

2-5: LGTM!

The imports are appropriate for the parallelization feature. The partial function is well-suited for creating the callable with fixed dataframe arguments for the Pool.


29-68: Well-structured module-level function for multiprocessing compatibility.

The function is correctly defined at module level (required for pickle serialization) and properly seeds randomness per-process. The defensive .copy() calls on DataFrames ensure isolation between parallel tasks.

Regarding the static analysis hint (S311): Using random.randbytes for seeding ML computations is appropriate—it's not used for cryptographic purposes.


323-329: LGTM!

The post-processing correctly separates F1 differences (categorical targets) from regression metrics (numerical targets) and accumulates them for averaging.
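Conceptually the aggregation amounts to something like the sketch below. The per-label key names (`f1_difference`, `r2_difference`, `mean_squared_error_difference`) and `avg_f1_difference` are assumptions for illustration; `avg_r2_difference` and `avg_mean_squared_error_difference` match the keys asserted in the tests.

```python
def aggregate(per_label_results: list[dict]) -> dict:
    # Separate categorical (F1) results from numerical (regression) results,
    # then average each group across the target columns that produced it.
    score: dict = {}
    for averaged_key, per_label_key in [
        ("avg_f1_difference", "f1_difference"),
        ("avg_r2_difference", "r2_difference"),
        ("avg_mean_squared_error_difference", "mean_squared_error_difference"),
    ]:
        values = [r[per_label_key] for r in per_label_results if per_label_key in r]
        if values:
            score[averaged_key] = sum(values) / len(values)
    return score


# Example: two numerical targets and one categorical target.
print(
    aggregate(
        [
            {"r2_difference": -0.03, "mean_squared_error_difference": 1.0},
            {"r2_difference": -0.04, "mean_squared_error_difference": 0.99},
            {"f1_difference": 0.02},
        ]
    )
)
```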

tests/unit/evaluation/quality/test_multi_target_modeling_difference.py (1)

95-100: Verify that updated expected values are correct after refactoring.

The expected metric values have changed. This is likely due to the new per-label seeding approach in the refactored code. Please confirm these new values are the expected behavior and not a regression introduced by the parallelization changes.

@bzamanlooy (Collaborator) left a comment:

It looks good to me. If we don't want to further investigate testing for multiprocessing, the PR is good to go, but I've added some comments about what I found while looking around; I'm not sure whether they are good ideas.


@emersodb force-pushed the dbe/parallelize_multi_target_metric branch from f92cbf4 to 6da7378 on December 1, 2025 at 14:55
@lotif (Collaborator) left a comment:

A couple of things to make it better.

@lotif (Collaborator) left a comment:

Thanks for addressing the comments :)

@emersodb (Collaborator Author) commented Dec 2, 2025:

> Thanks for addressing the comments :)

Of course! This one was a bit weird with the randomness stuff. I'm surprised it's not a solved issue by now.

@emersodb merged commit 6eb18e9 into main on Dec 2, 2025. 6 checks passed.
@emersodb deleted the dbe/parallelize_multi_target_metric branch on December 2, 2025 at 22:11.