
Conversation

@emersodb (Collaborator)

PR Type

Feature

Short Description

Clickup Ticket(s): https://app.clickup.com/t/868ghd33j

The multi-target model difference metric can be quite heavy to compute in serial, especially when you have a lot of data and many target columns to iterate over. This PR adds optional CPU parallelization to the computation of the metric.
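Conceptually, the dispatch looks something like the sketch below. This is not the code from the PR: `evaluate_all_labels` and the body of `compute_for_single_label` are illustrative stand-ins; only the `n_jobs` switch, the use of `functools.partial` to bind the dataframes, and `pool.map` over per-label tasks reflect the actual approach.

```python
from functools import partial
from multiprocessing import Pool


def compute_for_single_label(task, real_data, synthetic_data, holdout_data):
    # Stand-in for the real module-level worker in the PR, which fits a model
    # for one target column and returns that column's difference metrics.
    label, seed = task
    return {label: seed}


def evaluate_all_labels(tasks, real_data, synthetic_data, holdout_data, n_jobs=1):
    # Bind the (potentially large) dataframes once; each task then only carries
    # the small per-label pieces (target column, metric type, seed, ...).
    worker = partial(
        compute_for_single_label,
        real_data=real_data,
        synthetic_data=synthetic_data,
        holdout_data=holdout_data,
    )
    if n_jobs == 1:
        # Serial path: no Pool is created, so there is no multiprocessing overhead.
        return [worker(task) for task in tasks]
    with Pool(n_jobs) as pool:
        # pool.map preserves input ordering, so results line up with the tasks.
        return pool.map(worker, tasks)


if __name__ == "__main__":
    tasks = [("column_a", 11), ("column_b", 22)]
    print(evaluate_all_labels(tasks, None, None, None, n_jobs=2))
```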

Tests Added

Added a test applying parallelization and ensuring that it produces the same result over multiple runs.


assert pytest.approx(-0.034541865204139155, abs=1e-8) == score["avg_r2_difference"]
assert pytest.approx(0.9958604761438379, abs=1e-8) == score["avg_mean_squared_error_difference"]
assert pytest.approx(-0.03548026558881601, abs=1e-8) == score["avg_r2_difference"]
@emersodb (Collaborator Author):

Some of these metrics change a bit because of the way I chose to address the determinism issues in multiprocess systems.

The set of regression or classification metrics (depending on the column type) that were computed.
"""
label_column_type, metric, seed = column_type_metric_seed
set_all_random_seeds(seed)
@emersodb (Collaborator Author):

I don't love this, but multiprocessing and random seeds do not play nice together. Without this, I can't seem to get determinism, because of the way seeds are funnelled into the various processes, etc.

If anyone has a better way to do this, I'm very open to it 🙂
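For context, the pattern being discussed is roughly the one below: the parent process draws one seed per target column, packs it into the task tuple, and the worker re-seeds everything before doing any work. This is a simplified, self-contained sketch; `set_all_random_seeds` here is a stand-in for the toolkit helper in `midst_toolkit.common.random`, and the PR itself draws the per-task seeds with `random.randbytes` rather than `randint`.

```python
import random

import numpy as np


def set_all_random_seeds(seed: int) -> None:
    # Stand-in for midst_toolkit.common.random.set_all_random_seeds.
    random.seed(seed)
    np.random.seed(seed)


def compute_for_single_label(column_type_metric_seed):
    # Each task carries its own seed, so a given label produces the same result
    # regardless of which worker process runs it or in what order.
    label_column_type, metric, seed = column_type_metric_seed
    set_all_random_seeds(seed)
    # ... fit the per-label model and compute its difference metric here ...
    return label_column_type, metric, np.random.rand()


# Parent process: seed once, then derive an independent seed for every task so
# that repeated runs of the whole evaluation are reproducible.
set_all_random_seeds(42)
tasks = [
    (column_type, metric, random.randint(0, 2**32 - 1))
    for column_type, metric in [("numerical", "r2"), ("categorical", "f1")]
]
print([compute_for_single_label(task) for task in tasks])
```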

@bzamanlooy (Collaborator):

I don't know the solution, but I looked around a bit, and one of the main suggestions is to use mocks, which I guess defeats the purpose of what you want to do. I also found this paper, Multtestlib: A Python package for performing unit tests using multiprocessing, which tries to address this, but I'm not sure it's worth it for us. Perhaps @lotif has an idea.

@emersodb (Collaborator Author):

I think it's worth us looking into the Multtestlib library at some point, but at first glance it seems more geared towards speeding up testing by distributing the compute? No doubt they must have some tooling to help with determinism, though, so it's possible we might want to integrate it if we keep running into this issue.

Mocks would definitely sort of help, but I really do want to fully exercise the MP components. So I'm worried mocking will shortcut some of this and make us think we have a thorough test when it's not quite doing everything.

@coderabbitai bot commented Nov 28, 2025

📝 Walkthrough

This pull request adds multiprocessing support to the multi-target modeling difference evaluation module. A new runtime dependency multiprocess>=0.70.18 is introduced to pyproject.toml. The MultiTargetModelingDifference class is enhanced with an n_jobs parameter enabling parallel evaluation of target columns via a new compute_for_single_label function. The implementation maintains sequential processing when n_jobs=1 and switches to parallel Pool-based execution when n_jobs>1, with per-label seeding for deterministic results. Tests are updated with revised metric values and a new test validates parallel execution behavior.
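The determinism check in the new test boils down to the pattern below. Everything here is a hypothetical stand-in (`run_metric` only mimics the shape of the returned score dictionary); the real test calls `MultiTargetModelingDifference.compute` multiple times with the same seeds and `n_jobs > 1` and compares the scores with `pytest.approx`.

```python
import random

import pytest


def run_metric(seed: int, n_jobs: int) -> dict:
    # Hypothetical stand-in for seeding and calling the metric's compute();
    # it only reproduces the shape of the score dictionary.
    rng = random.Random(seed)  # n_jobs is ignored in this stub
    return {
        "avg_r2_difference": rng.uniform(-0.1, 0.0),
        "avg_mean_squared_error_difference": rng.uniform(0.9, 1.1),
    }


def test_parallel_compute_is_deterministic():
    # Same seed and same n_jobs should give scores that agree to a tight
    # tolerance across runs, which is what the new test asserts for the metric.
    first = run_metric(seed=42, n_jobs=4)
    second = run_metric(seed=42, n_jobs=4)
    assert pytest.approx(first["avg_r2_difference"], abs=1e-8) == second["avg_r2_difference"]
    assert (
        pytest.approx(first["avg_mean_squared_error_difference"], abs=1e-8)
        == second["avg_mean_squared_error_difference"]
    )
```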

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Multiprocessing implementation: Verify proper Pool management, process lifecycle, and resource cleanup in compute_for_single_label and MultiTargetModelingDifference.evaluate
  • Seeding and determinism: Confirm per-label seed distribution enables reproducible results across parallel processes
  • Metric aggregation logic: Review changes to how categorical (f1_difference) and numerical metrics are separated and combined
  • Test value updates: Validate that new expected metric values are correct and not inadvertently masked by test updates
  • Parallel vs. sequential parity: Ensure behavioral equivalence between n_jobs=1 and n_jobs>1 execution paths

Pre-merge checks

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 42.86%, below the required threshold of 80.00%. | Run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately and concisely describes the main change: adding parallelization to the multi-target model difference metric computation. |
| Description check | ✅ Passed | The PR description follows the template, with PR Type, Short Description (including the ClickUp ticket), and Tests Added sections all properly completed. |


@coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
src/midst_toolkit/evaluation/quality/multi_target_modeling_difference.py (2)

84-84: Consider validating n_jobs to prevent unexpected behavior.

The n_jobs parameter lacks validation. If a user passes n_jobs=0 or a negative value, Pool(0) or Pool(-1) may behave unexpectedly (negative values typically mean "use all CPUs" in some libraries like scikit-learn, but Python's Pool doesn't follow this convention).

         super().__init__(categorical_columns, numerical_columns, do_preprocess)
 
-        self.n_jobs = n_jobs
+        if n_jobs < 1:
+            raise ValueError("n_jobs must be at least 1")
+        self.n_jobs = n_jobs

Also applies to: 151-152


303-321: Parallelization approach is sound, but be aware of serialization overhead.

The implementation correctly:

  • Generates unique seeds per task for reproducibility
  • Uses pool.map which preserves result ordering
  • Avoids Pool overhead when n_jobs=1

However, the DataFrames (real_data, synthetic_data, holdout_data) are serialized and sent to each worker. For very large datasets, this serialization overhead may reduce the benefits of parallelization. If performance becomes an issue with large datasets, consider shared memory approaches.

Also, since you've added multiprocess>=0.70.18 as a dependency (which uses dill for better serialization), you may want to import from multiprocess instead of the standard library multiprocessing for consistency and to leverage dill's enhanced pickling capabilities.

-from multiprocessing import Pool
+from multiprocess import Pool
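For reference, multiprocess is designed as a drop-in replacement for the standard library module, and because it serializes with dill it can also ship objects the stdlib Pool cannot, such as lambdas. A small illustrative example (not code from the PR):

```python
from multiprocess import Pool  # dill-based drop-in for multiprocessing

if __name__ == "__main__":
    with Pool(4) as pool:
        # A lambda fails to pickle with the stdlib multiprocessing.Pool,
        # but dill serializes it without issue here.
        squares = pool.map(lambda x: x * x, range(10))
    print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```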
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 414c704 and fc10d30.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • pyproject.toml (1 hunks)
  • src/midst_toolkit/evaluation/quality/multi_target_modeling_difference.py (5 hunks)
  • tests/unit/evaluation/quality/test_multi_target_modeling_difference.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/midst_toolkit/evaluation/quality/multi_target_modeling_difference.py (2)
src/midst_toolkit/common/enumerations.py (1)
  • ColumnType (37-41)
src/midst_toolkit/common/random.py (1)
  • set_all_random_seeds (11-55)
tests/unit/evaluation/quality/test_multi_target_modeling_difference.py (4)
src/midst_toolkit/common/random.py (2)
  • set_all_random_seeds (11-55)
  • unset_all_random_seeds (58-67)
src/midst_toolkit/evaluation/quality/multi_target_modeling_difference.py (2)
  • MultiTargetModelingDifference (71-349)
  • compute (251-349)
src/midst_toolkit/common/enumerations.py (1)
  • ColumnType (37-41)
src/midst_toolkit/evaluation/quality/mean_regression_difference.py (1)
  • compute (537-636)
🪛 Ruff (0.14.6)
src/midst_toolkit/evaluation/quality/multi_target_modeling_difference.py

68-68: Avoid specifying long messages outside the exception class

(TRY003)


305-305: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: unit-tests
  • GitHub Check: run-code-check
  • GitHub Check: integration-tests
🔇 Additional comments (4)
src/midst_toolkit/evaluation/quality/multi_target_modeling_difference.py (3)

2-5: LGTM!

The imports are appropriate for the parallelization feature. The partial function is well-suited for creating the callable with fixed dataframe arguments for the Pool.


29-68: Well-structured module-level function for multiprocessing compatibility.

The function is correctly defined at module level (required for pickle serialization) and properly seeds randomness per-process. The defensive .copy() calls on DataFrames ensure isolation between parallel tasks.

Regarding the static analysis hint (S311): Using random.randbytes for seeding ML computations is appropriate—it's not used for cryptographic purposes.


323-329: LGTM!

The post-processing correctly separates F1 differences (categorical targets) from regression metrics (numerical targets) and accumulates them for averaging.
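Conceptually the aggregation amounts to something like the sketch below. The per-label key names (`f1_difference`, `r2_difference`, `mean_squared_error_difference`) and `avg_f1_difference` are assumptions for illustration; `avg_r2_difference` and `avg_mean_squared_error_difference` match the keys asserted in the tests.

```python
def aggregate(per_label_results: list[dict]) -> dict:
    # Separate categorical (F1) results from numerical (regression) results,
    # then average each group across the target columns that produced it.
    score: dict = {}
    for averaged_key, per_label_key in [
        ("avg_f1_difference", "f1_difference"),
        ("avg_r2_difference", "r2_difference"),
        ("avg_mean_squared_error_difference", "mean_squared_error_difference"),
    ]:
        values = [r[per_label_key] for r in per_label_results if per_label_key in r]
        if values:
            score[averaged_key] = sum(values) / len(values)
    return score


# Example: two numerical targets and one categorical target.
print(
    aggregate(
        [
            {"r2_difference": -0.03, "mean_squared_error_difference": 1.0},
            {"r2_difference": -0.04, "mean_squared_error_difference": 0.99},
            {"f1_difference": 0.02},
        ]
    )
)
```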

tests/unit/evaluation/quality/test_multi_target_modeling_difference.py (1)

95-100: Verify that updated expected values are correct after refactoring.

The expected metric values have changed. This is likely due to the new per-label seeding approach in the refactored code. Please confirm these new values are the expected behavior and not a regression introduced by the parallelization changes.

@bzamanlooy (Collaborator) left a comment:

It looks good to me. If we don't want to further investigate testing for multiprocessing, the PR is good to go, but I've added some comments about what I found while looking around; I'm not sure whether they are good ideas.


@emersodb force-pushed the dbe/parallelize_multi_target_metric branch from f92cbf4 to 6da7378 on December 1, 2025 at 14:55
@lotif (Collaborator) left a comment:

A couple of things to make it better.

@lotif (Collaborator) left a comment:

Thanks for addressing the comments :)

@emersodb (Collaborator Author) commented Dec 2, 2025:

> Thanks for addressing the comments :)

Of course! This one was a bit weird with the randomness stuff. I'm surprised it's not a solved issue by now.

@emersodb merged commit 6eb18e9 into main on Dec 2, 2025. 6 checks passed.
@emersodb deleted the dbe/parallelize_multi_target_metric branch on December 2, 2025 at 22:11.