@bzamanlooy
Collaborator

PR Type

[Feature | Fix | Documentation | Other ]

Fix

Short Description

Added batching to the nearest-neighbour computation from real to synthetic data.

Clickup Ticket(s): https://app.clickup.com/t/868gea30e

This PR addresses the memory usage issues encountered when calculating the epsilon identifiability risk (EIR). In the current version, batching is applied only when computing the kNN distances between synthetic and real data.

Tests Added

No tests added, but two of the existing tests are impacted by the batch size.

@bzamanlooy bzamanlooy requested a review from emersodb January 5, 2026 15:56
@coderabbitai
coderabbitai bot commented Jan 5, 2026

📝 Walkthrough

This pull request introduces a new privacy metrics module (batched_eir.py) containing epsilon identifiability risk computation functions. The module includes a _column_entropy helper function, a batched_reference_knn function for memory-efficient k-nearest neighbor distance computation with configurable batching, and an EpsilonIdentifiability metric class that evaluates privacy risk by comparing internal REAL→REAL distances against external REAL→SYNTHETIC distances. An existing module is updated to import EpsilonIdentifiability from the new local implementation instead of an external syntheval dependency.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: adding batching to EIR computation, which matches the primary objective of implementing memory-efficient batching for epsilon identifiability risk calculations. |
| Description check | ✅ Passed | The description follows the template with PR Type, Short Description, and Tests Added sections completed. All required sections are present and adequately filled with relevant details about the fix and its impact. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Fix all issues with AI Agents 🤖
In @src/midst_toolkit/evaluation/privacy/batched_eir.py:
- Around lines 11-14: In _column_entropy, the first element unpacked from
np.unique(np.round(labels), return_counts=True) is bound to "value" but never
used. Rename it to "_" so that the unpacking reads "_, counts = ..." instead of
"value, counts = ..."; this marks the element as intentionally discarded and
removes the unused variable warning.
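
The suggested rename can be sketched as follows. This is a minimal illustration and not the repository's actual `_column_entropy`: the function name `column_entropy_sketch` and the NumPy-only entropy calculation are assumptions, and only the `np.unique(np.round(labels), return_counts=True)` call is taken from the comment above.

```python
import numpy as np


def column_entropy_sketch(labels: np.ndarray) -> float:
    """Shannon entropy (in nats) of a column after rounding its values.

    Hypothetical stand-in for _column_entropy, written for illustration only.
    """
    # The unique values themselves are not needed, so the first element of the
    # returned tuple is bound to "_" to show it is intentionally discarded.
    _, counts = np.unique(np.round(labels), return_counts=True)
    probabilities = counts / counts.sum()
    return float(-np.sum(probabilities * np.log(probabilities)))
```

Binding the unused element to `_` keeps the behaviour identical while satisfying linters that flag unused unpacked variables.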
🧹 Nitpick comments (1)
src/midst_toolkit/evaluation/privacy/batched_eir.py (1)

17-71: Batching logic is correct and well-implemented.

The memory-efficient batching approach correctly:

  • Initializes tracking with infinity (line 49)
  • Iterates through reference data in configurable batches (lines 61-63)
  • Computes distances per batch using the existing _knn_distance utility (line 66)
  • Maintains the minimum distance across all batches (line 69)

The implementation includes helpful features like optional progress tracking and clear documentation.
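
The four steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `batched_eir.py`: the function name `batched_min_distance` is hypothetical, and a brute-force Euclidean distance stands in for syntheval's `_knn_distance`, whose signature is not shown in this conversation.

```python
import numpy as np


def batched_min_distance(
    queries: np.ndarray, reference: np.ndarray, ref_batch_size: int = 128
) -> np.ndarray:
    """Distance from each query row to its nearest reference row, computed in batches.

    Hypothetical sketch; brute-force Euclidean distance replaces _knn_distance.
    """
    # Start every running minimum at infinity so the first batch always improves it.
    min_distances = np.full(queries.shape[0], np.inf)
    for start in range(0, reference.shape[0], ref_batch_size):
        batch = reference[start : start + ref_batch_size]
        # Pairwise distances for this batch only: shape (n_queries, batch_length).
        differences = queries[:, None, :] - batch[None, :, :]
        distances = np.sqrt((differences**2).sum(axis=2))
        # Keep the smaller of the running minimum and this batch's nearest distance.
        min_distances = np.minimum(min_distances, distances.min(axis=1))
    return min_distances
```

In this sketch the largest intermediate array scales with `ref_batch_size` rather than with the full reference set, which is the memory saving that batching is meant to provide.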

Optional: Consider exposing batch_size configuration.

The default ref_batch_size=128 works well for typical use cases. However, if users need to tune batch size for extreme memory constraints or performance optimization, there's currently no way to configure it through the EpsilonIdentifiability class. This may be intentional to keep the API simple, but consider whether exposure through an optional constructor parameter would be valuable.
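
One way such an optional constructor parameter could look is sketched below. Everything here is hypothetical: `BatchedMetricSketch` and `batch_bounds` are illustrative names, and the real `EpsilonIdentifiability` inherits from syntheval's `MetricClass`, whose constructor is not shown in this conversation.

```python
class BatchedMetricSketch:
    """Hypothetical stand-in for a metric class; not the real EpsilonIdentifiability."""

    def __init__(self, ref_batch_size: int = 128) -> None:
        # Reject non-positive sizes early, since they would produce no batches.
        if ref_batch_size < 1:
            raise ValueError("ref_batch_size must be a positive integer")
        self.ref_batch_size = ref_batch_size

    def batch_bounds(self, n_reference_rows: int) -> list[tuple[int, int]]:
        """Return the (start, stop) row ranges that a batched pass would visit."""
        return [
            (start, min(start + self.ref_batch_size, n_reference_rows))
            for start in range(0, n_reference_rows, self.ref_batch_size)
        ]
```

Keeping the default at 128 would leave existing callers unaffected while letting memory-constrained users pass a smaller value.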

Verify test updates account for numerical differences.

The PR description mentions that two tests are impacted by the batch size. This suggests that batched computation may produce slightly different numerical results compared to the original implementation (due to floating-point arithmetic ordering). Please confirm that the affected tests have been reviewed and any tolerance adjustments are appropriate.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cb9752b and 7cab2bf.

📒 Files selected for processing (2)
  • src/midst_toolkit/evaluation/privacy/batched_eir.py
  • src/midst_toolkit/evaluation/privacy/epsilon_identifiability_risk.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-11T16:08:49.024Z
Learnt from: lotif
Repo: VectorInstitute/midst-toolkit PR: 107
File: examples/gan/synthesize.py:1-47
Timestamp: 2025-12-11T16:08:49.024Z
Learning: When using SDV (version >= 1.18.0), prefer loading a saved CTGANSynthesizer with CTGANSynthesizer.load(filepath) instead of sdv.utils.load_synthesizer(). This applies to Python code across the repo (e.g., any script that loads a CTGANSynthesizer). Ensure the SDV version is >= 1.18.0 before using CTGANSynthesizer.load, and fall back to sdv.utils.load_synthesizer() only if a compatible alternative is required.

Applied to files:

  • src/midst_toolkit/evaluation/privacy/batched_eir.py
  • src/midst_toolkit/evaluation/privacy/epsilon_identifiability_risk.py
🧬 Code graph analysis (1)
src/midst_toolkit/evaluation/privacy/epsilon_identifiability_risk.py (1)
src/midst_toolkit/evaluation/privacy/batched_eir.py (1)
  • EpsilonIdentifiability (74-168)
🪛 Ruff (0.14.10)
src/midst_toolkit/evaluation/privacy/batched_eir.py

13-13: Unpacked variable value is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: integration-tests
  • GitHub Check: unit-tests
🔇 Additional comments (5)
src/midst_toolkit/evaluation/privacy/epsilon_identifiability_risk.py (1)

6-6: LGTM! Clean migration to local batched implementation.

The import change successfully moves from the external syntheval dependency to the new local batched implementation. The fact that no other code changes are required confirms that the new implementation maintains API compatibility.

src/midst_toolkit/evaluation/privacy/batched_eir.py (4)

1-9: LGTM! Appropriate imports for batched computation.

The imports properly include progress tracking (tqdm), numerical operations (numpy, scipy.stats), and reuse existing syntheval utilities (_knn_distance, MetricClass).


74-82: LGTM! Clean metric interface implementation.

The class properly implements the MetricClass interface with appropriate method signatures for name and type.


83-135: Evaluate logic correctly implements batched EIR computation.

The implementation makes intelligent choices about where to apply batching:

Correctly batched (lines 103-109, 122-128):

  • REAL→SYNTHETIC and HOUT→SYNTHETIC comparisons use batched_reference_knn to address memory concerns

Appropriately unbatched (lines 93-100, 117-119):

  • REAL→REAL and HOUT→HOUT comparisons use standard _knn_distance since these are self-comparisons where the utility function likely excludes self-matches efficiently

Computation correctness:

  • Column entropy weighting with numerical stability (lines 89-90) ✓
  • Identifiability as fraction where external distance < internal distance (line 112) ✓
  • Privacy loss as difference between training and holdout identifiability (line 133) ✓
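
The last two bullets can be illustrated with a short sketch. It assumes the nearest-neighbour distances have already been computed and passed in as arrays; the function names `identifiability` and `privacy_loss` are hypothetical, and the entropy weighting from the first bullet is omitted because its details are not given here.

```python
import numpy as np


def identifiability(external: np.ndarray, internal: np.ndarray) -> float:
    """Fraction of records whose external distance is smaller than their internal one.

    external: distance from each real record to its nearest synthetic record.
    internal: distance from each real record to its nearest other real record.
    """
    return float(np.mean(external < internal))


def privacy_loss(train_identifiability: float, holdout_identifiability: float) -> float:
    """Difference between training and holdout identifiability."""
    return train_identifiability - holdout_identifiability
```

For example, with external distances `[0.1, 0.5, 0.9, 0.2]` and internal distances `[0.3, 0.4, 1.0, 0.1]`, two of the four records are closer to a synthetic record than to another real record, so the sketch returns 0.5.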

137-168: LGTM! Output formatting methods are correct.

Both format_output() and normalize_output() properly handle the optional holdout data case and format results consistently.

Collaborator

@lotif lotif left a comment

Lots of comments, but mostly on variable naming.

@bzamanlooy bzamanlooy requested a review from lotif January 8, 2026 18:54
Collaborator

@lotif lotif left a comment

Thanks for addressing all the comments! A few more minor things but otherwise it's good to go :)


3 participants