Added batching to EIR computation #114
base: main
Conversation
📝 Walkthrough
This pull request introduces a new privacy metrics module (`batched_eir.py`).
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 1
In @src/midst_toolkit/evaluation/privacy/batched_eir.py:
- Around lines 11-14: In `_column_entropy`, replace the unused unpacked variable name `value` with `_` so that the first element of the tuple returned by `np.unique(np.round(labels), return_counts=True)` is intentionally discarded; change the unpacking from `value, counts = ...` to `_, counts = ...` to clear the unused-variable warning.
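A minimal sketch of the suggested fix, assuming `_column_entropy` computes a Shannon entropy over the rounded column values (the function body beyond the unpacking line is illustrative, not the PR's exact code):

```python
import numpy as np
from scipy.stats import entropy


def _column_entropy(labels: np.ndarray) -> float:
    # Only the counts matter; discard the unique values with "_" to satisfy RUF059.
    _, counts = np.unique(np.round(labels), return_counts=True)
    # Shannon entropy of the empirical distribution of rounded values.
    return float(entropy(counts))
```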
🧹 Nitpick comments (1)
src/midst_toolkit/evaluation/privacy/batched_eir.py (1)
17-71: Batching logic is correct and well-implemented.
The memory-efficient batching approach correctly:
- Initializes tracking with infinity (line 49)
- Iterates through reference data in configurable batches (lines 61-63)
- Computes distances per batch using the existing `_knn_distance` utility (line 66)
- Maintains the minimum distance across all batches (line 69)
The implementation includes helpful features like optional progress tracking and clear documentation. (A sketch of this batching pattern follows this comment.)
Optional: Consider exposing batch_size configuration.
The default `ref_batch_size=128` works well for typical use cases. However, if users need to tune the batch size for extreme memory constraints or performance optimization, there's currently no way to configure it through the `EpsilonIdentifiability` class. This may be intentional to keep the API simple, but consider whether exposure through an optional constructor parameter would be valuable.
Verify test updates account for numerical differences.
The PR description mentions that "two existing tests are impacted by the batch size." This suggests that batched computation may produce slightly different numerical results compared to the original implementation (due to floating-point arithmetic ordering). Please confirm that the affected tests have been reviewed and that any tolerance adjustments are appropriate.
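To make the batching pattern above concrete, here is a minimal sketch of a batched minimum-distance helper. The name `batched_reference_knn` and the `ref_batch_size=128` default come from the review, but the signature, the weighting convention, and the use of `scipy.spatial.distance.cdist` in place of syntheval's `_knn_distance` are assumptions for illustration:

```python
import numpy as np
from scipy.spatial.distance import cdist


def batched_reference_knn(
    query: np.ndarray,
    reference: np.ndarray,
    weights: np.ndarray,
    ref_batch_size: int = 128,
) -> np.ndarray:
    """Minimum weighted distance from each query row to any reference row,
    computed one reference batch at a time to bound peak memory use.
    Illustrative sketch only; the PR's helper relies on syntheval's _knn_distance."""
    # Start at +inf so the first batch always lowers the running minimum.
    min_dist = np.full(query.shape[0], np.inf)
    for start in range(0, reference.shape[0], ref_batch_size):
        batch = reference[start : start + ref_batch_size]
        # Pairwise weighted Euclidean distances between query rows and this batch only,
        # so at most (n_query x ref_batch_size) distances are held in memory at once.
        dists = cdist(query * weights, batch * weights, metric="euclidean")
        # Keep the smallest distance seen so far for every query row.
        min_dist = np.minimum(min_dist, dists.min(axis=1))
    return min_dist
```

Exposing `ref_batch_size` as an optional constructor argument on `EpsilonIdentifiability`, as suggested above, would simply mean threading this parameter through from the class to the helper.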
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/midst_toolkit/evaluation/privacy/batched_eir.py
- src/midst_toolkit/evaluation/privacy/epsilon_identifiability_risk.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-11T16:08:49.024Z
Learnt from: lotif
Repo: VectorInstitute/midst-toolkit PR: 107
File: examples/gan/synthesize.py:1-47
Timestamp: 2025-12-11T16:08:49.024Z
Learning: When using SDV (version >= 1.18.0), prefer loading a saved CTGANSynthesizer with CTGANSynthesizer.load(filepath) instead of sdv.utils.load_synthesizer(). This applies to Python code across the repo (e.g., any script that loads a CTGANSynthesizer). Ensure the SDV version is >= 1.18.0 before using CTGANSynthesizer.load, and fall back to sdv.utils.load_synthesizer() only if a compatible alternative is required.
Applied to files:
- src/midst_toolkit/evaluation/privacy/batched_eir.py
- src/midst_toolkit/evaluation/privacy/epsilon_identifiability_risk.py
🧬 Code graph analysis (1)
src/midst_toolkit/evaluation/privacy/epsilon_identifiability_risk.py (1)
src/midst_toolkit/evaluation/privacy/batched_eir.py (1)
- `EpsilonIdentifiability` (74-168)
🪛 Ruff (0.14.10)
src/midst_toolkit/evaluation/privacy/batched_eir.py
13-13: Unpacked variable `value` is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: integration-tests
- GitHub Check: unit-tests
🔇 Additional comments (5)
src/midst_toolkit/evaluation/privacy/epsilon_identifiability_risk.py (1)
6-6: LGTM! Clean migration to local batched implementation.
The import change successfully moves from the external `syntheval` dependency to the new local batched implementation. The fact that no other code changes are required confirms that the new implementation maintains API compatibility.
src/midst_toolkit/evaluation/privacy/batched_eir.py (4)
1-9: LGTM! Appropriate imports for batched computation.
The imports properly include progress tracking (`tqdm`), numerical operations (`numpy`, `scipy.stats`), and reuse existing `syntheval` utilities (`_knn_distance`, `MetricClass`).
74-82: LGTM! Clean metric interface implementation.
The class properly implements the `MetricClass` interface with appropriate method signatures for name and type.
83-135: Evaluate logic correctly implements batched EIR computation.
The implementation makes intelligent choices about where to apply batching:
Correctly batched (lines 103-109, 122-128):
- REAL→SYNTHETIC and HOUT→SYNTHETIC comparisons use `batched_reference_knn` to address memory concerns
Appropriately unbatched (lines 93-100, 117-119):
- REAL→REAL and HOUT→HOUT comparisons use the standard `_knn_distance`, since these are self-comparisons where the utility function likely excludes self-matches efficiently
Computation correctness:
- Column entropy weighting with numerical stability (lines 89-90) ✓
- Identifiability as fraction where external distance < internal distance (line 112) ✓
- Privacy loss as difference between training and holdout identifiability (line 133) ✓
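The quantities checked above can be written out in a short sketch. The function and variable names below are illustrative only and are not the PR's code; they simply restate the definitions from the review comment:

```python
import numpy as np


def identifiability(dist_to_synth: np.ndarray, dist_to_self: np.ndarray) -> float:
    """Fraction of records whose nearest synthetic neighbour is closer than
    their nearest (other) real neighbour: external distance < internal distance."""
    return float(np.mean(dist_to_synth < dist_to_self))


def privacy_loss(
    dist_real_to_synth: np.ndarray,
    dist_real_to_real: np.ndarray,
    dist_hout_to_synth: np.ndarray,
    dist_hout_to_hout: np.ndarray,
) -> float:
    """Difference between training-set and holdout-set identifiability."""
    return identifiability(dist_real_to_synth, dist_real_to_real) - identifiability(
        dist_hout_to_synth, dist_hout_to_hout
    )
```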
137-168: LGTM! Output formatting methods are correct.
Both `format_output()` and `normalize_output()` properly handle the optional holdout data case and format results consistently.
lotif
left a comment
Lots of comments, but mostly on variable naming.
lotif
left a comment
Thanks for addressing all the comments! A few more minor things but otherwise it's good to go :)
PR Type
[Feature | Fix | Documentation | Other ]
Fix
Short Description
Added batching to the computation of nearest neighbour from real to synthetic data.
Clickup Ticket(s): https://app.clickup.com/t/868gea30e
This PR addresses the memory usage issues faced when calculating the EIR. In the current version, I have only made sure that batching applies when computing the KNN between synthetic and real data.
Tests Added
No tests added, but two of the existing tests are impacted by the batch size.