Added batching to EIR computation #114
base: main
Conversation
📝 Walkthrough
This pull request introduces a new privacy metrics module (`batched_eir.py`).
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 1
In @src/midst_toolkit/evaluation/privacy/batched_eir.py:
- Around lines 11-14: In `_column_entropy`, replace the unused unpacked variable name `value` with `_` so that the first element of the tuple returned by `np.unique(np.round(labels), return_counts=True)` is intentionally discarded; change the unpacking from `value, counts = ...` to `_, counts = ...` to clear the unused-variable warning.
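A minimal sketch of the suggested fix, assuming `_column_entropy` computes a Shannon entropy over the rounded column values (the function body beyond the unpacking line is illustrative, not the PR's exact code):

```python
import numpy as np
from scipy.stats import entropy


def _column_entropy(labels: np.ndarray) -> float:
    # Only the counts matter; discard the unique values with "_" to satisfy RUF059.
    _, counts = np.unique(np.round(labels), return_counts=True)
    # Shannon entropy of the empirical distribution of rounded values.
    return float(entropy(counts))
```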
🧹 Nitpick comments (1)
src/midst_toolkit/evaluation/privacy/batched_eir.py (1)
17-71: Batching logic is correct and well-implemented.
The memory-efficient batching approach correctly:
- Initializes tracking with infinity (line 49)
- Iterates through reference data in configurable batches (lines 61-63)
- Computes distances per batch using the existing `_knn_distance` utility (line 66)
- Maintains the minimum distance across all batches (line 69)
The implementation includes helpful features like optional progress tracking and clear documentation. (A sketch of this batching pattern follows this comment.)
Optional: Consider exposing batch_size configuration.
The default `ref_batch_size=128` works well for typical use cases. However, if users need to tune the batch size for extreme memory constraints or performance optimization, there's currently no way to configure it through the `EpsilonIdentifiability` class. This may be intentional to keep the API simple, but consider whether exposure through an optional constructor parameter would be valuable.
Verify test updates account for numerical differences.
The PR description mentions that "two existing tests are impacted by the batch size." This suggests that batched computation may produce slightly different numerical results compared to the original implementation (due to floating-point arithmetic ordering). Please confirm that the affected tests have been reviewed and that any tolerance adjustments are appropriate.
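To make the batching pattern above concrete, here is a minimal sketch of a batched minimum-distance helper. The name `batched_reference_knn` and the `ref_batch_size=128` default come from the review, but the signature, the weighting convention, and the use of `scipy.spatial.distance.cdist` in place of syntheval's `_knn_distance` are assumptions for illustration:

```python
import numpy as np
from scipy.spatial.distance import cdist


def batched_reference_knn(
    query: np.ndarray,
    reference: np.ndarray,
    weights: np.ndarray,
    ref_batch_size: int = 128,
) -> np.ndarray:
    """Minimum weighted distance from each query row to any reference row,
    computed one reference batch at a time to bound peak memory use.
    Illustrative sketch only; the PR's helper relies on syntheval's _knn_distance."""
    # Start at +inf so the first batch always lowers the running minimum.
    min_dist = np.full(query.shape[0], np.inf)
    for start in range(0, reference.shape[0], ref_batch_size):
        batch = reference[start : start + ref_batch_size]
        # Pairwise weighted Euclidean distances between query rows and this batch only,
        # so at most (n_query x ref_batch_size) distances are held in memory at once.
        dists = cdist(query * weights, batch * weights, metric="euclidean")
        # Keep the smallest distance seen so far for every query row.
        min_dist = np.minimum(min_dist, dists.min(axis=1))
    return min_dist
```

Exposing `ref_batch_size` as an optional constructor argument on `EpsilonIdentifiability`, as suggested above, would simply mean threading this parameter through from the class to the helper.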
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/midst_toolkit/evaluation/privacy/batched_eir.py
- src/midst_toolkit/evaluation/privacy/epsilon_identifiability_risk.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-11T16:08:49.024Z
Learnt from: lotif
Repo: VectorInstitute/midst-toolkit PR: 107
File: examples/gan/synthesize.py:1-47
Timestamp: 2025-12-11T16:08:49.024Z
Learning: When using SDV (version >= 1.18.0), prefer loading a saved CTGANSynthesizer with CTGANSynthesizer.load(filepath) instead of sdv.utils.load_synthesizer(). This applies to Python code across the repo (e.g., any script that loads a CTGANSynthesizer). Ensure the SDV version is >= 1.18.0 before using CTGANSynthesizer.load, and fall back to sdv.utils.load_synthesizer() only if a compatible alternative is required.
Applied to files:
- src/midst_toolkit/evaluation/privacy/batched_eir.py
- src/midst_toolkit/evaluation/privacy/epsilon_identifiability_risk.py
🧬 Code graph analysis (1)
src/midst_toolkit/evaluation/privacy/epsilon_identifiability_risk.py (1)
src/midst_toolkit/evaluation/privacy/batched_eir.py (1)
- `EpsilonIdentifiability` (74-168)
🪛 Ruff (0.14.10)
src/midst_toolkit/evaluation/privacy/batched_eir.py
13-13: Unpacked variable `value` is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: integration-tests
- GitHub Check: unit-tests
🔇 Additional comments (5)
src/midst_toolkit/evaluation/privacy/epsilon_identifiability_risk.py (1)
6-6: LGTM! Clean migration to local batched implementation.
The import change successfully moves from the external `syntheval` dependency to the new local batched implementation. The fact that no other code changes are required confirms that the new implementation maintains API compatibility.
src/midst_toolkit/evaluation/privacy/batched_eir.py (4)
1-9: LGTM! Appropriate imports for batched computation.
The imports properly include progress tracking (`tqdm`), numerical operations (`numpy`, `scipy.stats`), and reuse existing `syntheval` utilities (`_knn_distance`, `MetricClass`).
74-82: LGTM! Clean metric interface implementation.
The class properly implements the `MetricClass` interface with appropriate method signatures for name and type.
83-135: Evaluate logic correctly implements batched EIR computation.
The implementation makes intelligent choices about where to apply batching:
Correctly batched (lines 103-109, 122-128):
- REAL→SYNTHETIC and HOUT→SYNTHETIC comparisons use `batched_reference_knn` to address memory concerns
Appropriately unbatched (lines 93-100, 117-119):
- REAL→REAL and HOUT→HOUT comparisons use the standard `_knn_distance`, since these are self-comparisons where the utility function likely excludes self-matches efficiently
Computation correctness:
- Column entropy weighting with numerical stability (lines 89-90) ✓
- Identifiability as fraction where external distance < internal distance (line 112) ✓
- Privacy loss as difference between training and holdout identifiability (line 133) ✓
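The quantities checked above can be written out in a short sketch. The function and variable names below are illustrative only and are not the PR's code; they simply restate the definitions from the review comment:

```python
import numpy as np


def identifiability(dist_to_synth: np.ndarray, dist_to_self: np.ndarray) -> float:
    """Fraction of records whose nearest synthetic neighbour is closer than
    their nearest (other) real neighbour: external distance < internal distance."""
    return float(np.mean(dist_to_synth < dist_to_self))


def privacy_loss(
    dist_real_to_synth: np.ndarray,
    dist_real_to_real: np.ndarray,
    dist_hout_to_synth: np.ndarray,
    dist_hout_to_hout: np.ndarray,
) -> float:
    """Difference between training-set and holdout-set identifiability."""
    return identifiability(dist_real_to_synth, dist_real_to_real) - identifiability(
        dist_hout_to_synth, dist_hout_to_hout
    )
```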
137-168: LGTM! Output formatting methods are correct.
Both `format_output()` and `normalize_output()` properly handle the optional holdout data case and format results consistently.
lotif
left a comment
Lots of comments, but mostly on variable naming.
lotif
left a comment
Thanks for addressing all the comments! A few more minor things but otherwise it's good to go :)
PR Type
[Feature | Fix | Documentation | Other ]
Fix
Short Description
Added batching to the computation of nearest neighbour from real to synthetic data.
Clickup Ticket(s): https://app.clickup.com/t/868gea30e
This PR addresses the memory usage issues faced when calculating the EIR. In the current version, I have only made sure that batching applies when computing the KNN between synthetic and real data.
Tests Added
No tests added, but two of the existing tests are impacted by the batch size.