Add Privacy Risk Assessment Module with T-Closeness Support #41
Draft
Co-authored-by: mitchelllisle <18128531+mitchelllisle@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add privacy risk assessment module for Spark datasets" to "Add Privacy Risk Assessment Module with T-Closeness Support" (Oct 12, 2025)
Overview
This PR implements a comprehensive privacy risk assessment module that helps data engineering teams evaluate and mitigate re-identification risks in Spark datasets. The module provides automatic quasi-identifier detection, multi-metric privacy analysis, and actionable recommendations for data anonymization.
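As a rough illustration of what re-identification risk means here, the sketch below groups rows by their quasi-identifier values and reports the smallest group size and the share of rows that are unique. It is plain Scala rather than Spark, and the object and method names are hypothetical; the metrics the PR actually computes are not shown on this page.

```scala
// Hypothetical sketch; none of these names come from PrivacyRiskAssessment.scala.
object ReidentificationRiskSketch {
  // Each row is represented only by its quasi-identifier values (assumption).
  type Row = Seq[String]

  // Size of every equivalence class, keyed by the quasi-identifier tuple.
  def classSizes(rows: Seq[Row]): Map[Row, Int] =
    rows.groupBy(identity).map { case (key, members) => key -> members.size }

  // Smallest equivalence class: the data is k-anonymous for this k.
  def smallestClass(rows: Seq[Row]): Int = {
    require(rows.nonEmpty, "need at least one row")
    classSizes(rows).values.min
  }

  // Fraction of rows whose quasi-identifier tuple occurs exactly once.
  def uniqueFraction(rows: Seq[Row]): Double = {
    require(rows.nonEmpty, "need at least one row")
    classSizes(rows).values.count(_ == 1).toDouble / rows.size
  }
}
```

A row that is alone in its equivalence class can be singled out by anyone who knows its quasi-identifier values, which is why the smallest class size is a common headline number.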
What's New
Privacy Risk Assessment Module
The new PrivacyRiskAssessment object provides a unified interface for evaluating privacy risks across multiple dimensions:
Key Features:
T-Closeness Analyser
Implements the t-closeness privacy principle as a new analyser class:
T-closeness extends l-diversity by requiring that the distribution of sensitive attributes in each equivalence class is close to the overall distribution, preventing attribute disclosure through skewed distributions.
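The check described above can be sketched in plain Scala, without Spark. This is an illustration under stated assumptions, not the code in TCloseness.scala: the names are hypothetical, the sensitive attribute is treated as categorical, and the distance used is total variation distance, which coincides with Earth Mover's Distance only when all categories are taken to be equally far apart. The distance measure the PR uses is not stated on this page.

```scala
// Hypothetical sketch; names are illustrative and not taken from TCloseness.scala.
object TClosenessSketch {
  // A record is its quasi-identifier values plus one categorical sensitive value.
  final case class Record(quasiId: Seq[String], sensitive: String)

  // Relative frequency of each sensitive value.
  def distribution(values: Seq[String]): Map[String, Double] = {
    val n = values.size.toDouble
    values.groupBy(identity).map { case (v, vs) => v -> vs.size / n }
  }

  // Total variation distance: half the sum of absolute probability differences.
  def variationalDistance(p: Map[String, Double], q: Map[String, Double]): Double =
    (p.keySet ++ q.keySet).toSeq
      .map(k => math.abs(p.getOrElse(k, 0.0) - q.getOrElse(k, 0.0)))
      .sum / 2.0

  // Largest distance between any equivalence class and the whole table.
  def maxDistance(records: Seq[Record]): Double = {
    require(records.nonEmpty, "need at least one record")
    val overall = distribution(records.map(_.sensitive))
    records.groupBy(_.quasiId).values
      .map(cls => variationalDistance(distribution(cls.map(_.sensitive)), overall))
      .max
  }

  // The table satisfies t-closeness when no class exceeds the threshold t.
  def satisfies(records: Seq[Record], t: Double): Boolean =
    maxDistance(records) <= t
}
```

An equivalence class in which every record shares one sensitive value passes any l-diversity check only if l is 1, and it also shows up here as a large distance from the overall distribution, which is the skew t-closeness is meant to catch.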
Example Output
The module generates comprehensive reports like this:
Integration with Existing Tools
The module seamlessly integrates with Maskala's existing privacy tools:
Files Added
- src/main/scala/org/mitchelllisle/analysers/PrivacyRiskAssessment.scala - Main risk assessment module
- src/main/scala/org/mitchelllisle/analysers/TCloseness.scala - T-closeness privacy analyser
- src/main/scala/org/mitchelllisle/examples/PrivacyRiskAssessmentExample.scala - Usage examples
- src/test/scala/PrivacyRiskAssessmentTest.scala - 10 comprehensive tests
- src/test/scala/TClosenessTest.scala - 5 comprehensive tests
- PRIVACY_RISK_ASSESSMENT_SUMMARY.md - Implementation documentation
Testing
Documentation
This implementation helps data engineers ensure individual privacy and compliance by understanding and mitigating re-identification risks in their data.
Original prompt
This pull request was created as a result of the following prompt from Copilot chat.