Add Privacy Risk Assessment Module with T-Closeness Support #41

Draft
Copilot wants to merge 3 commits into main from copilot/add-privacy-risk-assessment-module

Copilot AI commented Oct 12, 2025

Overview

This PR adds a privacy risk assessment module that helps data engineering teams evaluate and mitigate re-identification risks in Spark datasets. The module provides automatic quasi-identifier detection, privacy analysis across four metrics (k-anonymity, l-diversity, t-closeness, and uniqueness), and recommendations for data anonymization.

What's New

Privacy Risk Assessment Module

The new PrivacyRiskAssessment object provides a unified interface for evaluating privacy risks across multiple dimensions:

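import org.mitchelllisle.analysers.{PrivacyRiskAssessment, PrivacyRiskParams}

// `data` is an existing Spark DataFrame holding the records to assess.
val params = PrivacyRiskParams(
  quasiIdentifiers = Seq("age", "gender", "zipcode"), // columns that could re-identify a person in combination
  sensitiveAttribute = Some("disease"),               // column whose value must not be disclosed
  idColumn = Some("patient_id")                       // direct identifier column
)

// Run the assessment, then render the result as a text report.
val result = PrivacyRiskAssessment.assess(data, params)
val report = PrivacyRiskAssessment.generateReport(result)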

Key Features:

  • Automatic Quasi-Identifier Detection: Flags columns that could be used for re-identification, based on column names and cardinality patterns
  • Multi-Metric Analysis: Evaluates k-anonymity, l-diversity, t-closeness, and uniqueness in a single pass
  • Overall Risk Scoring: Produces a 0-100 risk score with a per-metric breakdown and a risk level classification
  • Actionable Recommendations: Indicates which quasi-identifiers to generalize and which records to suppress
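
To make the name-and-cardinality idea concrete, here is a minimal sketch in plain Scala collections (no Spark). It is not the PR's implementation: the object name, the `nameHints` list, and the `low`/`high` cardinality band are all hypothetical, since the description does not state which names or thresholds the module uses.

```scala
object QuasiIdentifierSketch {
  // Hypothetical name fragments; the PR does not list the ones it matches on.
  val nameHints: Seq[String] = Seq("age", "gender", "zip", "postcode", "birth")

  // Share of distinct values in a column (1.0 means every value is unique).
  def distinctRatio(values: Seq[String]): Double =
    if (values.isEmpty) 0.0 else values.distinct.size.toDouble / values.size

  // Flag a column when its name contains a hint, or when its distinct ratio
  // falls inside an assumed "medium cardinality" band [low, high].
  def candidates(
      columns: Map[String, Seq[String]],
      low: Double = 0.05,
      high: Double = 0.8
  ): Seq[String] =
    columns.toSeq.collect {
      case (name, values)
          if nameHints.exists(h => name.toLowerCase.contains(h)) ||
            (distinctRatio(values) >= low && distinctRatio(values) <= high) =>
        name
    }.sorted
}
```

With columns `patient_id` (all values distinct), `age`, and a `city` column with three distinct values in four rows, `candidates` returns `Seq("age", "city")`: the fully unique `patient_id` falls outside the band and looks like a direct identifier rather than a quasi-identifier. Substring matching is deliberately crude here and would also flag a column named `page`.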

T-Closeness Analyser

Implements the t-closeness privacy principle as a new analyser class:

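import org.mitchelllisle.analysers.TCloseness

// t is the maximum allowed distance between a group's distribution and the overall one.
val tClose = new TCloseness(t = 0.3)
// Check the sensitive column "disease" against that threshold.
val isSafe = tClose.isTClose(data, "disease")
val filtered = tClose.removeLessThanTRows(data, "disease")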

T-closeness extends l-diversity by requiring that, within each equivalence class, the distribution of the sensitive attribute lies within distance t of its distribution over the whole dataset. This prevents attribute disclosure through skewed distributions, which l-diversity alone does not rule out.
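
The check described above can be sketched in plain Scala collections (no Spark). This is an illustration only, not the `TCloseness` class from this PR: the `Record` type and function names are hypothetical, and the distance used here is total variation distance, which is an assumption because the description does not say which distance the analyser computes. For a categorical attribute where all values are treated as equally far apart, the Earth Mover's Distance commonly used for t-closeness reduces to this same quantity.

```scala
object TClosenessSketch {
  // A record: its quasi-identifier values plus one sensitive value.
  final case class Record(quasi: Seq[String], sensitive: String)

  // Relative frequency of each sensitive value in a group of records.
  def distribution(rows: Seq[Record]): Map[String, Double] =
    rows.groupBy(_.sensitive).map { case (value, group) =>
      value -> group.size.toDouble / rows.size
    }

  // Total variation distance: half the L1 distance between two distributions.
  def totalVariation(p: Map[String, Double], q: Map[String, Double]): Double =
    (p.keySet ++ q.keySet).toSeq
      .map(v => math.abs(p.getOrElse(v, 0.0) - q.getOrElse(v, 0.0)))
      .sum / 2.0

  // Largest distance between any equivalence class and the whole dataset.
  // Assumes `rows` is non-empty.
  def maxDistance(rows: Seq[Record]): Double = {
    val overall = distribution(rows)
    rows.groupBy(_.quasi).values
      .map(group => totalVariation(distribution(group), overall))
      .max
  }

  // The dataset is t-close when no class strays further than t.
  def isTClose(rows: Seq[Record], t: Double): Boolean = maxDistance(rows) <= t
}
```

If one equivalence class contains only "flu" and another only "cancer", each class is at distance 0.5 from the 50/50 overall distribution, so the data fails a threshold of t = 0.3 even though every class has the same size.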

Example Output

The module generates reports like this:

================================================================================
PRIVACY RISK ASSESSMENT REPORT
================================================================================

Overall Risk Score: 15/100
Risk Level: LOW ✓

--------------------------------------------------------------------------------
PRIVACY METRICS
--------------------------------------------------------------------------------
k-Anonymity Score: 3
l-Diversity Score: 3
t-Closeness Score: 0.167
Uniqueness Risk: 0.00%

--------------------------------------------------------------------------------
RECOMMENDATIONS
--------------------------------------------------------------------------------
1. k-anonymity: PASSED - Minimum group size is 3 (threshold: 3).
2. l-diversity: PASSED - Minimum diversity is 3 (threshold: 2).
3. t-closeness: PASSED - Maximum distribution distance is 0.167 (threshold: 0.300).
4. Uniqueness: PASSED - No highly unique records detected.
================================================================================
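
For readers less familiar with the other metrics in the report, the following plain-Scala sketch (no Spark) shows one common way to define them over equivalence classes. It is not taken from the PR: the `Record` type and function names are hypothetical, l-diversity is computed in its simple "distinct values" form, and the uniqueness definition (share of records alone in their class) is an assumption about what "Uniqueness Risk" measures.

```scala
object GroupMetricsSketch {
  // A record: its quasi-identifier values plus one sensitive value.
  final case class Record(quasi: Seq[String], sensitive: String)

  // Equivalence classes: records sharing the same quasi-identifier values.
  def classes(rows: Seq[Record]): Iterable[Seq[Record]] =
    rows.groupBy(_.quasi).values

  // k-anonymity: size of the smallest equivalence class.
  // Assumes `rows` is non-empty.
  def kAnonymity(rows: Seq[Record]): Int =
    classes(rows).map(_.size).min

  // l-diversity (distinct variant): fewest distinct sensitive values in any class.
  def lDiversity(rows: Seq[Record]): Int =
    classes(rows).map(_.map(_.sensitive).distinct.size).min

  // Uniqueness (assumed definition): share of records alone in their class.
  def uniqueness(rows: Seq[Record]): Double =
    classes(rows).count(_.size == 1).toDouble / rows.size
}
```

A dataset passes a k-anonymity threshold when `kAnonymity(rows)` is at least that threshold, and likewise for l-diversity. A single record with a unique quasi-identifier combination drops k to 1 and raises the uniqueness share above zero.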

Integration with Existing Tools

The module works alongside Maskala's existing privacy tools, so risk can be measured before and after anonymization:

// Assess initial risk
val initialRisk = PrivacyRiskAssessment.assess(rawData, params)

// Apply anonymization based on recommendations
val anonymiser = new Anonymiser("config.yaml")
val anonymizedData = anonymiser(rawData)

// Verify improvement
val finalRisk = PrivacyRiskAssessment.assess(anonymizedData, params)
println(s"Risk Reduction: ${initialRisk.overallRiskScore - finalRisk.overallRiskScore} points")

Files Added

  • src/main/scala/org/mitchelllisle/analysers/PrivacyRiskAssessment.scala - Main risk assessment module
  • src/main/scala/org/mitchelllisle/analysers/TCloseness.scala - T-closeness privacy analyser
  • src/main/scala/org/mitchelllisle/examples/PrivacyRiskAssessmentExample.scala - Usage examples
  • src/test/scala/PrivacyRiskAssessmentTest.scala - 10 tests
  • src/test/scala/TClosenessTest.scala - 5 tests
  • PRIVACY_RISK_ASSESSMENT_SUMMARY.md - Implementation documentation

Testing

  • Added 15 new tests (all passing)
  • All 47 existing tests still pass
  • Total test suite: 62/62 passing ✓

Documentation

  • Added "Privacy Risk Assessment" section to README with complete usage examples
  • Added "T-Closeness" section to README explaining the privacy principle
  • Included standalone example file demonstrating three common scenarios
  • Provided implementation summary document

This implementation helps data engineers ensure individual privacy and compliance by understanding and mitigating re-identification risks in their data.

Original prompt

Implement a privacy risk assessment module to evaluate user re-identification risks in Spark datasets. This feature should:

  • Detect quasi-identifiers in input datasets
  • Calculate privacy metrics such as k-anonymity, l-diversity, and t-closeness
  • Output risk scores and actionable recommendations for further anonymisation
  • Integrate seamlessly with Spark workflows
  • Provide example usage and documentation

The goal is to help data engineering teams ensure individual privacy and compliance by understanding and mitigating risks of re-identification in their data.


Copilot AI and others added 2 commits October 12, 2025 00:25
Co-authored-by: mitchelllisle <18128531+mitchelllisle@users.noreply.github.com>
Co-authored-by: mitchelllisle <18128531+mitchelllisle@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add privacy risk assessment module for Spark datasets" to "Add Privacy Risk Assessment Module with T-Closeness Support" on Oct 12, 2025
Copilot AI requested a review from mitchelllisle October 12, 2025 00:32