Add Privacy Risk Assessment Module with T-Closeness Support by Copilot · Pull Request #41 · mitchelllisle/maskala

Copilot · 2025-10-12T00:14:47Z

Overview

This PR implements a comprehensive privacy risk assessment module that helps data engineering teams evaluate and mitigate re-identification risks in Spark datasets. The module provides automatic quasi-identifier detection, multi-metric privacy analysis, and actionable recommendations for data anonymization.

What's New

Privacy Risk Assessment Module

The new PrivacyRiskAssessment object provides a unified interface for evaluating privacy risks across multiple dimensions:

import org.mitchelllisle.analysers.{PrivacyRiskAssessment, PrivacyRiskParams}

val params = PrivacyRiskParams(
  quasiIdentifiers = Seq("age", "gender", "zipcode"),
  sensitiveAttribute = Some("disease"),
  idColumn = Some("patient_id")
)

val result = PrivacyRiskAssessment.assess(data, params)
val report = PrivacyRiskAssessment.generateReport(result)

Key Features:

Automatic Quasi-Identifier Detection: Flags columns that could be used for re-identification, based on column names and cardinality patterns
Multi-Metric Analysis: Simultaneously evaluates k-anonymity, l-diversity, t-closeness, and uniqueness in a single pass
Overall Risk Scoring: Generates a comprehensive 0-100 risk score with component breakdown and risk level classification
Actionable Recommendations: Provides specific guidance on which quasi-identifiers to generalize and which records to suppress
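
As a rough illustration of detection by column name and cardinality, the sketch below works on in-memory columns rather than a Spark DataFrame. The object name, the keyword list, and the 0.9 distinct-ratio cutoff are all hypothetical; the PR does not state which hints or thresholds the module uses.

```scala
// Illustrative sketch only; not Maskala's implementation.
object QuasiIdentifierSketch {
  // Hypothetical name hints (assumption: the real keyword list is not given in the PR).
  private val nameHints = Seq("age", "gender", "zip", "postcode", "birth")

  // True when the column name contains one of the hints (case-insensitive).
  private def nameMatches(column: String): Boolean =
    nameHints.exists(hint => column.toLowerCase.contains(hint))

  // True when the column has more than one distinct value but is not near-unique.
  // Near-unique columns look like direct identifiers; constant columns identify nobody.
  private def cardinalityMatches(values: Seq[String], maxDistinctRatio: Double): Boolean = {
    val distinct = values.distinct.size
    distinct > 1 && distinct.toDouble / values.size <= maxDistinctRatio
  }

  // Returns candidate quasi-identifier column names in alphabetical order.
  def detect(columns: Map[String, Seq[String]], maxDistinctRatio: Double = 0.9): Seq[String] =
    columns.toSeq.collect {
      case (name, values) if nameMatches(name) || cardinalityMatches(values, maxDistinctRatio) => name
    }.sorted
}
```

Substring matching is deliberately crude (the hint "age" would also match a column named "dosage"), so output from a heuristic like this is best treated as a list of candidates to review.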

T-Closeness Analyser

Implements the t-closeness privacy principle as a new analyser class:

import org.mitchelllisle.analysers.TCloseness

val tClose = new TCloseness(t = 0.3)
val isSafe = tClose.isTClose(data, "disease")
val filtered = tClose.removeLessThanTRows(data, "disease")

T-closeness extends l-diversity: the distribution of the sensitive attribute within each equivalence class must be within a distance t of its distribution across the whole dataset. This limits attribute disclosure through skewed distributions, which l-diversity alone does not prevent.
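
To make the distance check concrete, here is a minimal sketch on plain Scala collections. It assumes a categorical sensitive attribute and uses total variation distance, which is what Earth Mover's Distance reduces to when all categories are equally far apart. Whether the TCloseness class in this PR uses the same metric is not stated, and the object and method names below are hypothetical.

```scala
// Illustrative sketch only; not the PR's TCloseness class.
object TClosenessSketch {
  // Relative frequency of each sensitive value.
  def distribution(values: Seq[String]): Map[String, Double] =
    values.groupBy(identity).map { case (value, occurrences) =>
      value -> occurrences.size.toDouble / values.size
    }

  // Total variation distance: half the L1 distance between two distributions.
  def distance(p: Map[String, Double], q: Map[String, Double]): Double =
    (p.keySet ++ q.keySet).toSeq
      .map(key => math.abs(p.getOrElse(key, 0.0) - q.getOrElse(key, 0.0)))
      .sum / 2.0

  // Largest distance between any equivalence class and the whole dataset.
  // Each row is (equivalence-class key, sensitive value).
  def maxDistance(rows: Seq[(String, String)]): Double = {
    require(rows.nonEmpty, "t-closeness is undefined for an empty dataset")
    val overall = distribution(rows.map(_._2))
    rows.groupBy(_._1).values
      .map(cls => distance(distribution(cls.map(_._2)), overall))
      .max
  }

  // The dataset is t-close when no class exceeds the threshold t.
  def isTClose(rows: Seq[(String, String)], t: Double): Boolean =
    maxDistance(rows) <= t
}
```

For example, with two classes of three records each, where the dataset is half "flu" and half "cold" and one class holds two "flu" and one "cold", that class sits at distance (|2/3 - 1/2| + |1/3 - 1/2|) / 2 = 1/6 from the overall distribution.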

Example Output

The module generates comprehensive reports like this:

================================================================================
PRIVACY RISK ASSESSMENT REPORT
================================================================================

Overall Risk Score: 15/100
Risk Level: LOW ✓

--------------------------------------------------------------------------------
PRIVACY METRICS
--------------------------------------------------------------------------------
k-Anonymity Score: 3
l-Diversity Score: 3
t-Closeness Score: 0.167
Uniqueness Risk: 0.00%

--------------------------------------------------------------------------------
RECOMMENDATIONS
--------------------------------------------------------------------------------
1. k-anonymity: PASSED - Minimum group size is 3 (threshold: 3).
2. l-diversity: PASSED - Minimum diversity is 3 (threshold: 2).
3. t-closeness: PASSED - Maximum distribution distance is 0.167 (threshold: 0.300).
4. Uniqueness: PASSED - No highly unique records detected.
================================================================================
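
For readers unfamiliar with the first two metrics and the uniqueness figure in the report, the sketch below shows one common way to compute them on in-memory rows. It is not the module's Spark implementation: the names are hypothetical, l-diversity is taken in its "distinct values" form, and uniqueness is assumed to mean the fraction of records that are alone in their equivalence class.

```scala
// Illustrative sketch only; not Maskala's implementation.
object PrivacyMetricsSketch {
  // Each row pairs its quasi-identifier values with its sensitive value.
  type Row = (Seq[String], String)

  // Equivalence classes: rows that share the same quasi-identifier values.
  private def classes(rows: Seq[Row]): Iterable[Seq[Row]] = {
    require(rows.nonEmpty, "metrics are undefined for an empty dataset")
    rows.groupBy(_._1).values
  }

  // k-anonymity: size of the smallest equivalence class.
  def kAnonymity(rows: Seq[Row]): Int =
    classes(rows).map(_.size).min

  // Distinct l-diversity: fewest distinct sensitive values in any class.
  def lDiversity(rows: Seq[Row]): Int =
    classes(rows).map(_.map(_._2).distinct.size).min

  // Uniqueness (assumed definition): share of records alone in their class.
  def uniqueness(rows: Seq[Row]): Double =
    classes(rows).count(_.size == 1).toDouble / rows.size
}
```

A dataset passes a k-anonymity threshold of 3 when `kAnonymity` returns at least 3, and an l-diversity threshold of 2 when `lDiversity` returns at least 2.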

Integration with Existing Tools

The module works alongside Maskala's existing privacy tools, so risk can be measured before and after anonymization:

// Assess initial risk
val initialRisk = PrivacyRiskAssessment.assess(rawData, params)

// Apply anonymization based on recommendations
val anonymiser = new Anonymiser("config.yaml")
val anonymizedData = anonymiser(rawData)

// Verify improvement
val finalRisk = PrivacyRiskAssessment.assess(anonymizedData, params)
println(s"Risk Reduction: ${initialRisk.overallRiskScore - finalRisk.overallRiskScore} points")

Files Added

src/main/scala/org/mitchelllisle/analysers/PrivacyRiskAssessment.scala - Main risk assessment module
src/main/scala/org/mitchelllisle/analysers/TCloseness.scala - T-closeness privacy analyser
src/main/scala/org/mitchelllisle/examples/PrivacyRiskAssessmentExample.scala - Usage examples
src/test/scala/PrivacyRiskAssessmentTest.scala - 10 comprehensive tests
src/test/scala/TClosenessTest.scala - 5 comprehensive tests
PRIVACY_RISK_ASSESSMENT_SUMMARY.md - Implementation documentation

Testing

Added 15 new comprehensive tests (all passing)
All existing 47 tests still passing
Total test suite: 62/62 passing ✓

Documentation

Added "Privacy Risk Assessment" section to README with complete usage examples
Added "T-Closeness" section to README explaining the privacy principle
Included standalone example file demonstrating three common scenarios
Provided implementation summary document

This implementation helps data engineers ensure individual privacy and compliance by understanding and mitigating re-identification risks in their data.

Original prompt

This pull request was created as a result of the following prompt from Copilot chat.

Implement a privacy risk assessment module to evaluate user re-identification risks in Spark datasets. This feature should:

Detect quasi-identifiers in input datasets

Calculate privacy metrics such as k-anonymity, l-diversity, and t-closeness

Output risk scores and actionable recommendations for further anonymisation

Integrate seamlessly with Spark workflows

Provide example usage and documentation

The goal is to help data engineering teams ensure individual privacy and compliance by understanding and mitigating risks of re-identification in their data.

Co-authored-by: mitchelllisle <18128531+mitchelllisle@users.noreply.github.com>

Initial plan

96f8f00

Copilot AI assigned Copilot and mitchelllisle Oct 12, 2025

Copilot started work on behalf of mitchelllisle October 12, 2025 00:15

Copilot AI and others added 2 commits October 12, 2025 00:25

Implement privacy risk assessment module with t-closeness

89fb868

Co-authored-by: mitchelllisle <18128531+mitchelllisle@users.noreply.github.com>

Add implementation summary for privacy risk assessment module

fd89f52

Co-authored-by: mitchelllisle <18128531+mitchelllisle@users.noreply.github.com>

Copilot AI changed the title from "[WIP] Add privacy risk assessment module for Spark datasets" to "Add Privacy Risk Assessment Module with T-Closeness Support" Oct 12, 2025

Copilot finished work on behalf of mitchelllisle October 12, 2025 00:32

Copilot AI requested a review from mitchelllisle October 12, 2025 00:32

Copilot wants to merge 3 commits into main from copilot/add-privacy-risk-assessment-module

2 participants
