148 changes: 148 additions & 0 deletions PRIVACY_RISK_ASSESSMENT_SUMMARY.md
@@ -0,0 +1,148 @@
# Privacy Risk Assessment Module - Implementation Summary

## Overview
This implementation adds a comprehensive privacy risk assessment module to Maskala that evaluates re-identification risks in Spark datasets.

## What Was Implemented

### 1. Core Components

#### TCloseness Analyser (`TCloseness.scala`)
- Implements t-closeness privacy principle
- Measures distribution distance using Total Variation Distance
- Provides methods:
  - `apply()`: Calculates distribution distances for equivalence classes
  - `isTClose()`: Checks if dataset satisfies t-closeness
  - `removeLessThanTRows()`: Filters out non-compliant equivalence classes

#### Privacy Risk Assessment Module (`PrivacyRiskAssessment.scala`)
- Comprehensive privacy risk evaluation framework
- Key features:
  - **Automatic Quasi-Identifier Detection**: Uses heuristics based on column names and cardinality
  - **Multi-Metric Analysis**: Calculates k-anonymity, l-diversity, and t-closeness simultaneously
  - **Risk Scoring**: Generates overall risk score (0-100) based on all metrics
  - **Actionable Recommendations**: Provides specific guidance for improving privacy

##### Main Components:
- `RiskAssessmentResult`: Case class holding assessment results
- `PrivacyRiskParams`: Case class for configuration parameters
- `assess()`: Main method to perform comprehensive risk assessment
- `detectQuasiIdentifiers()`: Automatic quasi-identifier detection
- `generateReport()`: Creates formatted risk assessment report

### 2. Testing

#### TClosenessTest (5 tests)
- Tests for distribution closeness validation
- Tests for filtering non-compliant records
- Tests with multiple quasi-identifiers
- Tests with uniform distributions

#### PrivacyRiskAssessmentTest (10 tests)
- Automatic quasi-identifier detection tests
- Basic privacy risk assessment with k-anonymity
- Combined assessment with l-diversity
- Combined assessment with t-closeness
- Uniqueness risk calculation
- Report generation
- Tests without ID column
- Risk score comparison tests
- Column exclusion tests
- Cardinality-based detection tests

### 3. Documentation

#### README.md Updates
- New "Privacy Risk Assessment" section with:
  - Feature overview and key capabilities
  - Basic usage examples
  - Automatic quasi-identifier detection examples
  - Result interpretation guide
  - Integration with anonymization workflow
- New "T-Closeness" section with:
  - Concept explanation
  - Usage examples
  - Filtering examples

#### Example Code (`PrivacyRiskAssessmentExample.scala`)
- Three comprehensive examples:
  1. Basic privacy risk assessment
  2. Automatic quasi-identifier detection
  3. Before/after anonymization comparison

## Key Features Delivered

1. ✅ **Detects quasi-identifiers** - Automatic detection based on column names and cardinality
2. ✅ **Calculates k-anonymity** - Minimum group size in dataset
3. ✅ **Calculates l-diversity** - Diversity of sensitive attributes
4. ✅ **Calculates t-closeness** - Distribution distance from overall population
5. ✅ **Generates risk scores** - 0-100 overall risk score with component breakdown
6. ✅ **Provides recommendations** - Actionable guidance for improving privacy
7. ✅ **Seamless Spark integration** - Works naturally with DataFrames
8. ✅ **Comprehensive documentation** - Examples and usage guidance in README

## Privacy Metrics Explained

### K-Anonymity Score
- Represents the minimum group size in the dataset
- Higher values indicate better privacy (harder to single out individuals)
- Contributes up to 40 points to overall risk score
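
In plain terms, the k-anonymity score is the size of the smallest group of records sharing the same quasi-identifier values. A minimal illustration on hypothetical in-memory data — the module itself operates on Spark DataFrames, so this is a sketch of the idea, not its implementation:

```scala
// Hypothetical quasi-identifier tuples: (age, gender, zipcode)
val rows = Seq(
  ("30", "Male", "12345"),
  ("30", "Male", "12345"),
  ("45", "Female", "67890")
)

// Group rows by their full quasi-identifier tuple; k is the smallest group size.
val k = rows.groupBy(identity).values.map(_.size).min
// k == 1 here: the single ("45", "Female", "67890") record forms its own class.
```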

### L-Diversity Score
- Minimum number of distinct sensitive values in equivalence classes
- Higher values indicate better diversity
- Contributes up to 25 points to overall risk score

### T-Closeness Score
- Maximum distribution distance from overall population
- Lower values indicate better privacy (distributions are similar)
- Contributes up to 20 points to overall risk score

### Uniqueness Risk
- Ratio of records with uniqueness = 1 (highly identifiable)
- Lower values indicate better privacy
- Contributes up to 15 points to overall risk score

### Overall Risk Score
- Composite score from 0-100
- 0-20: Low risk ✓
- 20-40: Moderate risk ⚠
- 40-60: High risk ⚠⚠
- 60-100: Critical risk ⚠⚠⚠
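
The component contributions above can be sketched as a single scoring function. The 40/25/20/15 weights come from the sections above; the linear penalties and default thresholds are assumptions for illustration, not the module's actual scoring curves:

```scala
// Assumed linear combination of the four component risks into a 0-100 score.
def riskScore(k: Int, l: Int, tDistance: Double, uniquenessRisk: Double,
              kThreshold: Int = 3, lThreshold: Int = 2, tThreshold: Double = 0.3): Double = {
  val kRisk = 40.0 * math.max(0.0, 1.0 - k.toDouble / kThreshold)   // up to 40 points
  val lRisk = 25.0 * math.max(0.0, 1.0 - l.toDouble / lThreshold)   // up to 25 points
  val tRisk = 20.0 * math.min(1.0, tDistance / tThreshold)          // up to 20 points
  val uRisk = 15.0 * uniquenessRisk                                 // up to 15 points
  kRisk + lRisk + tRisk + uRisk
}

def riskLevel(score: Double): String =
  if (score <= 20) "Low"
  else if (score <= 40) "Moderate"
  else if (score <= 60) "High"
  else "Critical"
```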

## Usage Example

```scala
import org.apache.spark.sql.SparkSession
import org.mitchelllisle.analysers.{PrivacyRiskAssessment, PrivacyRiskParams}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val data = Seq(
  ("1", "30", "Male", "12345", "Heart Disease"),
  ("2", "30", "Male", "12345", "Diabetes")
  // ... more data
).toDF("patient_id", "age", "gender", "zipcode", "disease")

val params = PrivacyRiskParams(
  quasiIdentifiers = Seq("age", "gender", "zipcode"),
  sensitiveAttribute = Some("disease"),
  idColumn = Some("patient_id")
)

val result = PrivacyRiskAssessment.assess(data, params)
val report = PrivacyRiskAssessment.generateReport(result)
println(report)
```

## Testing Summary
- Total new tests: 15 (5 for TCloseness, 10 for PrivacyRiskAssessment)
- All tests passing ✓
- Existing tests still passing ✓
- Code compiles successfully ✓

## Integration Points
- Works with existing KAnonymity, LDiversity, and UniquenessAnalyser classes
- Compatible with Anonymiser workflow for iterative privacy improvement
- Follows existing code patterns and conventions in the repository
188 changes: 188 additions & 0 deletions README.md
@@ -73,6 +73,135 @@ analyse:
These methods are tools to aid in understanding and reducing re-identification risks and should be used as part of a
broader data protection strategy. Remember, no single method can ensure total data privacy and security.

### Privacy Risk Assessment

The Privacy Risk Assessment module provides a comprehensive evaluation of re-identification risks in your Spark datasets. It automatically detects quasi-identifiers, calculates multiple privacy metrics (k-anonymity, l-diversity, t-closeness), and generates actionable recommendations for further anonymization.

#### Key Features:
- **Automatic Quasi-Identifier Detection**: Identifies columns that could be used to re-identify individuals
- **Multi-Metric Analysis**: Evaluates k-anonymity, l-diversity, and t-closeness simultaneously
- **Risk Scoring**: Provides an overall risk score (0-100) for easy assessment
- **Actionable Recommendations**: Generates specific recommendations to improve privacy
- **Seamless Integration**: Works naturally with existing Spark workflows

#### Basic Usage

```scala
import org.apache.spark.sql.SparkSession
import org.mitchelllisle.analysers.{PrivacyRiskAssessment, PrivacyRiskParams}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Sample healthcare data
val patientData = Seq(
  ("1", "30", "Male", "12345", "Heart Disease"),
  ("2", "30", "Male", "12345", "Diabetes"),
  ("3", "30", "Male", "12345", "Flu"),
  ("4", "45", "Female", "67890", "Heart Disease"),
  ("5", "45", "Female", "67890", "Cancer"),
  ("6", "45", "Female", "67890", "Flu")
).toDF("patient_id", "age", "gender", "zipcode", "disease")

// Define privacy assessment parameters
val params = PrivacyRiskParams(
  quasiIdentifiers = Seq("age", "gender", "zipcode"),
  sensitiveAttribute = Some("disease"),
  idColumn = Some("patient_id")
)

// Perform comprehensive risk assessment
val result = PrivacyRiskAssessment.assess(
  data = patientData,
  params = params,
  kThreshold = 3,  // Minimum group size
  lThreshold = 2,  // Minimum diversity
  tThreshold = 0.3 // Maximum distribution distance
)

// Generate and print detailed report
val report = PrivacyRiskAssessment.generateReport(result)
println(report)
```

**Output:**
```
================================================================================
PRIVACY RISK ASSESSMENT REPORT
================================================================================

Overall Risk Score: 15/100
Risk Level: LOW ✓

--------------------------------------------------------------------------------
PRIVACY METRICS
--------------------------------------------------------------------------------
k-Anonymity Score: 3
l-Diversity Score: 3
t-Closeness Score: 0.167
Uniqueness Risk: 0.00%

--------------------------------------------------------------------------------
RECOMMENDATIONS
--------------------------------------------------------------------------------
1. k-anonymity: PASSED - Minimum group size is 3 (threshold: 3).
2. l-diversity: PASSED - Minimum diversity is 3 (threshold: 2).
3. t-closeness: PASSED - Maximum distribution distance is 0.167 (threshold: 0.300).
4. Uniqueness: PASSED - No highly unique records detected.

================================================================================
```

#### Automatic Quasi-Identifier Detection

Let the module automatically detect potential quasi-identifiers based on column names and cardinality:

```scala
val employeeData = Seq(
  ("1", "30", "Male", "12345", "Engineering", "80000"),
  ("2", "25", "Female", "12346", "Marketing", "70000"),
  ("3", "30", "Male", "12347", "Engineering", "85000")
).toDF("employee_id", "age", "gender", "zipcode", "department", "salary")

// Auto-detect quasi-identifiers
val detectedQuasiIds = PrivacyRiskAssessment.detectQuasiIdentifiers(
  data = employeeData,
  excludeColumns = Seq("employee_id", "salary") // Exclude ID and sensitive columns
)

println(s"Detected quasi-identifiers: ${detectedQuasiIds.mkString(", ")}")
// Output: Detected quasi-identifiers: age, gender, zipcode
```
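
The module's actual detection rules are not shown here; one plausible sketch combines a list of known quasi-identifier names with a cardinality check, where the name set and the 0.5 distinct-ratio cutoff are illustrative assumptions rather than the module's real heuristics:

```scala
// Hypothetical name + cardinality heuristic, over plain collections rather
// than a Spark DataFrame. Both the name set and the cutoff are assumptions.
val knownQuasiIdNames = Set("age", "gender", "zipcode", "birthdate", "city")

def looksLikeQuasiIdentifier(name: String, values: Seq[String]): Boolean = {
  // Low-to-moderate cardinality columns group individuals without uniquely
  // identifying them, which is what makes them useful quasi-identifiers.
  val distinctRatio = values.distinct.size.toDouble / values.size
  val nameMatch = knownQuasiIdNames.exists(n => name.toLowerCase.contains(n))
  nameMatch || distinctRatio < 0.5
}
```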

#### Understanding the Results

The `RiskAssessmentResult` contains:
- **kAnonymityScore**: Minimum group size in the dataset (higher is better)
- **lDiversityScore**: Minimum diversity in sensitive attributes (higher is better)
- **tClosenessScore**: Maximum distribution distance from overall distribution (lower is better)
- **uniquenessRisk**: Ratio of highly identifiable records (lower is better)
- **overallRiskScore**: Composite risk score from 0-100 (0 = lowest risk, 100 = highest risk)
- **recommendations**: List of specific actions to improve privacy

#### Integration with Anonymisation Workflow

Use the risk assessment to guide your anonymisation strategy:

```scala
// Step 1: Assess initial risk
val initialRisk = PrivacyRiskAssessment.assess(rawData, params)
println(s"Initial Risk Score: ${initialRisk.overallRiskScore.toInt}/100")

// Step 2: Apply anonymization based on recommendations
val anonymiser = new Anonymiser("config.yaml")
val anonymizedData = anonymiser(rawData)

// Step 3: Re-assess to verify improvement
val finalRisk = PrivacyRiskAssessment.assess(anonymizedData, params)
println(s"Final Risk Score: ${finalRisk.overallRiskScore.toInt}/100")
println(s"Risk Reduction: ${(initialRisk.overallRiskScore - finalRisk.overallRiskScore).toInt} points")
```

### KAnonymity
K-Anonymity is a concept in data privacy that aims to ensure an individual's information cannot be distinguished from at
least k-1 others in a dataset. Essentially, it means that each individual's data is indistinguishable from at least k-1
@@ -197,6 +326,65 @@ val result = kAnon.removeLessThanKRows(data)
* */
```

### T-Closeness
T-Closeness is a privacy principle that extends both K-Anonymity and ℓ-Diversity by requiring that the distribution of
sensitive attributes in any equivalence class is close to the distribution in the overall dataset. While ℓ-Diversity
ensures diversity of sensitive values, T-Closeness goes further by preventing skewed distributions that could still
reveal sensitive information. The "closeness" is measured using distance metrics between distributions, with a threshold
`t` that defines the maximum allowable distance. T-Closeness helps protect against attribute disclosure by ensuring that
the sensitive attribute values in each group don't differ significantly from the overall population distribution.
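
The implementation summary names Total Variation Distance as the distance metric: on discrete distributions it is half the sum of absolute probability differences, `TVD(P, Q) = ½ Σ|P(v) − Q(v)|`. A plain-Scala sketch of the computation — the `TCloseness` analyser itself works over Spark DataFrames, so this only illustrates the metric:

```scala
// Empirical distribution of a sensitive attribute as value -> relative frequency.
def distribution(values: Seq[String]): Map[String, Double] =
  values.groupBy(identity).map { case (v, vs) => v -> vs.size.toDouble / values.size }

// Half the L1 distance between two discrete distributions.
def totalVariationDistance(p: Map[String, Double], q: Map[String, Double]): Double =
  (p.keySet ++ q.keySet).toSeq
    .map(v => math.abs(p.getOrElse(v, 0.0) - q.getOrElse(v, 0.0)))
    .sum / 2.0

val overall = distribution(Seq("Flu", "Flu", "Cancer", "Flu")) // Flu 0.75, Cancer 0.25
val classA  = distribution(Seq("Flu", "Flu"))                  // Flu 1.0
val d = totalVariationDistance(classA, overall)                // 0.25
// The class satisfies t-closeness when d <= t.
```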

#### 1. Assessing T-Closeness
You can assess if your dataset satisfies T-Closeness by using the `isTClose` method:

```scala
import org.mitchelllisle.analysers.TCloseness
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

import spark.implicits._

val data = Seq(
  ("A", "Disease1"),
  ("A", "Disease2"),
  ("A", "Disease3"),
  ("B", "Disease1"),
  ("B", "Disease2"),
  ("B", "Disease3")
).toDF("QuasiIdentifier", "SensitiveAttribute")

val tClose = new TCloseness(t = 0.3) // Maximum distance threshold
val evaluated = tClose.isTClose(data, "SensitiveAttribute") // returns true
```

#### 2. Filtering out rows that aren't T-Close
If you want a dataset that only contains the rows that meet T-Closeness, you can use the `removeLessThanTRows` method:

```scala
import org.mitchelllisle.analysers.TCloseness
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

import spark.implicits._

val data = Seq(
  ("A", "Disease1"),
  ("A", "Disease1"),
  ("A", "Disease1"),
  ("B", "Disease2"),
  ("B", "Disease2"),
  ("B", "Disease2")
).toDF("QuasiIdentifier", "SensitiveAttribute")

val tClose = new TCloseness(t = 0.2)

val result = tClose.removeLessThanTRows(data, "SensitiveAttribute")
// Result contains only equivalence classes where the distribution of SensitiveAttribute
// is within the threshold distance from the overall distribution
```

### Uniqueness Analyzer
The `UniquenessAnalyser` class in `org.mitchelllisle.reidentifiability` package provides methods to analyze the
uniqueness of values within a DataFrame using Spark. Uniqueness is a proxy for re-identifiability, an important privacy