Commit 8733bc0

Merge pull request #115353 from luisquintanilla/AB1705439
Responsible ML | Differential Privacy How-To
2 parents e95e84c + 5eefaa8 commit 8733bc0

2 files changed: +249 −0 lines changed
Lines changed: 246 additions & 0 deletions
@@ -0,0 +1,246 @@
---
title: How to preserve data privacy using the WhiteNoise packages
titleSuffix: Azure Machine Learning
description: Learn how to apply differential privacy best practices to Azure Machine Learning models by using the WhiteNoise packages.
services: machine-learning
ms.service: machine-learning
ms.subservice: core
ms.topic: conceptual
ms.author: slbird
author: slbird
ms.reviewer: luquinta
ms.date: 05/17/2020
# Customer intent: As an experienced data scientist, I want to use differential privacy in Azure Machine Learning.
---

# Use differential privacy in Azure Machine Learning

[!INCLUDE [applies-to-skus](../../includes/aml-applies-to-basic-enterprise-sku.md)]

Learn how to apply differential privacy best practices to Azure Machine Learning models by using the WhiteNoise Python packages.

Differential privacy is the gold-standard definition of privacy. Systems that adhere to this definition of privacy provide strong assurances against a wide range of data reconstruction and reidentification attacks, including attacks by adversaries who possess auxiliary information. Learn more about how [differential privacy works](./concept-differential-privacy.md).

## Prerequisites

- If you don't have an Azure subscription, create a free account before you begin. Try the [free or paid version of Azure Machine Learning](https://aka.ms/AMLFree) today.
- [Python 3](https://www.python.org/downloads/)

## Install WhiteNoise packages

### Standalone installation

The libraries are designed to work in distributed Spark clusters, and can be installed just like any other package.

The instructions below assume that your `python` and `pip` commands are mapped to `python3` and `pip3`.

Use pip to install the [WhiteNoise Python packages](https://pypi.org/project/opendp-whitenoise/):

`pip install opendp-whitenoise`

To verify that the packages are installed, launch a Python prompt and type:

```python
import opendp.whitenoise.core
import opendp.whitenoise.sql
```

If the imports succeed, the libraries are installed and ready to use.

### Docker image

You can also use the WhiteNoise packages with Docker.

Pull the `opendp/whitenoise` image to use the libraries inside a Docker container that includes Spark, Jupyter, and sample code.

```sh
docker pull opendp/whitenoise:privacy
```

Once you've pulled the image, launch the Jupyter server:

```sh
docker run --rm -p 8989:8989 --name whitenoise-run opendp/whitenoise:privacy
```

This command starts a Jupyter server at port `8989` on your `localhost`, with password `pass@word99`. Assuming you used the command line above to start the container with the name `whitenoise-run`, you can open a bash terminal in the Jupyter server by running:

```sh
docker exec -it whitenoise-run bash
```

The Docker instance clears all state on shutdown, so you'll lose any notebooks you create in the running instance. To remedy this, bind mount a local folder to the container when you launch it:

```sh
docker run --rm -p 8989:8989 --name whitenoise-run --mount type=bind,source=/Users/your_name/my-notebooks,target=/home/privacy/my-notebooks opendp/whitenoise:privacy
```

Any notebooks you create under the *my-notebooks* folder are stored in your local filesystem.

## Perform data analysis

To prepare a differentially private release, you need to choose a data source, a statistic, and some privacy parameters that indicate the level of privacy protection.

This sample references the California Public Use Microdata Sample (PUMS), representing anonymized records of citizen demographics:

```python
import os
import sys
import numpy as np
import opendp.whitenoise.core as wn

data_path = os.path.join('.', 'data', 'PUMS_california_demographics_1000', 'data.csv')
var_names = ["age", "sex", "educ", "race", "income", "married", "pid"]
```

In this example, we compute the mean and the variance of the age. We use a total `epsilon` of 1.0 (epsilon is our privacy parameter), spreading our privacy budget across the two quantities we want to compute. Learn more about [privacy metrics](concept-differential-privacy.md#differential-privacy-metrics).

```python
with wn.Analysis() as analysis:
    # load data
    data = wn.Dataset(path = data_path, column_names = var_names)

    # get mean of age
    age_mean = wn.dp_mean(data = wn.cast(data['age'], type="FLOAT"),
                          privacy_usage = {'epsilon': .65},
                          data_lower = 0.,
                          data_upper = 100.,
                          data_n = 1000
                          )
    # get variance of age
    age_var = wn.dp_variance(data = wn.cast(data['age'], type="FLOAT"),
                             privacy_usage = {'epsilon': .35},
                             data_lower = 0.,
                             data_upper = 100.,
                             data_n = 1000
                             )
    analysis.release()

print("DP mean of age: {0}".format(age_mean.value))
print("DP variance of age: {0}".format(age_var.value))
print("Privacy usage: {0}".format(analysis.privacy_usage))
```

The results look something like the following:

```text
DP mean of age: 44.55598845931517
DP variance of age: 231.79044646429134
Privacy usage: approximate {
  epsilon: 1.0
}
```

There are some important things to note about this example. First, the `Analysis` object represents a data processing graph. In this example, the mean and variance are computed from the same source node. However, you can include more complex expressions that combine inputs with outputs in arbitrary ways.
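
For instance, here is a minimal sketch (using the same dataset, bounds, and total budget as above) that builds one `cast` node and reuses it as the input to both statistics, so the two releases share a node in the graph:

```python
with wn.Analysis() as analysis:
    data = wn.Dataset(path = data_path, column_names = var_names)

    # cast once and reuse the same node as input to both statistics
    age = wn.cast(data['age'], type="FLOAT")

    age_mean = wn.dp_mean(data = age,
                          privacy_usage = {'epsilon': .5},
                          data_lower = 0.,
                          data_upper = 100.,
                          data_n = 1000)
    age_var = wn.dp_variance(data = age,
                             privacy_usage = {'epsilon': .5},
                             data_lower = 0.,
                             data_upper = 100.,
                             data_n = 1000)
    analysis.release()
```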

The analysis graph includes `data_upper` and `data_lower` metadata, specifying the lower and upper bounds for ages. These values are used to precisely calibrate the noise to ensure differential privacy. They are also used when handling outliers or missing values.

Finally, the analysis graph keeps track of the total privacy budget spent.

You can use the library to compose more complex analysis graphs, with several mechanisms, statistics, and utility functions:

| Statistics | Mechanisms | Utilities |
| ------------- |------------|------------|
| Count | Gaussian | Cast |
| Histogram | Geometric | Clamping |
| Mean | Laplace | Digitize |
| Quantiles | | Filter |
| Sum | | Imputation |
| Variance/Covariance | | Transform |

See the [basic data analysis notebook](https://github.com/opendifferentialprivacy/whitenoise-samples/blob/master/analysis/basic_data_analysis.ipynb) for more details.

## Approximate utility of differentially private releases

Because differential privacy operates by calibrating noise, the utility of a release may vary depending on the privacy risk. Generally, the noise needed to protect each individual becomes negligible as sample sizes grow large, but it can overwhelm the result for releases that target a single individual. Analysts can review the accuracy information for a release to determine how useful the release is:

```python
with wn.Analysis() as analysis:
    # load data
    data = wn.Dataset(path = data_path, column_names = var_names)

    # get mean of age
    age_mean = wn.dp_mean(data = wn.cast(data['age'], type="FLOAT"),
                          privacy_usage = {'epsilon': .65},
                          data_lower = 0.,
                          data_upper = 100.,
                          data_n = 1000
                          )
    analysis.release()

print("Age accuracy is: {0}".format(age_mean.get_accuracy(0.05)))
```

The result of that operation should look similar to the following:

```text
Age accuracy is: 0.2995732273553991
```

This example computes the mean as above, and uses the `get_accuracy` function to request accuracy at an `alpha` of 0.05. An `alpha` of 0.05 represents a 95% interval, in that the released value will fall within the reported accuracy bounds about 95% of the time. In this example, the reported accuracy is 0.3, which means the released value will be within an interval of width 0.6 about 95% of the time. It is not correct to think of this value as an error bar, since the released value will fall outside the reported accuracy range at the rate specified by `alpha`, and values outside the range may be outside in either direction.

Analysts may query `get_accuracy` for different values of `alpha` to get narrower or wider confidence intervals, without incurring additional privacy cost.
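
As a minimal sketch (reusing `data_path` and `var_names` from above), you could compare the accuracy reported at several values of `alpha` for a single release:

```python
with wn.Analysis() as analysis:
    data = wn.Dataset(path = data_path, column_names = var_names)

    age_mean = wn.dp_mean(data = wn.cast(data['age'], type="FLOAT"),
                          privacy_usage = {'epsilon': .65},
                          data_lower = 0.,
                          data_upper = 100.,
                          data_n = 1000)
    analysis.release()

# Smaller alpha means higher coverage and therefore a wider (less precise) interval.
# Querying accuracy at several alphas does not spend any additional privacy budget.
for alpha in [0.01, 0.05, 0.10]:
    print("alpha = {0}, accuracy = {1}".format(alpha, age_mean.get_accuracy(alpha)))
```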

## Generate a histogram

The built-in `dp_histogram` function creates differentially private histograms over any of the following data types:

- A continuous variable, where the set of numbers has to be divided into bins
- A boolean or dichotomous variable that can take on only two values
- A categorical variable, where there are distinct categories enumerated as strings

Here is an example of an `Analysis` specifying bins for a continuous variable histogram:

```python
income_edges = list(range(0, 100000, 10000))

with wn.Analysis() as analysis:
    data = wn.Dataset(path = data_path, column_names = var_names)

    income_histogram = wn.dp_histogram(
        wn.cast(data['income'], type='int', lower=0, upper=100),
        edges = income_edges,
        upper = 1000,
        null_value = 150,
        privacy_usage = {'epsilon': 0.5}
    )
```
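
A categorical histogram follows the same pattern. The sketch below is illustrative only: the `educ` category labels and the use of `categories` with a string `null_value` are assumptions based on the PUMS sample data and the WhiteNoise samples repository, so check the histograms notebook linked below for the exact parameters:

```python
with wn.Analysis() as analysis:
    data = wn.Dataset(path = data_path, column_names = var_names)

    # categories lists the distinct string values the column can take;
    # records matching no category are counted under null_value
    education_histogram = wn.dp_histogram(
        data['educ'],
        categories = [str(i) for i in range(1, 17)],
        null_value = "-1",
        privacy_usage = {'epsilon': 0.5}
    )
```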

Because the individuals are disjointly partitioned among histogram bins, the privacy cost is incurred only once per histogram, even if the histogram includes many bins.

For more on histograms, see the [histograms notebook](https://github.com/opendifferentialprivacy/whitenoise-samples/blob/master/analysis/histograms.ipynb).

## Generate a covariance matrix

WhiteNoise offers three different functionalities with its `dp_covariance` function:

- Covariance between two vectors
- Covariance matrix of a matrix
- Cross-covariance matrix of a pair of matrices

Here is an example of computing a scalar covariance:

```python
with wn.Analysis() as analysis:
    wn_data = wn.Dataset(path = data_path, column_names = var_names)

    age_income_cov_scalar = wn.dp_covariance(
        left = wn.cast(wn_data['age'], type = "FLOAT"),
        right = wn.cast(wn_data['income'], type = "FLOAT"),
        privacy_usage = {'epsilon': 1.0},
        left_lower = 0.,
        left_upper = 100.,
        left_n = 1000,
        right_lower = 0.,
        right_upper = 500_000.,
        right_n = 1000)
```
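
The block above only defines the computation. As in the earlier examples, you still release the analysis and then read the node's value to obtain the noisy estimate; a short sketch of the same computation with the release step added:

```python
with wn.Analysis() as analysis:
    wn_data = wn.Dataset(path = data_path, column_names = var_names)

    age_income_cov_scalar = wn.dp_covariance(
        left = wn.cast(wn_data['age'], type = "FLOAT"),
        right = wn.cast(wn_data['income'], type = "FLOAT"),
        privacy_usage = {'epsilon': 1.0},
        left_lower = 0.,
        left_upper = 100.,
        left_n = 1000,
        right_lower = 0.,
        right_upper = 500_000.,
        right_n = 1000)

    # release the analysis so the differentially private estimate is computed
    analysis.release()

print("DP covariance of age and income: {0}".format(age_income_cov_scalar.value))
```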

For more information, see the [covariance notebook](https://github.com/opendifferentialprivacy/whitenoise-samples/blob/master/analysis/covariance.ipynb).

## Next steps

- Explore [WhiteNoise sample notebooks](https://github.com/opendifferentialprivacy/whitenoise-samples/tree/master/analysis).

articles/machine-learning/toc.yml

Lines changed: 3 additions & 0 deletions

```diff
@@ -261,6 +261,9 @@
   - name: Version & track datasets
     displayName: data, data set
     href: how-to-version-track-datasets.md
+  - name: Preserve data privacy
+    displayName: data,privacy,differential privacy
+    href: how-to-differential-privacy.md
   - name: Train models
     items:
     - name: Use estimators for ML
```
