---
title: How to preserve data privacy using the WhiteNoise packages
titleSuffix: Azure Machine Learning
description: Learn how to apply differential privacy best practices to Azure Machine Learning models by using the WhiteNoise packages.
services: machine-learning
ms.service: machine-learning
ms.subservice: core
ms.topic: conceptual
ms.author: slbird
author: slbird
ms.reviewer: luquinta
ms.date: 05/17/2020
# Customer intent: As an experienced data scientist, I want to use differential privacy in Azure Machine Learning.
---

# Use differential privacy in Azure Machine Learning

[!INCLUDE [applies-to-skus](../../includes/aml-applies-to-basic-enterprise-sku.md)]

Learn how to apply differential privacy best practices to Azure Machine Learning models by using the WhiteNoise Python packages.

Differential privacy is the gold-standard definition of privacy. Systems that adhere to this definition of privacy provide strong assurances against a wide range of data reconstruction and reidentification attacks, including attacks by adversaries who possess auxiliary information. Learn more about how [differential privacy works](./concept-differential-privacy.md).

## Prerequisites

- If you don't have an Azure subscription, create a free account before you begin. Try the [free or paid version of Azure Machine Learning](https://aka.ms/AMLFree) today.
- [Python 3](https://www.python.org/downloads/)

## Install WhiteNoise packages

### Standalone installation

The libraries are designed to work in distributed Spark clusters, and can be installed just like any other package.

The instructions below assume that your `python` and `pip` commands are mapped to `python3` and `pip3`.
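
To confirm the mapping, you can check which interpreter each command resolves to (a quick sanity check; the exact output depends on your installation):

```sh
python --version   # should report Python 3.x
pip --version      # should point to the same Python 3 installation
```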

Use pip to install the [WhiteNoise Python packages](https://pypi.org/project/opendp-whitenoise/).

`pip install opendp-whitenoise`

To verify that the packages are installed, launch a Python prompt and type:

```python
import opendp.whitenoise.core
import opendp.whitenoise.sql
```

If the imports succeed, the libraries are installed and ready to use.

### Docker image

You can also use WhiteNoise packages with Docker.

Pull the `opendp/whitenoise` image to use the libraries inside a Docker container that includes Spark, Jupyter, and sample code.

```sh
docker pull opendp/whitenoise:privacy
```

Once you've pulled the image, launch the Jupyter server:

```sh
docker run --rm -p 8989:8989 --name whitenoise-run opendp/whitenoise:privacy
```

This command starts a Jupyter server at port `8989` on your `localhost`, with password `pass@word99`. Because the command above names the container `whitenoise-run`, you can open a bash terminal in the running container by running:

```sh
docker exec -it whitenoise-run bash
```

The Docker instance clears all state on shutdown, so you will lose any notebooks you create in the running instance. To remedy this, you can bind-mount a local folder to the container when you launch it:

```sh
docker run --rm -p 8989:8989 --name whitenoise-run --mount type=bind,source=/Users/your_name/my-notebooks,target=/home/privacy/my-notebooks opendp/whitenoise:privacy
```

Any notebooks you create under the *my-notebooks* folder will be stored in your local filesystem.

## Perform data analysis

To prepare a differentially private release, you need to choose a data source, a statistic, and some privacy parameters that indicate the level of privacy protection.

This sample references the California Public Use Microdata Sample (PUMS), representing anonymized records of citizen demographics:

```python
import os
import sys
import numpy as np
import opendp.whitenoise.core as wn

# path to the sample dataset and the names of its columns
data_path = os.path.join('.', 'data', 'PUMS_california_demographics_1000', 'data.csv')
var_names = ["age", "sex", "educ", "race", "income", "married", "pid"]
```

In this example, we compute the mean and the variance of the age. We use a total `epsilon` of 1.0 (epsilon is our privacy parameter), spreading our privacy budget across the two quantities we want to compute. Learn more about [privacy metrics](concept-differential-privacy.md#differential-privacy-metrics).

```python
with wn.Analysis() as analysis:
    # load data
    data = wn.Dataset(path = data_path, column_names = var_names)

    # get mean of age
    age_mean = wn.dp_mean(data = wn.cast(data['age'], type="FLOAT"),
                          privacy_usage = {'epsilon': .65},
                          data_lower = 0.,
                          data_upper = 100.,
                          data_n = 1000
                          )
    # get variance of age
    age_var = wn.dp_variance(data = wn.cast(data['age'], type="FLOAT"),
                             privacy_usage = {'epsilon': .35},
                             data_lower = 0.,
                             data_upper = 100.,
                             data_n = 1000
                             )
analysis.release()

print("DP mean of age: {0}".format(age_mean.value))
print("DP variance of age: {0}".format(age_var.value))
print("Privacy usage: {0}".format(analysis.privacy_usage))
```

The results look something like the following:

```text
DP mean of age: 44.55598845931517
DP variance of age: 231.79044646429134
Privacy usage: approximate {
  epsilon: 1.0
}
```

There are some important things to note about this example. First, the `Analysis` object represents a data processing graph. In this example, the mean and variance are computed from the same source node. However, you can include more complex expressions that combine inputs with outputs in arbitrary ways.
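
As a minimal sketch of that idea (reusing `data_path` and `var_names` from above), the cast node below is created once and feeds two statistics, so the analysis graph has a single source node with two differentially private outputs:

```python
with wn.Analysis() as analysis:
    data = wn.Dataset(path = data_path, column_names = var_names)

    # one cast node, shared by both statistics
    age = wn.cast(data['age'], type="FLOAT")

    age_mean = wn.dp_mean(data = age,
                          privacy_usage = {'epsilon': .5},
                          data_lower = 0.,
                          data_upper = 100.,
                          data_n = 1000)
    age_var = wn.dp_variance(data = age,
                             privacy_usage = {'epsilon': .5},
                             data_lower = 0.,
                             data_upper = 100.,
                             data_n = 1000)
analysis.release()
```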

The analysis graph includes `data_upper` and `data_lower` metadata, specifying the lower and upper bounds for ages. These values are used to precisely calibrate the noise to ensure differential privacy. These values are also used in some handling of outliers or missing values.

Finally, the analysis graph keeps track of the total privacy budget spent.

You can use the library to compose more complex analysis graphs, with several mechanisms, statistics, and utility functions:

| Statistics          | Mechanisms | Utilities  |
| ------------------- | ---------- | ---------- |
| Count               | Gaussian   | Cast       |
| Histogram           | Geometric  | Clamping   |
| Mean                | Laplace    | Digitize   |
| Quantiles           |            | Filter     |
| Sum                 |            | Imputation |
| Variance/Covariance |            | Transform  |

See the [basic data analysis notebook](https://github.com/opendifferentialprivacy/whitenoise-samples/blob/master/analysis/basic_data_analysis.ipynb) for more details.

## Approximate utility of differentially private releases

Because differential privacy operates by calibrating noise, the utility of a release may vary with the level of privacy risk. Generally, the noise needed to protect each individual becomes negligible as sample sizes grow large, but it can overwhelm the result for releases that target a single individual. Analysts can review the accuracy information for a release to determine how useful the release is:

```python
with wn.Analysis() as analysis:
    # load data
    data = wn.Dataset(path = data_path, column_names = var_names)

    # get mean of age
    age_mean = wn.dp_mean(data = wn.cast(data['age'], type="FLOAT"),
                          privacy_usage = {'epsilon': .65},
                          data_lower = 0.,
                          data_upper = 100.,
                          data_n = 1000
                          )
analysis.release()

print("Age accuracy is: {0}".format(age_mean.get_accuracy(0.05)))
```

The result of that operation should look similar to the following:

```text
Age accuracy is: 0.2995732273553991
```

This example computes the mean as before, and uses the `get_accuracy` function to request accuracy at an `alpha` of 0.05. An `alpha` of 0.05 represents a 95% interval, in that the released value will fall within the reported accuracy bounds about 95% of the time. In this example, the reported accuracy is 0.3, which means the released value will be within an interval of width 0.6 about 95% of the time. It is not correct to think of this value as an error bar, because the released value falls outside the reported accuracy range at the rate specified by `alpha`, and values outside the range may lie outside in either direction.

Analysts may query `get_accuracy` for different values of `alpha` to get narrower or wider confidence intervals, without incurring additional privacy cost.
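
For example, this sketch reuses the `age_mean` release from above to compare interval widths at several `alpha` values; these queries consume no additional privacy budget:

```python
# smaller alpha -> higher confidence -> wider accuracy interval
for alpha in (0.01, 0.05, 0.10):
    print("alpha={0}: accuracy {1}".format(alpha, age_mean.get_accuracy(alpha)))
```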

## Generate a histogram

The built-in `dp_histogram` function creates differentially private histograms over any of the following data types:

- A continuous variable, where the set of numbers has to be divided into bins
- A boolean or dichotomous variable that can take on only two values
- A categorical variable, where there are distinct categories enumerated as strings

Here is an example of an `Analysis` specifying bins for a continuous-variable histogram:

```python
income_edges = list(range(0, 100000, 10000))

with wn.Analysis() as analysis:
    data = wn.Dataset(path = data_path, column_names = var_names)

    income_histogram = wn.dp_histogram(
        # clamp income to the full range covered by the bin edges
        wn.cast(data['income'], type='int', lower=0, upper=100000),
        edges = income_edges,
        upper = 1000,
        null_value = 150,
        privacy_usage = {'epsilon': 0.5}
    )
```

Because the individuals are disjointly partitioned among histogram bins, the privacy cost is incurred only once per histogram, even if the histogram includes many bins.
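
For a categorical variable, the bins are the category labels instead of numeric edges. The sketch below follows the pattern in the histograms sample notebook; it assumes that `educ` is coded as the strings "1" through "16", with "-1" for missing values, and that `dp_histogram` accepts `categories` and `null_value` arguments for categorical data:

```python
with wn.Analysis() as analysis:
    data = wn.Dataset(path = data_path, column_names = var_names)

    # assumed coding: education levels "1".."16", "-1" for missing values
    education_histogram = wn.dp_histogram(
        data['educ'],
        categories = [str(i) for i in range(1, 17)],
        null_value = "-1",
        privacy_usage = {'epsilon': 0.5}
    )
```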

For more on histograms, see the [histograms notebook](https://github.com/opendifferentialprivacy/whitenoise-samples/blob/master/analysis/histograms.ipynb).

## Generate a covariance matrix

WhiteNoise offers three different functionalities with its `dp_covariance` function:

- Covariance between two vectors
- Covariance matrix of a matrix
- Cross-covariance matrix of a pair of matrices

Here is an example of computing a scalar covariance:

```python
with wn.Analysis() as analysis:
    wn_data = wn.Dataset(path = data_path, column_names = var_names)

    # differentially private covariance between age and income
    age_income_cov_scalar = wn.dp_covariance(
        left = wn.cast(wn_data['age'], type = "FLOAT"),
        right = wn.cast(wn_data['income'], type = "FLOAT"),
        privacy_usage = {'epsilon': 1.0},
        left_lower = 0.,
        left_upper = 100.,
        left_n = 1000,
        right_lower = 0.,
        right_upper = 500_000.,
        right_n = 1000)
```
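
The matrix form follows the same pattern. The following is a hedged sketch along the lines of the covariance notebook linked below; it assumes the matrix variant accepts a two-column cast plus list-valued per-column bounds and a `data_rows` count:

```python
with wn.Analysis() as analysis:
    wn_data = wn.Dataset(path = data_path, column_names = var_names)

    # assumed: bounds are lists with one entry per column, in column order
    age_income_cov_matrix = wn.dp_covariance(
        data = wn.cast(wn_data['age', 'income'], type = "FLOAT"),
        privacy_usage = {'epsilon': 1.0},
        data_lower = [0., 0.],
        data_upper = [100., 500_000.],
        data_rows = 1000)
```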

For more information, see the [covariance notebook](https://github.com/opendifferentialprivacy/whitenoise-samples/blob/master/analysis/covariance.ipynb).

## Next steps

- Explore [WhiteNoise sample notebooks](https://github.com/opendifferentialprivacy/whitenoise-samples/tree/master/analysis).