Commit 803b660

Merge pull request #115274 from MicrosoftDocs/release-build-aml

Release build aml

2 parents 006854b + dd17e03 commit 803b660

13 files changed: +707 -8 lines changed
Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
---
title: Implement differential privacy with the WhiteNoise package
titleSuffix: Azure Machine Learning
description: Learn what differential privacy is and how the WhiteNoise package can help you implement differentially private systems that preserve data privacy.
author: luisquintanilla
ms.author: luquinta
ms.date: 05/03/2020
services: machine-learning
ms.service: machine-learning
ms.subservice: core
ms.topic: conceptual
#intent: As a data scientist, I want to know what differential privacy is and how WhiteNoise can help me implement a differentially private system.
---
# Preserve data privacy by using differential privacy and the WhiteNoise package

Learn what differential privacy is and how the WhiteNoise package can help you implement differentially private systems.

As the amount of data that an organization collects and uses for analyses increases, so do concerns of privacy and security. Analyses require data. Typically, the more data used to train models, the more accurate they are. When personal information is used for these analyses, it's especially important that the data remains private throughout its use.
## How differential privacy works

Differential privacy is a set of systems and practices that help keep the data of individuals safe and private.

> [!div class="mx-imgBorder"]
> ![Differential Privacy Process](./media/concept-differential-privacy/differential-privacy-process.jpg)

In traditional scenarios, raw data is stored in files and databases. When users analyze data, they typically use the raw data. This is a concern because it might infringe on an individual's privacy. Differential privacy tries to deal with this problem by adding "noise," or randomness, to the data so that users can't identify any individual data points. At the least, such a system provides plausible deniability.

In differentially private systems, data is shared through requests called **queries**. When a user submits a query for data, operations known as **privacy mechanisms** add noise to the requested data. Privacy mechanisms return an *approximation of the data* instead of the raw data. This privacy-preserving result appears in a **report**. Reports consist of two parts: the actual data computed and a description of how the data was created.
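To make the query, privacy mechanism, and report flow concrete, here's a minimal sketch of one common privacy mechanism, the Laplace mechanism. It's illustrative Python rather than the WhiteNoise API; the `laplace_mechanism` function, the sample data, and the report fields are assumptions made for this example.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Answer a query with noise instead of the raw value (illustrative, not the WhiteNoise API)."""
    # The noise scale grows as epsilon shrinks, so a lower epsilon means more privacy.
    scale = sensitivity / epsilon
    noisy_value = true_value + np.random.laplace(loc=0.0, scale=scale)
    # The report pairs the approximation with a description of how it was produced.
    return {
        "value": noisy_value,
        "mechanism": "Laplace",
        "epsilon": epsilon,
        "sensitivity": sensitivity,
    }

# Query: "How many people in this dataset are over 65?"
ages = np.array([34, 71, 52, 68, 80, 45, 90, 23])
true_count = int((ages > 65).sum())  # raw answer, never released directly

print(laplace_mechanism(true_count, sensitivity=1, epsilon=0.5))
```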
## Differential privacy metrics

Differential privacy tries to protect against the possibility that a user can produce an indefinite number of reports to eventually reveal sensitive data. A value known as **epsilon** measures how noisy or private a report is. Epsilon has an inverse relationship to noise or privacy: the lower the epsilon, the noisier (and more private) the data is.

Epsilon values are non-negative. Values below 1 provide full plausible deniability. Anything above 1 comes with a higher risk of exposure of the actual data. As you implement differentially private systems, you want to produce reports with epsilon values between 0 and 1.

Another value directly correlated with epsilon is **delta**. Delta is a measure of the probability that a report is not fully private. The higher the delta, the higher the epsilon. Because these values are correlated, epsilon is used more often.
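The inverse relationship between epsilon and noise is easy to see with the Laplace mechanism from the preceding sketch, where the noise scale is `sensitivity / epsilon`. This is an illustrative calculation, not a WhiteNoise computation:

```python
# For a counting query (sensitivity = 1), a smaller epsilon means a larger noise scale.
sensitivity = 1
for epsilon in [0.1, 0.5, 1.0, 3.0]:
    scale = sensitivity / epsilon
    print(f"epsilon={epsilon}: Laplace noise scale={scale:.2f}")
# Prints scales of 10.00, 2.00, 1.00, and 0.33: less epsilon, more noise, more privacy.
```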
## Privacy budget

To ensure privacy in systems where multiple queries are allowed, differential privacy defines a rate limit. This limit is known as a **privacy budget**. Privacy budgets are allocated an epsilon amount, typically between 1 and 3, to limit the risk of reidentification. As reports are generated, privacy budgets keep track of the epsilon value of individual reports as well as the aggregate for all reports. After a privacy budget is spent or depleted, users can no longer access data.
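Here's a minimal sketch of how a privacy budget could be tracked. The `PrivacyBudget` class and its behavior are assumptions made for illustration, not part of the WhiteNoise system:

```python
class PrivacyBudget:
    """Track epsilon spent across reports and refuse queries once the budget is depleted (illustrative)."""

    def __init__(self, total_epsilon=1.0):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        # Reject the query if answering it would exceed the budget.
        if self.spent + epsilon > self.total_epsilon:
            raise RuntimeError("Privacy budget depleted: no further reports allowed.")
        self.spent += epsilon
        return self.total_epsilon - self.spent  # remaining budget

budget = PrivacyBudget(total_epsilon=1.0)
print(budget.charge(0.5))   # 0.5 epsilon remaining
print(budget.charge(0.25))  # 0.25 epsilon remaining
budget.charge(0.5)          # raises RuntimeError: the budget is depleted
```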
## Reliability of data

Although the preservation of privacy should be the goal, there is a tradeoff when it comes to usability and reliability of the data. In data analytics, accuracy can be thought of as a measure of uncertainty introduced by sampling errors. This uncertainty tends to fall within certain bounds. **Accuracy** from a differential privacy perspective instead measures the reliability of the data, which is affected by the uncertainty introduced by the privacy mechanisms. In short, a higher level of noise or privacy translates to data that has a lower epsilon, accuracy, and reliability. Although the data is more private, its lower reliability makes it less likely to be used.
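To make the tradeoff concrete: for the illustrative Laplace mechanism in the earlier sketches, the error bound on a report at a given confidence level widens as epsilon shrinks. The bound below follows from the tail of the Laplace distribution, where P(|noise| > t) = exp(-t / scale); it illustrates the accuracy concept and isn't a WhiteNoise accuracy calculation:

```python
import math

def laplace_error_bound(sensitivity, epsilon, confidence=0.95):
    """Width t such that |noise| <= t with the given probability, for Laplace noise (illustrative)."""
    scale = sensitivity / epsilon
    return scale * math.log(1.0 / (1.0 - confidence))

# A lower epsilon gives more privacy, but a wider error bound and a less reliable report.
for epsilon in [0.1, 0.5, 1.0]:
    bound = laplace_error_bound(1, epsilon)
    print(f"epsilon={epsilon}: 95% of reports fall within +/- {bound:.1f} of the true count")
```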
## Implementing differentially private systems

Implementing differentially private systems is difficult. WhiteNoise is an open-source project that contains different components for building global differentially private systems. WhiteNoise is made up of the following top-level components:

- Core
- System

### Core

The core library includes the following components for implementing a differentially private system:

|Component |Description |
|---------|---------|
|Analysis | A graph description of arbitrary computations. |
|Validator | A Rust library that contains a set of tools for checking and deriving the necessary conditions for an analysis to be differentially private. |
|Runtime | The medium to execute the analysis. The reference runtime is written in Rust, but runtimes can be written using any computation framework, such as SQL or Spark, depending on your data needs. |
|Bindings | Language bindings and helper libraries to build analyses. Currently, WhiteNoise provides Python bindings. |
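To illustrate the division of labor that the table describes (an analysis declares a graph of computations, a validator checks that each step meets differential privacy conditions, and a runtime executes the graph), here's a deliberately simplified, library-free sketch. None of these names or structures come from the WhiteNoise bindings; they only mirror the roles in the table:

```python
import numpy as np

# An "analysis": a declared list of computations, not yet executed.
analysis = [
    {"op": "count", "column": "age", "predicate": lambda x: x > 65,
     "mechanism": "laplace", "epsilon": 0.5, "sensitivity": 1},
]

def validate(analysis):
    """Toy validator: every step must declare a mechanism and a positive epsilon."""
    for step in analysis:
        if "mechanism" not in step or step.get("epsilon", 0) <= 0:
            raise ValueError(f"Step {step['op']} is not differentially private.")

def run(analysis, data):
    """Toy runtime: execute each validated step and add the declared noise."""
    validate(analysis)
    reports = []
    for step in analysis:
        raw = sum(1 for value in data[step["column"]] if step["predicate"](value))
        noise = np.random.laplace(scale=step["sensitivity"] / step["epsilon"])
        reports.append({"op": step["op"], "value": raw + noise, "epsilon": step["epsilon"]})
    return reports

data = {"age": [34, 71, 52, 68, 80, 45, 90, 23]}
print(run(analysis, data))
```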
### System

The system library provides the following tools and services for working with tabular and relational data:

|Component |Description |
|---------|---------|
|Data Access | Library that intercepts and processes SQL queries and produces reports. This library is implemented in Python and supports the following ODBC and DBAPI data sources:<ul><li>PostgreSQL</li><li>SQL Server</li><li>Spark</li><li>Presto</li><li>Pandas</li></ul>|
|Service | Execution service that provides a REST endpoint to serve requests or queries against shared data sources. The service is designed to allow composition of differential privacy modules that operate on requests containing different delta and epsilon values, also known as heterogeneous requests. This reference implementation accounts for additional impact from queries on correlated data. |
|Evaluator | Stochastic evaluator that checks for privacy violations, accuracy, and bias. The evaluator supports the following tests: <ul><li>Privacy Test - Determines whether a report adheres to the conditions of differential privacy.</li><li>Accuracy Test - Measures whether the reliability of reports falls within the upper and lower bounds given a 95% confidence level.</li><li>Utility Test - Determines whether the confidence bounds of a report are close enough to the data while still maximizing privacy.</li><li>Bias Test - Measures the distribution of reports for repeated queries to ensure that they are not unbalanced.</li></ul> |
## Next steps

To learn how to use the components of WhiteNoise, check out the GitHub repositories for [WhiteNoise Core package](https://github.com/opendifferentialprivacy/whitenoise-core), [WhiteNoise System package](https://github.com/opendifferentialprivacy/whitenoise-system) and [WhiteNoise samples](https://github.com/opendifferentialprivacy/whitenoise-samples).
Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
---
title: 'Assess and mitigate fairness in machine learning models'
titleSuffix: Azure Machine Learning
description: Learn about fairness in machine learning models and how the Fairlearn Python package can help you build fairer models.
services: machine-learning
ms.service: machine-learning
ms.subservice: core
ms.topic: conceptual
ms.author: luquinta
author: luisquintanilla
ms.date: 05/02/2020
#Customer intent: As a data scientist, I want to learn about assessing and mitigating fairness in machine learning models.
---
# Fairness in machine learning models

Learn about fairness in machine learning and how the Fairlearn open-source Python package can help you build models that are more fair.

## What is fairness in machine learning systems?

Artificial intelligence and machine learning systems can display unfair behavior. One way to define unfair behavior is by its harm, or impact on people. There are many types of harm that AI systems can give rise to. Two common types of AI-caused harms are:

- Harm of allocation: An AI system extends or withholds opportunities, resources, or information. Examples include hiring, school admissions, and lending, where a model might be much better at picking good candidates among a specific group of people than among other groups.

- Harm of quality-of-service: An AI system does not work as well for one group of people as it does for another. As an example, a voice recognition system might fail to work as well for women as it does for men.

To reduce unfair behavior in AI systems, you have to assess and mitigate these harms.

> [!NOTE]
> Fairness is a socio-technical challenge. Many aspects of fairness, such as justice and due process, are not captured in quantitative fairness metrics. Also, the many quantitative fairness metrics can't all be satisfied simultaneously. The goal is to enable humans to assess different mitigation strategies and then make trade-offs that are appropriate to their scenario.
## Fairness assessment and mitigation with Fairlearn

Fairlearn is an open-source Python package that allows machine learning systems developers to assess their systems' fairness and mitigate the observed fairness issues.

Fairlearn has two components:

- Assessment Dashboard: A Jupyter notebook widget for assessing how a model's predictions affect different groups. It also enables comparing multiple models by using fairness and performance metrics.
- Mitigation Algorithms: A set of algorithms to mitigate unfairness in binary classification and regression.

Together, these components enable data scientists and business leaders to navigate any trade-offs between fairness and performance, and to select the mitigation strategy that best fits their needs.
## Fairness assessment

In Fairlearn, fairness is conceptualized through an approach known as **group fairness**, which asks: Which groups of individuals are at risk for experiencing harms?

The relevant groups, also known as subpopulations, are defined through **sensitive features** or sensitive attributes. Sensitive features are passed to a Fairlearn estimator as a vector or a matrix called `sensitive_features`. The term suggests that the system designer should be sensitive to these features when assessing group fairness. Something to be mindful of is whether these features carry privacy implications because of personally identifiable information. But the word "sensitive" doesn't imply that these features shouldn't be used to make predictions.

During the assessment phase, fairness is quantified through disparity metrics. **Disparity metrics** can evaluate and compare a model's behavior across different groups, either as ratios or as differences. Fairlearn supports two classes of disparity metrics (a sketch after the following list shows how to compute them):
- Disparity in model performance: This set of metrics calculates the disparity (difference) in the values of the selected performance metric across different subgroups. Some examples include:

  - disparity in accuracy rate
  - disparity in error rate
  - disparity in precision
  - disparity in recall
  - disparity in MAE
  - many others

- Disparity in selection rate: This metric measures the difference in selection rate among different subgroups. An example of this is disparity in loan approval rate. Selection rate is the fraction of data points in each class classified as 1 (in binary classification), or the distribution of prediction values (in regression).
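Here's a minimal sketch of that kind of assessment with Fairlearn's metrics module. It assumes a recent Fairlearn release (the `MetricFrame` API; older releases exposed different helpers), and the model and data are placeholders:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from fairlearn.metrics import MetricFrame, selection_rate

# Placeholder training data: two features, a binary label, and a sensitive feature.
X = [[0, 1], [1, 1], [2, 0], [3, 1], [4, 0], [5, 1], [6, 0], [7, 1]]
y = [0, 0, 0, 1, 1, 1, 1, 0]
sex = ["female", "male", "female", "male", "female", "male", "female", "male"]

model = LogisticRegression().fit(X, y)
y_pred = model.predict(X)

# Evaluate accuracy and selection rate per subgroup, then the disparity between groups.
metrics = MetricFrame(
    metrics={"accuracy": accuracy_score, "selection_rate": selection_rate},
    y_true=y,
    y_pred=y_pred,
    sensitive_features=sex,
)
print(metrics.by_group)                             # metric values for each group
print(metrics.difference(method="between_groups"))  # disparity expressed as a difference
```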
## Unfairness mitigation

### Parity constraints

Fairlearn includes a variety of unfairness mitigation algorithms. These algorithms support a set of constraints on the predictor's behavior called **parity constraints** or criteria. Parity constraints require some aspects of the predictor's behavior to be comparable across the groups that sensitive features define (e.g., different races). Fairlearn's mitigation algorithms use such parity constraints to mitigate the observed fairness issues.
Fairlearn supports the following types of parity constraints:

|Parity constraint | Purpose |Machine learning task |
|---------|---------|---------|
|Demographic parity | Mitigate allocation harms | Binary classification, Regression |
|Equalized odds | Diagnose allocation and quality-of-service harms | Binary classification |
|Bounded group loss | Mitigate quality-of-service harms | Regression |
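To quantify how far a model is from the first two constraints in the table, Fairlearn also provides ready-made disparity metrics. A minimal sketch, assuming a recent Fairlearn release and placeholder labels, predictions, and a sensitive feature:

```python
from fairlearn.metrics import demographic_parity_difference, equalized_odds_difference

# Placeholder labels, predictions, and sensitive feature (same shape as the earlier example).
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 1, 0, 0]
sex = ["female", "male", "female", "male", "female", "male", "female", "male"]

# A value of 0 means the constraint is satisfied exactly; larger values mean more disparity.
dpd = demographic_parity_difference(y_true, y_pred, sensitive_features=sex)
eod = equalized_odds_difference(y_true, y_pred, sensitive_features=sex)

print(f"Demographic parity difference: {dpd:.2f}")
print(f"Equalized odds difference: {eod:.2f}")
```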
### Mitigation algorithms

Fairlearn provides reduction and post-processing unfairness mitigation algorithms:

- Reduction: These algorithms take a standard black-box ML estimator (e.g., a LightGBM model) and generate a set of retrained models by using a sequence of re-weighted training datasets. For example, applicants of a certain gender might be up-weighted or down-weighted to retrain models and reduce disparities across different gender groups. Users can then pick a model that provides the best trade-off between accuracy (or another performance metric) and disparity, which generally would need to be based on business rules and cost calculations. A sketch that applies one of these algorithms appears after the following table.
- Post-processing: These algorithms take an existing classifier and the sensitive feature as input. Then, they derive a transformation of the classifier's prediction to enforce the specified fairness constraints. The biggest advantage of this approach is its simplicity and flexibility: it doesn't need to retrain the model.
| Algorithm | Description | Machine learning task | Sensitive features | Supported parity constraints | Algorithm type |
| --- | --- | --- | --- | --- | --- |
| `ExponentiatedGradient` | Black-box approach to fair classification described in [A Reductions Approach to Fair Classification](https://arxiv.org/abs/1803.02453) | Binary classification | Categorical | [Demographic parity](#parity-constraints), [equalized odds](#parity-constraints) | Reduction |
| `GridSearch` | Black-box approach described in [A Reductions Approach to Fair Classification](https://arxiv.org/abs/1803.02453) | Binary classification | Binary | [Demographic parity](#parity-constraints), [equalized odds](#parity-constraints) | Reduction |
| `GridSearch` | Black-box approach that implements a grid-search variant of fair regression with the algorithm for bounded group loss described in [Fair Regression: Quantitative Definitions and Reduction-based Algorithms](https://arxiv.org/abs/1905.12843) | Regression | Binary | [Bounded group loss](#parity-constraints) | Reduction |
| `ThresholdOptimizer` | Postprocessing algorithm based on the paper [Equality of Opportunity in Supervised Learning](https://arxiv.org/abs/1610.02413). This technique takes as input an existing classifier and the sensitive feature, and derives a monotone transformation of the classifier's prediction to enforce the specified parity constraints. | Binary classification | Categorical | [Demographic parity](#parity-constraints), [equalized odds](#parity-constraints) | Post-processing |
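Here's a minimal sketch of the reduction approach: `ExponentiatedGradient` with a demographic parity constraint wrapped around a standard scikit-learn estimator. It assumes a recent Fairlearn release and uses placeholder data of the same shape as the earlier examples; treat it as an outline rather than a complete training workflow.

```python
from sklearn.tree import DecisionTreeClassifier
from fairlearn.reductions import ExponentiatedGradient, DemographicParity

# Placeholder features, labels, and sensitive feature.
X = [[0, 1], [1, 1], [2, 0], [3, 1], [4, 0], [5, 1], [6, 0], [7, 1]]
y = [0, 0, 0, 1, 1, 1, 1, 0]
sex = ["female", "male", "female", "male", "female", "male", "female", "male"]

# Wrap a standard estimator in the reduction algorithm with a parity constraint.
mitigator = ExponentiatedGradient(
    estimator=DecisionTreeClassifier(max_depth=3),
    constraints=DemographicParity(),
)
mitigator.fit(X, y, sensitive_features=sex)

# Predictions now come from the mitigated model rather than the original estimator.
print(mitigator.predict(X))
```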
## Next steps

- To learn how to use the different components, check out the [Fairlearn GitHub repository](https://github.com/fairlearn/fairlearn/) and [sample notebooks](https://github.com/fairlearn/fairlearn/tree/master/notebooks).
- Learn about preserving data privacy by using [differential privacy and the WhiteNoise package](concept-differential-privacy.md).
