
Commit 391443b

BDT content update
1 parent d639e51 commit 391443b

6 files changed: +52, -32 lines changed


content/english/technical-tools/BDT.md

Lines changed: 52 additions & 32 deletions
@@ -15,11 +15,9 @@ quick_navigation:
url: '#web-app'
- title: Source code
url: '#source-code'
- title: Anomaly detection algorithm
url: '#HBAC'
- title: Scientific paper and audit report
url: '#scientific-paper'
- title: Local-first computing
- title: Local-first architecture
url: '#local-first'
- title: Supported by
url: '#supported-by'
@@ -64,14 +62,18 @@ team:
name: Joel Persson PhD
bio: |
Research Scientist, Spotify
- image: /images/people/JParie.jpg
name: Jurriaan Parie
bio: |
Director, Algorithm Audit
- image: /images/people/KPadh.jpeg
name: Kirtan Padh
bio: |
PhD-candidate Causal Inference and Machine Learning, TU München
- image: /images/people/KProrokovic.jpeg
name: Krsto Proroković
bio: |
PhD-candidate, Swiss AI Lab IDSIA
Freelance software developer and AI researcher
- image: /images/people/MJorgensen.jpeg
name: Mackenzie Jorgensen PhD
bio: |
@@ -85,23 +87,19 @@ type: bias-detection-tool

<br>

#### What is the tool about?

The tool identifies groups where an algorithm or AI system shows variations in performance. This type of monitoring is referred to as *anomaly detection*. To identify anomalous patterns, the tool uses <a href="https://en.wikipedia.org/wiki/Cluster_analysis" target="_blank">clustering</a>. Clustering is a form of *unsupervised learning*. This means detecting disparate treatment (bias) does not require any data on protected attributes of users, such as gender, nationality, or ethnicity. The metric used to measure bias can be manually selected and is referred to as the `bias metric`.

#### What data can be processed?

The tool processes all data in table format. The type of data (numerical, categorical, time, etc.) is automatically detected. One column must be selected as the `bias metric` – which should be a numerical value. The user must specify whether a high or low value of the `bias metric` is considered better. For example: for an error rate, a low value is better, while for accuracy, a high value is better.
#### What does the tool do?
The tool helps find groups where an AI system or algorithm performs differently, which could indicate unfair treatment or bias. It does this using a technique called <a href="https://en.wikipedia.org/wiki/Cluster_analysis" target="_blank">clustering</a>, which groups similar data points together (in a cluster). The tool doesn’t need information like gender, nationality, or ethnicity to find these patterns. Instead, it uses a `bias metric` to measure deviations in the performance of the algorithmic system, which you can choose based on your data.

The tool contains a demo dataset for which output is generated. Hit the 'Try it out' button.
#### What kind of data does it work with?
The tool works with data in a table format, consisting solely of numbers or categories. You just need to pick one column in the data to use as the `bias metric`. This column should contain numbers only, and you’ll specify whether a higher or lower number is better. For example, if you’re looking at error rates, lower numbers are better; for accuracy, higher numbers are better. The tool also comes with a demo dataset you can use by clicking "Try it out."

<div>
<p><u>Example of numerical data set</u>:</p>
<style type="text/css">.tg{border-collapse:collapse;border-spacing:0}.tg td{border-color:#000;border-style:solid;border-width:1px;font-size:14px;overflow:hidden;padding:10px 5px;word-break:normal}.tg th{border-color:#000;border-style:solid;border-width:1px;font-size:14px;font-weight:400;overflow:hidden;padding:10px 5px;word-break:normal}.tg .tg-uox0{border-color:#grey;font-weight:700;text-align:left;vertical-align:top}.tg .tg-uoz0{border-color:#grey;text-align:left;vertical-align:top} .tg-1wig{font-weight:700;text-align:left;vertical-align:top}.tg .tg-0lax{text-align:left;vertical-align:top}</style>
<table class="tg">
<thead>
<tr>
<th class="tg-uox0">Age</th><th class="tg-uox0">Income</th><th class="tg-uox0">...</th><th class="tg-uox0">Number of cars</th><th class="tg-uox0"><span style="font-family:SFMono-Regular,Menlo,Monaco,Consolas,liberation mono,courier new,monospace; color:#e83e8c; font-weight:300">Selected for control</span></th>
<th class="tg-uox0">Age</th><th class="tg-uox0">Income</th><th class="tg-uox0">...</th><th class="tg-uox0">Number of cars</th><th class="tg-uox0"><span style="font-family:SFMono-Regular,Menlo,Monaco,Consolas,liberation mono,courier new,monospace; color:#e83e8c; font-weight:300">Selected for investigation</span></th>
</tr>
</thead>
<tbody>
@@ -114,21 +112,52 @@ The tool contains a demo data for which output is generated. Hit the 'Try it out
</div>
<br>
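
To make the data requirements concrete, below is a minimal Python sketch of how a table like the example above could be prepared. The file name, column names and the pandas-based workflow are illustrative assumptions, not part of the tool itself.

```python
import pandas as pd

# Hypothetical input file laid out like the example table above.
df = pd.read_csv("control_decisions.csv")

# One numerical column is chosen as the bias metric.
bias_metric_column = "Selected for investigation"   # illustrative column name

# Specify whether a higher or a lower value of the bias metric is better;
# for an investigation flag or an error rate, lower is better.
higher_is_better = False

# All remaining columns are the features used for clustering.
features = df.drop(columns=[bias_metric_column])
bias_metric = df[bias_metric_column].astype(float)
```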

#### What does the tool return?
#### What results does it give?
The tool finds groups (clusters) where the performance of the algorithmic system is significantly different. It highlights the group with the worst performance and creates a bias analysis report, which you can download as a PDF. You can also download all the identified groups in a .json file. Additionally, the tool provides visual summaries of the results, helping experts dive deeper into the identified deviations.

#### Is my data safe?
Yes! Your data stays on your computer and never leaves your organization’s environment. The tool runs directly in your browser, using your computer’s power to analyze the data. This setup, called 'local-first', ensures no data is sent to cloud providers or third parties. Instructions for hosting the tool securely within your organization are available on <a href="https://github.com/NGO-Algorithm-Audit/local-first-web-tool" target="_blank">Github</a>.

Try the tool below ⬇️

{{< container_close >}}

<!-- Technical details -->

{{< container_open isAccordion="true" title="Technical details – Unsupervised bias detection tool" id="technical-introduction" >}}

<br>

#### Steps undertaken by the tool
The unsupervised bias detection tool performs the following steps:

The tool identifies deviating clusters. A summary of the results is made available in a bias analysis report that can be downloaded as a pdf. All identified clusters can be downloaded in a .json file. The tool specifically focuses on the most negatively deviating cluster and provides a description of this cluster. These results serve as a starting point for further investigation by domain experts, who can assess whether the observed disparities are indeed undesirable. The tool also visualizes the outcomes.
##### Prepared by the user:
<span style="color:#005AA7">1. Dataset:</span> The data must be provided in a tabular format. All columns, except the bias metric column, should have uniform data types, e.g., either all numerical or all categorical. The bias metric column must be numerical. Any missing values should be removed or replaced. The dataset should then be divided into training and testing subsets, following an 80-20 ratio.

#### Overview of process
<span style="color:#005AA7">2. Bias metric:</span> The user selects one column from the dataset to serve as the `bias metric`. In step 3, clustering will be performed based on this chosen `bias metric`, which must be numerical. Examples include metrics such as "being classified as high risk", "error rate" or "selected for an investigation".

##### Performed by the tool:
<span style="color:#005AA7">3. Hierarchical Bias-Aware Clustering (HBAC):</span> The HBAC algorithm (detailed below) is applied to the training dataset. The centroids of the resulting clusters are saved and later used to assign cluster labels to data points in the test dataset.

<span style="color:#005AA7">4. Testing differences in bias metric:</span> Statistical hypothesis testing is performed to evaluate whether the most deviating cluster contains significantly more bias compared to the rest of the dataset. A two-sample t-test is used to compare the bias metrics between clusters. For multiple hypothesis testing, Bonferroni correction should be applied. Further details are available in our [scientific paper](/technical-tools/bdt/#scientific-paper).

A schematic overview of the above steps is depicted below, followed by a short code sketch of the data split and hypothesis test.

<div style="margin-bottom:50px; display: flex; justify-content: center;">
<img src="/images/BDT/overview_tool.png" alt="drawing" width="600px"/>
<img src="/images/BDT/overview_tool.png" alt="drawing" width="800px"/>
</div>
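
As a rough illustration of the data split (step 1) and the hypothesis test (step 4) – not the tool's actual code – the sketch below uses synthetic data and plain k-means as a stand-in for the HBAC step; all names, thresholds and numbers are illustrative assumptions.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Illustrative synthetic data: four numerical feature columns and a numerical
# bias metric (e.g. an error indicator, where lower is better).
X = rng.normal(size=(500, 4))
y = rng.binomial(1, 0.1 + 0.2 * (X[:, 0] > 1)).astype(float)

# Step 1 (user): 80-20 split into training and test data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Stand-in for step 3: plain k-means fitted on the training data assigns cluster
# labels to the test data; the actual tool uses the HBAC procedure sketched below.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)
test_labels = kmeans.predict(X_test)

# Step 4 (tool): two-sample t-test per cluster, comparing the bias metric inside
# the cluster against the rest of the test data, with Bonferroni correction.
clusters = np.unique(test_labels)
alpha = 0.05 / len(clusters)   # Bonferroni-corrected significance level
for c in clusters:
    in_cluster, rest = y_test[test_labels == c], y_test[test_labels != c]
    _, p_value = ttest_ind(in_cluster, rest, equal_var=False)
    print(f"cluster {c}: mean bias metric {in_cluster.mean():.3f}, "
          f"p = {p_value:.3f} ({'significant' if p_value < alpha else 'not significant'} "
          f"after Bonferroni correction)")
```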

#### How is my data processed?
#### How does the clustering algorithm work?
The *Hierarchical Bias-Aware Clustering* (HBAC) algorithm identifies clusters in the provided dataset based on a user-defined `bias metric`. The objective is to find clusters with low variation in the bias metric within each cluster and significant variation between clusters. HBAC iteratively finds clusters in the data using k-means (for numerical data) or k-modes clustering (for categorical data). For the initial split, HBAC takes the full dataset and splits it into two clusters. Cluster `C` – with the highest standard deviation of the bias metric – is selected. Then, cluster `C` is divided into two candidate clusters `C'` and `C''`. If the average bias metric in either candidate cluster exceeds the average bias metric in `C`, the candidate cluster with the highest bias metric is selected as a new cluster. This process repeats until the maximum number of iterations (`max_iterations`) is reached or the resulting cluster fails to meet the minimum size requirement (`n_min`). The pseudo-code of the HBAC algorithm is provided below, together with a simplified code sketch.

The tool is privacy-friendly because the data is processed entirely within the browser. The data does not leave your computer or the environment of your organization. The tool utilizes the computing power of your own computer to analyze the data. This type of browser-based software is referred to as *local-first*. The tool does not upload data to third parties, such as cloud providers. Instructions on how the tool and local-first architecture can be hosted locally within your own organization can be found on <a href="https://github.com/NGO-Algorithm-Audit/local-first-web-tool" target="_blank">Github</a>.
<div style="display: flex; justify-content: center;">
<img src="/images/BDT/pseudo_code_HBAC.png" alt="drawing" width="800px"/>
</div>
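
To make the loop described above concrete, here is a minimal Python sketch of the splitting procedure (numerical data only, using scikit-learn's k-means for the candidate splits). It is a simplified illustration and may differ in details from the implementation in the `unsupervised-bias-detection` package.

```python
import numpy as np
from sklearn.cluster import KMeans

def hbac_sketch(X, bias, max_iterations=10, n_min=20, random_state=0):
    """Simplified sketch of the HBAC splitting loop described above.

    X    : 2-D array of numerical features
    bias : 1-D array with the bias metric (higher = more bias in this sketch)
    Returns an integer cluster label per row.
    """
    labels = np.zeros(len(X), dtype=int)   # start with everything in one cluster
    next_label = 1
    for _ in range(max_iterations):
        # Select cluster C with the highest standard deviation of the bias metric.
        c = max(np.unique(labels), key=lambda l: bias[labels == l].std())
        idx = np.where(labels == c)[0]
        if len(idx) < 2 * n_min:
            break                            # C is too small to split any further
        # Split C into two candidate clusters C' and C'' with k-means.
        split = KMeans(n_clusters=2, n_init=10, random_state=random_state).fit_predict(X[idx])
        means = [bias[idx[split == k]].mean() for k in (0, 1)]
        sizes = [np.sum(split == k) for k in (0, 1)]
        # Accept the split only if a candidate has a higher mean bias metric than C
        # and both candidates meet the minimum cluster size n_min.
        if max(means) > bias[idx].mean() and min(sizes) >= n_min:
            worst = int(np.argmax(means))    # candidate with the highest bias metric
            labels[idx[split == worst]] = next_label
            next_label += 1
        else:
            break
    return labels

# Illustrative use: cluster_labels = hbac_sketch(X_train, y_train, max_iterations=5, n_min=30)
```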

Try the tool below ⬇️
The HBAC-algorithm was introduced by Misztal-Radecka and Indurkhya in a [scientific article](https://www.sciencedirect.com/science/article/abs/pii/S0306457321000285) published in *Information Processing and Management* in 2021. Our implementation advances this work by adding methodological checks to distinguish real bias from noise, such as sample splitting, statistical hypothesis testing and measuring cluster stability. Algorithm Audit's implementation of the algorithm can be found in the <a href="https://github.com/NGO-Algorithm-Audit/unsupervised-bias-detection/blob/master/README.md" target="_blank">unsupervised-bias-detection</a> pip package.

#### How should the results of the tool be interpreted?
The HBAC algorithm maximizes the difference in the bias metric between clusters. To avoid wrongly concluding that there is bias in the decision-making process under review when there truly is none, we split the dataset into training and test data and apply statistical hypothesis testing on the test data. If statistically significant bias is detected, the outcome of the tool serves as a starting point for human experts to assess potential discrimination in the decision-making process.

{{< container_close >}}

@@ -145,16 +174,8 @@ Try the tool below ⬇️
{{< container_open title="Source code" id="source-code" icon="fas fa-toolbox" >}}

* The source code of the anomaly detection algorithm is available on <a href="https://github.com/NGO-Algorithm-Audit/unsupervised-bias-detection" target="_blank">Github</a> and as a <a href="https://pypi.org/project/unsupervised-bias-detection/" target="_blank">pip package</a>: `pip install unsupervised-bias-detection`.
[![!pypi](https://img.shields.io/pypi/v/unsupervised-bias-detection?logo=pypi\&color=blue)](https://pypi.org/project/unsupervised-bias-detection/)
* The architecture to run web apps local-first is also available on <a href="https://github.com/NGO-Algorithm-Audit/local-first-web-tool" target="_blank">Github</a>.

{{< container_close >}}

<!-- Anomaly detection algorithm -->

{{< container_open title="Anomaly detection algorithm – Hierarchical Bias-Aware Clustering (HBAC)" icon="fas fa-code-branch" id="HBAC" >}}

The tool uses the *Hierarchical Bias-Aware Clustering* (HBAC) algorithm. HBAC processes input data according to the k-means (for numerical data) or k-modes (for categorical data) clustering algorithm. The HBAC-algorithm was introduced by Misztal-Radecka and Indurkhya in a [scientific article](https://www.sciencedirect.com/science/article/abs/pii/S0306457321000285) published in *Information Processing and Management* (2021). Our implementation of the HBAC-algorithm, including additional methodological checks to distinguish real bias from noise, such as sample splitting, statistical hypothesis testing and measuring cluster stability, can be found in the <a href="https://github.com/NGO-Algorithm-Audit/unsupervised-bias-detection/blob/master/README.md" target="_blank">unsupervised-bias-detection</a> pip package.
* The architecture to run web apps local-first is also available on <a href="https://github.com/NGO-Algorithm-Audit/local-first-web-tool" target="_blank">Github</a>.

{{< container_close >}}

@@ -174,9 +195,8 @@ The unsupervised bias detection tool has been applied in practice to audit a Dut

<br>

#### What is local-first computing?

Local-first computing is the opposite of cloud computing: the data is not uploaded to third parties, such as cloud providers, and is processed by your own computer. The data attached to the tool therefore doesn't leave your computer or the environment of your organization. The tool is privacy-friendly because the data can be processed within the mandate of your organization and doesn't need to be shared with new parties. The unsupervised bias detection tool can also be hosted locally within your organization. Instructions, including the source code of the web app, can be found on <a href="https://github.com/NGO-Algorithm-Audit/local-first-web-tool" target="_blank">Github</a>.
#### What is local-first?
Local-first computing is the opposite of cloud computing: the data is not uploaded to third parties, such as cloud providers, but is processed by your own computer. The data attached to the tool therefore doesn't leave your computer or the environment of your organization. The tool is privacy-friendly because the data can be processed within the mandate of your organization and doesn't need to be shared with new parties. The unsupervised bias detection tool can also be hosted locally within your organization. Instructions, including the source code of the web app, can be found on <a href="https://github.com/NGO-Algorithm-Audit/local-first-web-tool" target="_blank">Github</a>.

#### Overview of local-first architecture

5 binary files changed (1.82 KB, 408 Bytes, 110 KB, 126 KB, 145 KB)
