#### What does the tool do?
The tool helps find groups where an AI system or algorithm performs differently, which could indicate unfair treatment. This type of monitoring is called *anomaly detection*. It detects deviations using a technique called <a href="https://en.wikipedia.org/wiki/Cluster_analysis" target="_blank">clustering</a>, which groups similar data points together (in clusters). The tool doesn’t need information like gender, nationality, or ethnicity to find deviations. Instead, it uses a `bias variable` to measure deviations in the performance of the system, which you can choose based on your data.
#### What results does it give?
The tool finds groups (clusters) where the performance of the algorithmic system deviates significantly. It highlights the group with the worst bias variable values and creates a report called a bias analysis report, which you can download as a PDF. You can also download all the identified groups (clusters) in a .json file. Additionally, the tool provides visual summaries of the results, helping experts dive deeper into the identified deviations. An example can be found below. {{< tooltip tooltip_content="The figure below shows that cluster 0, the cluster with systematically deviating bias variable values, includes a higher-than-average proportion of African-American and a lower-than-average proportion of Caucasian people. For other demographic groups, cluster 0 reflects an average distribution. Additional details about this example are available in the demo dataset." >}}
The tool works with data in a table format, consisting solely of numbers or categories. You need to pick one column in the data to use as the `bias variable`. This column should have numbers only, and you’ll specify whether a higher or lower number is better. For example, if you’re looking at error rates, lower numbers are better. For accuracy, higher numbers are better. The tool also comes with a demo dataset you can use by clicking "Demo dataset".
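To make this concrete, here is a minimal sketch of such a tabular dataset in plain Python. The column names and values are hypothetical, not taken from the tool's demo data; the numerical `error_rate` column plays the role of the bias variable, for which lower values are better.

```python
# Toy tabular dataset: each row is a data point, columns are features.
# One numerical column is designated as the bias variable; here we use
# a hypothetical "error_rate" column, for which lower values are better.
rows = [
    {"age_group": "18-30", "region": "north", "error_rate": 0.12},
    {"age_group": "18-30", "region": "south", "error_rate": 0.35},
    {"age_group": "31-50", "region": "north", "error_rate": 0.10},
    {"age_group": "51-70", "region": "south", "error_rate": 0.40},
]

bias_variable = "error_rate"
higher_is_better = False  # for error rates, lower values are preferred

# The bias variable column must be numerical; the remaining feature
# columns should share one data type (here: all categorical).
values = [row[bias_variable] for row in rows]
assert all(isinstance(v, (int, float)) for v in values)
```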
<div>
<p><u>Example of numerical dataset</u>:</p>
The unsupervised bias detection tool performs a series of steps:
<span style="color:#005AA7">Step 1. Data:</span> the user should prepare the following aspects relating to the processed data:
- <span style="color:#005AA7">Dataset:</span> The data must be provided in a tabular format. Any missing values should be removed or replaced.
- <span style="color:#005AA7">Type of data:</span> All columns, except the bias variable column, should have uniform data types, e.g., either all numerical or all categorical. The user selects whether numerical or categorical data are processed.
- <span style="color:#005AA7">Bias variable:</span> A column should be selected from the dataset to serve as the `bias variable`, which needs to be numerical. In step 4, clustering will be performed based on these numerical values. Examples include metrics such as "being classified as high risk", "error rate" or "selected for an investigation".
<span style="color:#005AA7">Step 2. Parameters:</span> the user should set the following hyperparameters:
- <span style="color:#005AA7">Iterations:</span> How often the data are allowed to be split into smaller clusters; by default, 10 iterations are selected.
- <span style="color:#005AA7">Minimal cluster size:</span> The minimum number of datapoints an identified cluster must contain, by default set to 1% of the number of rows in the attached dataset. More guidance on a well-informed choice of the minimal cluster size can be found in section 3.3 of our [scientific paper](/technical-tools/bdt/#scientific-paper).
- <span style="color:#005AA7">Bias variable interpretation:</span> How the bias variable should be interpreted. For instance, when error rate or misclassifications are chosen as the bias variable, a lower value is preferred, as the goal is to minimize errors. Conversely, when accuracy or precision is selected as the bias variable, a higher value is preferred, reflecting the aim to maximize performance.
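The user-supplied settings from steps 1 and 2 could be collected as follows. This is only a sketch with hypothetical names and a made-up dataset size; the actual tool exposes these choices through its web interface.

```python
n_rows = 10_000  # hypothetical number of rows in the attached dataset

hyperparameters = {
    # how often the data are allowed to be split into smaller clusters
    "iterations": 10,
    # minimal cluster size: by default 1% of the number of rows
    "min_cluster_size": max(1, n_rows // 100),
    # bias variable interpretation: False when lower values are preferred
    # (e.g. error rate), True when higher values are preferred (e.g. accuracy)
    "higher_is_better": False,
}
```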
##### Performed by the tool:
<span style="color:#005AA7">Step 3. Train-test data:</span> The dataset is divided into train and test subsets, following an 80-20 ratio.
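Such an 80-20 split can be sketched in a few lines of NumPy (a generic illustration with synthetic data, not the tool's internal code):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical dataset: 200 rows, 3 numerical feature columns.
data = rng.normal(size=(200, 3))

# Shuffle row indices, then split 80-20 into train and test subsets.
indices = rng.permutation(len(data))
split = int(0.8 * len(data))
train, test = data[indices[:split]], data[indices[split:]]
```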
<span style="color:#005AA7">Step 4. Hierarchical Bias-Aware Clustering (HBAC):</span> The HBAC algorithm (detailed below) is applied to the train dataset. The centroids of the resulting clusters are saved and later used to assign cluster labels to data points in the test dataset.
<span style="color:#005AA7">Step 5. Testing cluster differences wrt. bias variable:</span> Statistical hypothesis testing is performed to evaluate whether the bias variable differs significantly in the most deviating cluster compared to the rest of the dataset. A t-test is used to compare the means of the bias variable.
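Step 5 can be sketched with SciPy's independent-samples t-test; the cluster values below are synthetic and for illustration only:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

# Bias variable values (e.g. error rates) for the most deviating cluster
# versus the rest of the test dataset (synthetic numbers).
deviating_cluster = rng.normal(loc=0.35, scale=0.05, size=50)
rest_of_data = rng.normal(loc=0.10, scale=0.05, size=450)

# t-test comparing the means of the bias variable in the two groups.
t_stat, p_value = stats.ttest_ind(deviating_cluster, rest_of_data)
significant = p_value < 0.05
```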
<span style="color:#005AA7">Step 6. Testing cluster differences wrt. features:</span> If a statistically significant difference in the bias variable between the most deviating cluster and the rest of the dataset occurs, feature differences are examined. For this, statistical hypothesis testing is also used: a t-test when numerical data are processed and Pearson’s 𝜒2-test when categorical data are processed. For multiple hypothesis testing, a Bonferroni correction should be applied. Further details can be found in section 3.4 of our [scientific paper](/technical-tools/bdt/#scientific-paper).
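For a categorical feature, Step 6 can be sketched with Pearson's 𝜒2-test and a Bonferroni correction. The contingency table and the number of tested features are made up for illustration:

```python
from scipy import stats

# Hypothetical contingency table for one categorical feature:
# rows = {most deviating cluster, rest of dataset},
# columns = counts per feature category.
table = [[30, 10, 10],
         [90, 180, 180]]

chi2, p_value, dof, expected = stats.chi2_contingency(table)

# Bonferroni correction: with m features tested, each p-value is
# compared against alpha / m instead of alpha.
m_features = 5  # hypothetical number of tested features
alpha = 0.05
significant = p_value < alpha / m_features
```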
A schematic overview of the above steps is depicted below.
</div>
#### How does the clustering algorithm work?
The *Hierarchical Bias-Aware Clustering* (HBAC) algorithm identifies clusters in the provided dataset based on a user-defined `bias variable`. The objective is to find clusters with low variation in the bias variable within each cluster, while variation in the bias variable between clusters should be high. HBAC iteratively finds clusters in the data using k-means (for numerical data) or k-modes clustering (for categorical data). For the initial split, HBAC takes the full dataset and splits it in two clusters. Cluster `C` – with the highest standard deviation of the bias variable – is selected. Then, cluster `C` is divided into two candidate clusters `C'` and `C''`. If the average bias variable value in either candidate cluster exceeds the average bias variable value in `C`, the candidate cluster with the highest bias variable value is selected as a new cluster. This process repeats until the maximum number of iterations (`max_iterations`) is reached or the resulting cluster fails to meet the minimum size requirement (`n_min`). The pseudo-code of the HBAC algorithm is provided below.
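As a complement to the pseudo-code, the iterative splitting procedure can be sketched in plain Python. This is a simplified illustration, not the tool's actual implementation: it uses a naive 2-means step, a synthetic dataset, and assumes a higher bias variable value is worse.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def two_means(X, n_steps=20):
    """Naive 2-means: returns a boolean mask splitting X into two clusters."""
    centers = X[rng.choice(len(X), size=2, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_steps):
        # assign each point to its nearest center, then recompute centers
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in (0, 1):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(axis=0)
    return labels == 1

# Synthetic data: features X and a bias variable y (higher = worse),
# with a hidden group whose outcomes deviate.
X = rng.normal(size=(400, 2))
y = rng.normal(loc=0.1, scale=0.05, size=400)
y[X[:, 0] > 1.0] += 0.3

max_iterations, n_min = 10, 20
clusters = [np.arange(len(X))]  # start with the full dataset
for _ in range(max_iterations):
    # select cluster C with the highest std. dev. of the bias variable
    ci = max(range(len(clusters)), key=lambda i: y[clusters[i]].std())
    C = clusters[ci]
    mask = two_means(X[C])
    c1, c2 = C[mask], C[~mask]
    if min(len(c1), len(c2)) < n_min:
        break  # candidate cluster below the minimum size requirement
    # accept the split only if the worse candidate's average bias
    # variable value exceeds that of its parent cluster C
    worst = c1 if y[c1].mean() > y[c2].mean() else c2
    if y[worst].mean() <= y[C].mean():
        break
    clusters[ci:ci + 1] = [c1, c2]

most_deviating = max(clusters, key=lambda idx: y[idx].mean())
```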
The HBAC-algorithm was introduced by Misztal-Radecka and Indurkhya in a [scientific article](https://www.sciencedirect.com/science/article/abs/pii/S0306457321000285) published in *Information Processing and Management* in 2021. Our implementation of the HBAC-algorithm advances this work by adding methodological checks to distinguish real signals from noise, such as sample splitting, statistical hypothesis testing and measuring cluster stability. Algorithm Audit's implementation of the algorithm can be found in the <a href="https://github.com/NGO-Algorithm-Audit/unsupervised-bias-detection/blob/master/README.md" target="_blank">unsupervised-bias-detection</a> pip package.
#### How should the results of the tool be interpreted?
The HBAC algorithm maximizes the difference in the bias variable between clusters. To prevent the incorrect conclusion that there are unwanted deviations in the decision-making process under review when there truly are none, we split the dataset into training and test data; statistical hypothesis testing then prevents us from (wrongly) concluding that there is a difference in the bias variable when there is none. If a statistically significant deviation is detected, the outcome of the tool serves as a starting point for human experts to assess the identified deviations in the decision-making process.