Commit ccb47a8

Merge pull request #109 from NGO-Algorithm-Audit/feature/ubdt-tweaks-27jun2025
Feature/ubdt tweaks 27jun2025
2 parents 19cea34 + 615d3a3 commit ccb47a8

5 files changed

+223
-194
lines changed

notebooks/unsupervised bias detection tool/COMPAS_FP.ipynb

Lines changed: 53 additions & 14 deletions
@@ -466,7 +466,7 @@
  "source": [
  "**UI text #1**\n",
  "\n",
- "In this example, we analyze which group is most adversely affected by the risk prediction algorithm. We do this by applying the clustering algorithm on the dataset previewed below. The column `is_recid` indicates whether a defendant reoffended or not (1: yes, 0: no). The `score_text` column indicates whether a defendant was predicted to reoffend (1: yes, 0: no). The column `false_positive` (FP) represents cases where a defendant was predicted to reoffend by the algorithm, but didn't do so (1: FP, 0: no FP). A preview of the data can be found below. The column `false_positive` is used as the `bias score`.\n",
+ "In this example, we analyze which group is most adversely affected by the risk prediction algorithm. We do this by applying the clustering algorithm on the dataset previewed below. The column `is_recid` indicates whether a defendant reoffended or not (1: yes, 0: no). The `score_text` column indicates whether a defendant was predicted to reoffend (1: yes, 0: no). The column `false_positive` (FP) represents cases where a defendant was predicted to reoffend by the algorithm, but didn't do so (1: FP, 0: no FP). A preview of the data can be found below. The column `false_positive` is used as the `bias variable`.\n",
  "\n",
  "**1. Preview of data**\n",
  "\n",
@@ -591,15 +591,24 @@
  "\n",
  "# Create a false_positive column\n",
  "filtered_df[\"false_positive\"] = ((filtered_df[\"is_recid\"] == 0) & (filtered_df[\"score_text\"] == 1))\n",
+ "\n",
+ "# Assign bias variable\n",
  "bias_variable = \"false_positive\"\n",
  "\n",
  "# Display the updated dataframe\n",
  "filtered_df.head()"
  ]
  },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Encode to original format (only for categorical data)"
+ ]
+ },
  {
  "cell_type": "code",
- "execution_count": null,
+ "execution_count": 5,
  "metadata": {},
  "outputs": [
  {
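
The `false_positive` definition in the hunk above can be sketched in plain Python, without pandas. This is a minimal illustration only: it assumes each record is a dict with the `is_recid` and `score_text` keys described in UI text #1, and the helper name `add_false_positive` and the sample rows are hypothetical. The notebook expression yields booleans, whereas this sketch casts to 1/0 to match the "1: FP, 0: no FP" wording.

```python
def add_false_positive(rows):
    """Return copies of the rows with a 0/1 false_positive flag added.

    A false positive is a defendant who was predicted to reoffend
    (score_text == 1) but did not reoffend (is_recid == 0).
    """
    flagged = []
    for row in rows:
        is_fp = row["is_recid"] == 0 and row["score_text"] == 1
        flagged.append({**row, "false_positive": int(is_fp)})
    return flagged


# Hypothetical sample records covering all four outcome combinations.
sample_rows = [
    {"is_recid": 0, "score_text": 1},  # predicted to reoffend, did not: FP
    {"is_recid": 1, "score_text": 1},  # true positive
    {"is_recid": 0, "score_text": 0},  # true negative
    {"is_recid": 1, "score_text": 0},  # false negative
]

flagged_rows = add_false_positive(sample_rows)
bias_variable = "false_positive"
fp_values = [row[bias_variable] for row in flagged_rows]
print(fp_values)
```

Only the first record is flagged, since it is the only one with a positive prediction and no reoffence.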
@@ -1391,6 +1400,13 @@
  "cluster_label_X_test"
  ]
  },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Decode to original format (only for categorical data)"
+ ]
+ },
  {
  "cell_type": "code",
  "execution_count": 14,
@@ -1700,7 +1716,10 @@
  }
  ],
  "source": [
+ "# attach bias variable again to the decoded test set (only for categorical data)\n",
  "decoded_X_test[bias_variable] = y_test.values\n",
+ "\n",
+ "# attach predicted cluster label to the decoded test set\n",
  "decoded_X_test[\"cluster_label\"] = cluster_label_X_test\n",
  "decoded_X_test.head()"
  ]
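
The significance check whose output appears in the next hunk (83/1169 in the most biased cluster versus 4/274 in the rest of the test set) can be sketched with the standard library alone. This is a hedged illustration, not the notebook's code: the notebook calls `proportions_ztest(counts, nobs, alternative='larger')` from statsmodels, and the sketch assumes that call uses a pooled proportion for the standard error, which is my understanding of its default. The function name `one_sided_proportions_ztest` is hypothetical.

```python
import math


def one_sided_proportions_ztest(count_a, n_a, count_b, n_b):
    """Two-sample z-test for proportions, H1: p_a > p_b (pooled variance)."""
    p_a = count_a / n_a
    p_b = count_b / n_b
    # Pooled proportion under H0 (both groups share one rate).
    pooled = (count_a + count_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Counts taken from the recorded notebook output.
z_stat, p_val = one_sided_proportions_ztest(83, 1169, 4, 274)
print(f"Z-statistic: {z_stat:.4f}")
print(f"P-value: {p_val:.4f}")
```

Under the pooled-variance assumption, the printed values should agree with the Z-statistic and P-value recorded in the notebook output.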
@@ -1769,8 +1788,8 @@
  "output_type": "stream",
  "text": [
  "The label indicating the most disavanteagous bias: -1\n",
- "Most biased cluster: 83/1169 (0.071)\n",
- "Rest of dataset: 4/274 (0.015)\n",
+ "Most biased cluster in test set: 83/1169 (0.071)\n",
+ "Rest of test set: 4/274 (0.015)\n",
  "Z-statistic: 3.5304\n",
  "P-value: 0.0002\n",
  "The bias variable occurs statistically significant more often than in the rest of the dataset.\n"
@@ -1794,8 +1813,8 @@
  "z_stat, p_val = proportions_ztest(counts, nobs, alternative='larger')\n",
  "\n",
  "print(f\"The label indicating the most disavanteagous bias: {most_biased_cluster_label}\")\n",
- "print(f\"Most biased cluster: {most_biased_count}/{most_biased_total} ({most_biased_count/most_biased_total:.3f})\")\n",
- "print(f\"Rest of dataset: {rest_count}/{rest_total} ({rest_count/rest_total:.3f})\")\n",
+ "print(f\"Most biased cluster in test set: {most_biased_count}/{most_biased_total} ({most_biased_count/most_biased_total:.3f})\")\n",
+ "print(f\"Rest of test set: {rest_count}/{rest_total} ({rest_count/rest_total:.3f})\")\n",
  "print(f\"Z-statistic: {z_stat:.4f}\")\n",
  "print(f\"P-value: {p_val:.4f}\")\n",
  "\n",
@@ -1826,7 +1845,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "**Accordion 1**\n",
+ "**Accordion 'Features per cluster'**\n",
  "\n",
  "[if p<0.05]\n",
  "\n",
@@ -1853,17 +1872,18 @@
  ],
  "source": [
  "# Group by cluster_label and count the occurrences\n",
- "cluster_counts = decoded_X_test[\"cluster_label\"].value_counts()\n",
+ "df = decoded_X_test\n",
+ "cluster_counts = df[\"cluster_label\"].value_counts()\n",
  "\n",
  "# Create subplots for each column\n",
- "columns_to_analyze = decoded_X_test.columns.drop(['cluster_label', bias_variable]) # exclude cluster_label and bias variable\n",
+ "columns_to_analyze = df.columns.drop(['cluster_label', bias_variable]) # exclude cluster_label and bias variable\n",
  "rows = (len(columns_to_analyze) + 2) // 3 # Calculate the number of rows needed\n",
  "fig, axes = plt.subplots(rows, min(len(columns_to_analyze), 3), figsize=(15, 3 * rows), squeeze=False)\n",
  "axes = axes.flatten() # Flatten the axes array for easier indexing\n",
  "\n",
  "for i, column in enumerate(columns_to_analyze):\n",
  " # Group by cluster_label and the column, then calculate percentages\n",
- " grouped_data = decoded_X_test.groupby([\"cluster_label\", column]).size().unstack(fill_value=0)\n",
+ " grouped_data = df.groupby([\"cluster_label\", column]).size().unstack(fill_value=0)\n",
  " percentages = grouped_data.div(grouped_data.sum(axis=1), axis=0) * 100\n",
  " \n",
  " # Plot the percentage data without legend\n",
@@ -1873,7 +1893,7 @@
  " axes[i].set_xticklabels(percentages.T.index, rotation=0)\n",
  " \n",
  " # Calculate and plot the average percentage in the entire dataset for each category value\n",
- " overall_counts = decoded_X_test[column].value_counts(normalize=True) * 100\n",
+ " overall_counts = df[column].value_counts(normalize=True) * 100\n",
  " for cat_value, avg_pct in overall_counts.items():\n",
  " # Find the x position for this category value\n",
  " try:\n",
@@ -1908,7 +1928,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": null,
+ "execution_count": 20,
  "metadata": {},
  "outputs": [],
  "source": [
@@ -1979,9 +1999,28 @@
  " print(f\"{var[0]}: '{var[1]}' doesn't occur statistically significant more or less often than in the rest of the dataset.\\033[0m\")"
  ]
  },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Accordion 'Statistical significant difference wrt. cluster features'**\n",
+ "\n",
+ "[if p<0.05]\n",
+ "\n",
+ "**UI text #8**\n",
+ "\n",
+ "The following statistical test is conducted for each feature:\n",
+ "\n",
+ "$H_0$: feature doesn't occur more often in most deviating cluster compared to the rest of the dataset\n",
+ "\n",
+ "$H_1$: feature does occur more often in most deviating cluster compared to the rest of the dataset\n",
+ "\n",
+ "For categorical data a two-sided chi-squared test is used, while for numerical data a two-sided t-test is used. To account for multiple hypothesis testing, Bonferroni correction is applied."
+ ]
+ },
  {
  "cell_type": "code",
- "execution_count": 22,
+ "execution_count": 21,
  "metadata": {},
  "outputs": [
  {
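
The Bonferroni correction mentioned in UI text #8 can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's implementation: it assumes one p-value per tested feature and a family-wise significance level of 0.05 (matching the `[if p<0.05]` condition). The function name `bonferroni_significant`, the feature names, and the p-values are hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag which p-values remain significant after Bonferroni correction.

    With m tests, each p-value is compared against alpha / m instead of
    alpha, which bounds the family-wise error rate by alpha.
    """
    if not p_values:
        return {}
    threshold = alpha / len(p_values)
    return {name: p < threshold for name, p in p_values.items()}


# Hypothetical per-feature p-values, for illustration only.
example_p_values = {"feature_a": 0.004, "feature_b": 0.03, "feature_c": 0.2}
significant = bonferroni_significant(example_p_values)
print(significant)
```

With three tests the per-test threshold is 0.05 / 3, about 0.0167, so `feature_b` (p = 0.03) would pass an uncorrected 0.05 cut-off but is no longer flagged after correction.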
@@ -2023,7 +2062,7 @@
  "\n",
  "**UI text #9**\n",
  "\n",
- "**7. Bias report**\n",
+ "**7. Conclusion and bias report**\n",
  "\n",
  "[Download]"
  ]
