|
466 | 466 | "source": [ |
467 | 467 | "**UI text #1**\n", |
468 | 468 | "\n", |
469 | | - "In this example, we analyze which group is most adversely affected by the risk prediction algorithm. We do this by applying the clustering algorithm on the dataset previewed below. The column `is_recid` indicates whether a defendant reoffended or not (1: yes, 0: no). The `score_text` column indicates whether a defendant was predicted to reoffend (1: yes, 0: no). The column `false_positive` (FP) represents cases where a defendant was predicted to reoffended by the algorithm, but didn't do so (1: FP, 0: no FP). A preview of the data can be found below. The column `false_positive` is used as the `bias score`.\n", |
| 469 | + "In this example, we analyze which group is most adversely affected by the risk prediction algorithm. We do this by applying the clustering algorithm on the dataset previewed below. The column `is_recid` indicates whether a defendant reoffended or not (1: yes, 0: no). The `score_text` column indicates whether a defendant was predicted to reoffend (1: yes, 0: no). The column `false_positive` (FP) represents cases where a defendant was predicted to reoffend by the algorithm, but didn't do so (1: FP, 0: no FP). A preview of the data can be found below. The column `false_positive` is used as the `bias variable`.\n", |
470 | 470 | "\n", |
471 | 471 | "**1. Preview of data**\n", |
472 | 472 | "\n", |
|
591 | 591 | "\n", |
592 | 592 | "# Create a false_positive column\n", |
593 | 593 | "filtered_df[\"false_positive\"] = ((filtered_df[\"is_recid\"] == 0) & (filtered_df[\"score_text\"] == 1))\n", |
| 594 | + "\n", |
| 595 | + "# Assign bias variable\n", |
594 | 596 | "bias_variable = \"false_positive\"\n", |
595 | 597 | "\n", |
596 | 598 | "# Display the updated dataframe\n", |
597 | 599 | "filtered_df.head()" |
598 | 600 | ] |
599 | 601 | }, |
| 602 | + { |
| 603 | + "cell_type": "markdown", |
| 604 | + "metadata": {}, |
| 605 | + "source": [ |
| 606 | + "Encode to original format (only for categorical data)" |
| 607 | + ] |
| 608 | + }, |
600 | 609 | { |
601 | 610 | "cell_type": "code", |
602 | | - "execution_count": null, |
| 611 | + "execution_count": 5, |
603 | 612 | "metadata": {}, |
604 | 613 | "outputs": [ |
605 | 614 | { |
|
1391 | 1400 | "cluster_label_X_test" |
1392 | 1401 | ] |
1393 | 1402 | }, |
| 1403 | + { |
| 1404 | + "cell_type": "markdown", |
| 1405 | + "metadata": {}, |
| 1406 | + "source": [ |
| 1407 | + "Decode to original format (only for categorical data)" |
| 1408 | + ] |
| 1409 | + }, |
1394 | 1410 | { |
1395 | 1411 | "cell_type": "code", |
1396 | 1412 | "execution_count": 14, |
|
1700 | 1716 | } |
1701 | 1717 | ], |
1702 | 1718 | "source": [ |
| 1719 | + "# attach bias variable again to the decoded test set (only for categorical data)\n", |
1703 | 1720 | "decoded_X_test[bias_variable] = y_test.values\n", |
| 1721 | + "\n", |
| 1722 | + "# attach predicted cluster label to the decoded test set\n", |
1704 | 1723 | "decoded_X_test[\"cluster_label\"] = cluster_label_X_test\n", |
1705 | 1724 | "decoded_X_test.head()" |
1706 | 1725 | ] |
|
1769 | 1788 | "output_type": "stream", |
1770 | 1789 | "text": [ |
1771 | 1790 | "The label indicating the most disadvantageous bias: -1\n", |
1772 | | - "Most biased cluster: 83/1169 (0.071)\n", |
1773 | | - "Rest of dataset: 4/274 (0.015)\n", |
| 1791 | + "Most biased cluster in test set: 83/1169 (0.071)\n", |
| 1792 | + "Rest of test set: 4/274 (0.015)\n", |
1774 | 1793 | "Z-statistic: 3.5304\n", |
1775 | 1794 | "P-value: 0.0002\n", |
1776 | 1795 | "The bias variable occurs statistically significantly more often than in the rest of the dataset.\n" |
|
1794 | 1813 | "z_stat, p_val = proportions_ztest(counts, nobs, alternative='larger')\n", |
1795 | 1814 | "\n", |
1796 | 1815 | "print(f\"The label indicating the most disadvantageous bias: {most_biased_cluster_label}\")\n", |
1797 | | - "print(f\"Most biased cluster: {most_biased_count}/{most_biased_total} ({most_biased_count/most_biased_total:.3f})\")\n", |
1798 | | - "print(f\"Rest of dataset: {rest_count}/{rest_total} ({rest_count/rest_total:.3f})\")\n", |
| 1816 | + "print(f\"Most biased cluster in test set: {most_biased_count}/{most_biased_total} ({most_biased_count/most_biased_total:.3f})\")\n", |
| 1817 | + "print(f\"Rest of test set: {rest_count}/{rest_total} ({rest_count/rest_total:.3f})\")\n", |
1799 | 1818 | "print(f\"Z-statistic: {z_stat:.4f}\")\n", |
1800 | 1819 | "print(f\"P-value: {p_val:.4f}\")\n", |
1801 | 1820 | "\n", |
|
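Review note: the `proportions_ztest(counts, nobs, alternative='larger')` call in this hunk compares the false-positive rate of the most biased cluster with that of the rest of the test set. The sketch below spells out that one-sided, pooled two-proportion z-test using only the standard library. The function name `two_proportion_ztest` is hypothetical, and the counts are copied from the recorded cell output (83/1169 vs. 4/274); it is an illustration of the technique, not the notebook's implementation.

```python
import math


def two_proportion_ztest(count_a, n_a, count_b, n_b):
    """One-sided pooled z-test; H1: the proportion in group A is larger than in group B."""
    p_a = count_a / n_a
    p_b = count_b / n_b
    # Pooled proportion under H0 (both groups share the same rate)
    pooled = (count_a + count_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Upper-tail probability of the standard normal distribution
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Counts taken from the recorded output: most biased cluster vs. rest of test set
z_stat, p_val = two_proportion_ztest(83, 1169, 4, 274)
print(f"Z-statistic: {z_stat:.4f}")
print(f"P-value: {p_val:.4f}")
```

With these counts the result should come out close to the Z-statistic and P-value recorded in the cell output, assuming the library call uses the pooled variance estimate (its default).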
1826 | 1845 | "cell_type": "markdown", |
1827 | 1846 | "metadata": {}, |
1828 | 1847 | "source": [ |
1829 | | - "**Accordion 1**\n", |
| 1848 | + "**Accordion 'Features per cluster'**\n", |
1830 | 1849 | "\n", |
1831 | 1850 | "[if p<0.05]\n", |
1832 | 1851 | "\n", |
|
1853 | 1872 | ], |
1854 | 1873 | "source": [ |
1855 | 1874 | "# Group by cluster_label and count the occurrences\n", |
1856 | | - "cluster_counts = decoded_X_test[\"cluster_label\"].value_counts()\n", |
| 1875 | + "df = decoded_X_test\n", |
| 1876 | + "cluster_counts = df[\"cluster_label\"].value_counts()\n", |
1857 | 1877 | "\n", |
1858 | 1878 | "# Create subplots for each column\n", |
1859 | | - "columns_to_analyze = decoded_X_test.columns.drop(['cluster_label', bias_variable]) # exclude cluster_label and bias variable\n", |
| 1879 | + "columns_to_analyze = df.columns.drop(['cluster_label', bias_variable]) # exclude cluster_label and bias variable\n", |
1860 | 1880 | "rows = (len(columns_to_analyze) + 2) // 3 # Calculate the number of rows needed\n", |
1861 | 1881 | "fig, axes = plt.subplots(rows, min(len(columns_to_analyze), 3), figsize=(15, 3 * rows), squeeze=False)\n", |
1862 | 1882 | "axes = axes.flatten() # Flatten the axes array for easier indexing\n", |
1863 | 1883 | "\n", |
1864 | 1884 | "for i, column in enumerate(columns_to_analyze):\n", |
1865 | 1885 | " # Group by cluster_label and the column, then calculate percentages\n", |
1866 | | - " grouped_data = decoded_X_test.groupby([\"cluster_label\", column]).size().unstack(fill_value=0)\n", |
| 1886 | + " grouped_data = df.groupby([\"cluster_label\", column]).size().unstack(fill_value=0)\n", |
1867 | 1887 | " percentages = grouped_data.div(grouped_data.sum(axis=1), axis=0) * 100\n", |
1868 | 1888 | " \n", |
1869 | 1889 | " # Plot the percentage data without legend\n", |
|
1873 | 1893 | " axes[i].set_xticklabels(percentages.T.index, rotation=0)\n", |
1874 | 1894 | " \n", |
1875 | 1895 | " # Calculate and plot the average percentage in the entire dataset for each category value\n", |
1876 | | - " overall_counts = decoded_X_test[column].value_counts(normalize=True) * 100\n", |
| 1896 | + " overall_counts = df[column].value_counts(normalize=True) * 100\n", |
1877 | 1897 | " for cat_value, avg_pct in overall_counts.items():\n", |
1878 | 1898 | " # Find the x position for this category value\n", |
1879 | 1899 | " try:\n", |
|
1908 | 1928 | }, |
1909 | 1929 | { |
1910 | 1930 | "cell_type": "code", |
1911 | | - "execution_count": null, |
| 1931 | + "execution_count": 20, |
1912 | 1932 | "metadata": {}, |
1913 | 1933 | "outputs": [], |
1914 | 1934 | "source": [ |
|
1979 | 1999 | " print(f\"{var[0]}: '{var[1]}' doesn't occur statistically significantly more or less often than in the rest of the dataset.\\033[0m\")" |
1980 | 2000 | ] |
1981 | 2001 | }, |
| 2002 | + { |
| 2003 | + "cell_type": "markdown", |
| 2004 | + "metadata": {}, |
| 2005 | + "source": [ |
| 2006 | + "**Accordion 'Statistically significant differences w.r.t. cluster features'**\n", |
| 2007 | + "\n", |
| 2008 | + "[if p<0.05]\n", |
| 2009 | + "\n", |
| 2010 | + "**UI text #8**\n", |
| 2011 | + "\n", |
| 2012 | + "The following statistical test is conducted for each feature:\n", |
| 2013 | + "\n", |
| 2014 | + "$H_0$: feature doesn't occur more or less often in the most deviating cluster compared to the rest of the dataset\n", |
| 2015 | + "\n", |
| 2016 | + "$H_1$: feature does occur more or less often in the most deviating cluster compared to the rest of the dataset\n", |
| 2017 | + "\n", |
| 2018 | + "For categorical data a two-sided chi-squared test is used; for numerical data, a two-sided t-test. To account for multiple hypothesis testing, Bonferroni correction is applied." |
| 2019 | + ] |
| 2020 | + }, |
1982 | 2021 | { |
1983 | 2022 | "cell_type": "code", |
1984 | | - "execution_count": 22, |
| 2023 | + "execution_count": 21, |
1985 | 2024 | "metadata": {}, |
1986 | 2025 | "outputs": [ |
1987 | 2026 | { |
|
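Review note: the new UI text says a Bonferroni correction is applied across the per-feature tests. As a minimal sketch of what that correction does, each raw p-value is compared against `alpha / m`, where `m` is the number of tests. The function name `bonferroni_significant`, the feature names, and the p-values below are all hypothetical and not taken from the notebook.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each test as significant at the Bonferroni-adjusted level alpha / m."""
    threshold = alpha / len(p_values)  # m = number of hypotheses tested
    return {name: p < threshold for name, p in p_values.items()}


# Hypothetical raw p-values, one per feature test (illustrative only)
raw_p_values = {"age_cat": 0.001, "sex": 0.02, "race": 0.2}
flags = bonferroni_significant(raw_p_values)
print(flags)
```

In this made-up example the adjusted threshold is 0.05 / 3, so a raw p-value of 0.02 that would pass an uncorrected 0.05 cutoff is no longer flagged.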
2023 | 2062 | "\n", |
2024 | 2063 | "**UI text #9**\n", |
2025 | 2064 | "\n", |
2026 | | - "**7. Bias report**\n", |
| 2065 | + "**7. Conclusion and bias report**\n", |
2027 | 2066 | "\n", |
2028 | 2067 | "[Download]" |
2029 | 2068 | ] |
|