|
35 | 35 | "outputs": [], |
36 | 36 | "source": [ |
37 | 37 | "from skimpy import skim\n", |
38 | | - "from pandas_profiling import ProfileReport\n", |
39 | 38 | "import pandas as pd\n", |
40 | 39 | "from pandas.api.types import CategoricalDtype\n", |
41 | 40 | "from lets_plot import *\n", |
|
1081 | 1080 | "skim(taxis)" |
1082 | 1081 | ] |
1083 | 1082 | }, |
1084 | | - { |
1085 | | - "cell_type": "markdown", |
1086 | | - "id": "0a1fc099", |
1087 | | - "metadata": {}, |
1088 | | - "source": [ |
1089 | | - "### The **pandas-profiling** package\n", |
1090 | | - "\n", |
1091 | | - "The EDA we did using the built-in **pandas** functions was a bit limited and user-input heavy. The [**pandas-profiling**](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/) library aims to automate the legwork of EDA for you. It generates 'profile' reports from a pandas DataFrame. For each column, many statistics are computed and then relayed in an interactive HTML report. To install it, run `pip install pandas-profiling` in the terminal.\n", |
1092 | | - "\n", |
1093 | | - "Let's generate a report on our dataset. If you are using a large dataset, you may wish to employ the `minimal=True` setting that cuts out a lot of computationally expensive extras:" |
1094 | | - ] |
1095 | | - }, |
1096 | | - { |
1097 | | - "cell_type": "code", |
1098 | | - "execution_count": null, |
1099 | | - "id": "39739f74", |
1100 | | - "metadata": {}, |
1101 | | - "outputs": [], |
1102 | | - "source": [ |
1103 | | - "profile = ProfileReport(taxis, minimal=True, title=\"Profiling Report: Taxis Dataset\")\n", |
1104 | | - "profile.to_notebook_iframe()" |
1105 | | - ] |
1106 | | - }, |
1107 | | - { |
1108 | | - "cell_type": "markdown", |
1109 | | - "id": "f2494069", |
1110 | | - "metadata": {}, |
1111 | | - "source": [ |
1112 | | - "This is a full on report about everything in our dataset! We can see, for instance, that we have 14 variables and what kind each of them are.\n", |
1113 | | - "\n", |
1114 | | - "The alerts page shows where **pandas-profiling** really shines. It flags *potential* issues with the data that should be taken into account in any subsequent analysis. For example, although not relevant here, the report will say if there are very unbalanced classes in a low cardinality categorical variable.\n", |
1115 | | - "\n", |
1116 | | - "Another good package for automated EDA is [dataprep](https://dataprep.ai/)." |
1117 | | - ] |
1118 | | - }, |
1119 | 1083 | { |
1120 | 1084 | "cell_type": "markdown", |
1121 | 1085 | "id": "f2810c9e", |
|
1156 | 1120 | "name": "python", |
1157 | 1121 | "nbconvert_exporter": "python", |
1158 | 1122 | "pygments_lexer": "ipython3", |
1159 | | - "version": "3.10.12" |
| 1123 | + "version": "3.10.13" |
1160 | 1124 | }, |
1161 | 1125 | "toc-showtags": true |
1162 | 1126 | }, |
|