Update CART_LawSchoolAdmissionBar.ipynb

jfparie · jfparie · commit 1ef5dc29f724 · 2025-03-06T10:50:44.000+01:00
diff --git a/notebooks/CART_LawSchoolAdmissionBar.ipynb b/notebooks/CART_LawSchoolAdmissionBar.ipynb
@@ -331,15 +331,10 @@
     "\n",
     "A subset of the [Law School Admission Bar*](https://www.kaggle.com/datasets/danofer/law-school-admissions-bar-passage) dataset is used as a demo. Synthetic data will be generated for the following columns: \n",
     "\n",
-    "- sex: student gender, i.e. 1 (male), 2 (female)\n",
-    "- race1: race, i.e. asian, black, hispanic, white, other\n",
-    "- ugpa: The student's undergraduate GPA, continous variable;\n",
-    "- bar: Ground truth label indicating whether or not the student passed the bar, i.e. passed 1st time, passed 2nd time, failed, non-graduated\n",
+    "[table]\n",
     "\n",
     "The CART method is used to generate the synthetic data. CART generally produces higher quality synthetic datasets, but might not run on datasets with categorical variables with 20+ categories. Use Gaussian Copula in those cases.\n",
     "\n",
-    "_info box:_ The CART (Classification and Regression Trees) method generates synthetic data by learning patterns from real data through decision trees that splits data into homogeneous groups based on feature values one variable at a time. It uses these groups to generate plausible synthetic values for that variable.\n",
-    "\n",
     "*The original paper can be found [here](https://files.eric.ed.gov/fulltext/ED469370.pdf)."
    ]
   },
@@ -443,7 +438,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": 14,
    "metadata": {},
    "outputs": [
     {
@@ -459,6 +454,7 @@
     }
    ],
    "source": [
+    "# count missing values per column\n",
     "print(df.isnull().sum())"
    ]
   },
@@ -531,28 +527,29 @@
    "source": [
     "**UI text #3**\n",
     "\n",
-    "You specified to {remove/impute} missing data. Based on the type of missing data, we recommend to either impute of remove. See the info box for more information.\n",
+    "For Missing At Random (MAR) and Missing Not At Random (MNAR) data, we recommend imputing the missing data. For Missing Completely At Random (MCAR) data, we recommend removing the missing data. See the info box for more information.\n",
     "\n",
     "_info box:_\n",
     "\n",
     "MCAR, MAR, and MNAR are terms used to describe different mechanisms of missing data:\n",
     "\n",
     "1. **MCAR (Missing Completely At Random)**:\n",
-    "- The probability of data being missing is completely independent of both observed and unobserved data.\n",
+    "- The probability of data being missing is completely independent of both observed and unobserved data. \n",
     "- There is no systematic pattern to the missingness.\n",
     "- Example: A survey respondent accidentally skips a question due to a printing error.\n",
+    "- Recommendation: remove missing data.\n",
     "\n",
     "2. **MAR (Missing At Random)**:\n",
     "- The probability of data being missing is related to the observed data but not the missing data itself.\n",
     "- The missingness can be predicted by other variables in the dataset.\n",
     "- Example: Students' test scores are missing, but the missingness is related to their attendance records.\n",
+    "- Recommendation: impute missing data.\n",
     "\n",
     "3. **MNAR (Missing Not At Random)**:\n",
-    "- The probability of data being missing is related to the missing data itself.\n",
+    "- The probability of data being missing is related to the missing data itself. \n",
     "- There is a systematic pattern to the missingness that is related to the unobserved data.\n",
     "- Example: Patients with more severe symptoms are less likely to report their symptoms, leading to missing data that is related to the severity of the symptoms.\n",
-    "\n",
-    "For MAR and MNAR, synthetic data generation using imputation is recommended. For MNAR, synthetic data generation with removing the missing data is recommended."
+    "- Recommendation: impute missing data."
    ]
   },
   {
@@ -595,12 +592,10 @@
     "\n",
     "1. Validates the input data;\n",
     "2. Stores the original column order;\n",
-    "3. Calls the _preprocess method to handle encoding and scaling:\n",
+    "3. Encodes and scales the data:\n",
     "* Encodes categorical columns using LabelEncoder or OneHotEncoder;\n",
     "* Scales numerical columns using StandardScaler;\n",
-    "* Converts boolean columns to integers.\n",
-    "\n",
-    "The processed data is then returned."
+    "* Converts boolean columns to integers."
    ]
   },
   {