|
333 | 333 | "\n", |
334 | 334 | "[table]\n", |
335 | 335 | "\n", |
336 | | - "The CART method is used to generate the synthetic data. CART generally produces higher quality synthetic datasets, but might not run on datasets with categorical variables with 20+ categories. Use Gaussian Copula in those cases.\n", |
| 336 | + "The CART method is used to generate the synthetic data. CART generally produces high-quality synthetic data, but might not work well on datasets that contain categorical variables with 20+ categories. Use Gaussian Copula in those cases.\n", |
337 | 337 | "\n", |
338 | 338 | "*The original paper can be found [here](https://files.eric.ed.gov/fulltext/ED469370.pdf)." |
339 | 339 | ] |
|
557 | 557 | "### 1. Data types detection" |
558 | 558 | ] |
559 | 559 | }, |
| 560 | + { |
| 561 | + "cell_type": "markdown", |
| 562 | + "metadata": {}, |
| 563 | + "source": [ |
| 564 | + "**UI text #2**\n", |
| 565 | + "\n", |
| 566 | + "The following data types are detected:\n", |
| 567 | + "\n", |
| 568 | + "[output]\n", |
| 569 | + "\n", |
| 570 | + "If the detected data types are incorrect, please change them locally in the source dataset before attaching it to the web app." |
| 571 | + ] |
| 572 | + }, |
560 | 573 | { |
561 | 574 | "cell_type": "code", |
562 | 575 | "execution_count": 8, |
|
578 | 591 | "print(\"Column Data Types:\", column_dtypes)" |
579 | 592 | ] |
580 | 593 | }, |
581 | | - { |
582 | | - "cell_type": "markdown", |
583 | | - "metadata": {}, |
584 | | - "source": [ |
585 | | - "**UI text #2**\n", |
586 | | - "\n", |
587 | | - "If the detected data types are incorrect, please change this locally in the source dataset before attaching it to the app." |
588 | | - ] |
589 | | - }, |
590 | 594 | { |
591 | 595 | "cell_type": "markdown", |
592 | 596 | "metadata": {}, |
593 | 597 | "source": [ |
594 | 598 | "### 2. Missing data handler" |
595 | 599 | ] |
596 | 600 | }, |
597 | | - { |
598 | | - "cell_type": "code", |
599 | | - "execution_count": 9, |
600 | | - "metadata": {}, |
601 | | - "outputs": [ |
602 | | - { |
603 | | - "name": "stdout", |
604 | | - "output_type": "stream", |
605 | | - "text": [ |
606 | | - "Detected Missingness Type: {'sex': 'MAR', 'race1': 'MAR'}\n" |
607 | | - ] |
608 | | - } |
609 | | - ], |
610 | | - "source": [ |
611 | | - "# Detect missingness\n", |
612 | | - "missingness_dict = md_handler.detect_missingness(df)\n", |
613 | | - "print(\"Detected Missingness Type:\", missingness_dict)" |
614 | | - ] |
615 | | - }, |
616 | 601 | { |
617 | 602 | "cell_type": "markdown", |
618 | 603 | "metadata": {}, |
619 | 604 | "source": [ |
620 | 605 | "**UI text #3**\n", |
621 | 606 | "\n", |
| 607 | + "The following type of missing data is detected:\n", |
| 608 | + "\n", |
| 609 | + "[output]\n", |
| 610 | + "\n", |
622 | 611 | "For Missing At Random (MAR) and Missing Not At Random (MNAR) data, we recommend imputing the missing data. For Missing Completely At Random (MCAR) data, we recommend removing the missing data. See the info box for more information.\n", |
623 | 612 | "\n", |
| 613 | + "In this demo, the missing data is imputed.\n", |
| 614 | + "\n", |
624 | 615 | "_info box:_\n", |
625 | 616 | "\n", |
626 | 617 | "MCAR, MAR, and MNAR are terms used to describe different mechanisms of missing data:\n", |
|
644 | 635 | "- Recommendation: impute missing data." |
645 | 636 | ] |
646 | 637 | }, |
| 638 | + { |
| 639 | + "cell_type": "code", |
| 640 | + "execution_count": 26, |
| 641 | + "metadata": {}, |
| 642 | + "outputs": [ |
| 643 | + { |
| 644 | + "name": "stdout", |
| 645 | + "output_type": "stream", |
| 646 | + "text": [ |
| 647 | + "Detected Missingness Type: {'sex': 'MAR', 'race1': 'MAR'}\n" |
| 648 | + ] |
| 649 | + } |
| 650 | + ], |
| 651 | + "source": [ |
| 652 | + "# Detect missingness\n", |
| 653 | + "missingness_dict = md_handler.detect_missingness(df)\n", |
| 654 | + "print(\"Detected Missingness Type:\", missingness_dict)" |
| 655 | + ] |
| 656 | + }, |
647 | 657 | { |
648 | 658 | "cell_type": "code", |
649 | 659 | "execution_count": 10, |
|
671 | 681 | "cell_type": "markdown", |
672 | 682 | "metadata": {}, |
673 | 683 | "source": [ |
674 | | - "### 3. Pre-processing data" |
675 | | - ] |
676 | | - }, |
677 | | - { |
678 | | - "cell_type": "markdown", |
679 | | - "metadata": {}, |
680 | | - "source": [ |
681 | | - "**UI text #4**\n", |
682 | | - "\n", |
683 | | - "In the next step the data is pre-processed. The dataframe is transformed into numerical space. The following steps are performed:\n", |
684 | | - "\n", |
685 | | - "1. Validates the input data;\n", |
686 | | - "2. Stores the original column order;\n", |
687 | | - "3. Encoding and scaling:\n", |
688 | | - "* Encodes categorical columns using LabelEncoder or OneHotEncoder;\n", |
689 | | - "* Scales numerical columns using StandardScaler;\n", |
690 | | - "* Converts boolean columns to integers." |
| 684 | + "### [no section] Pre-processing data" |
691 | 685 | ] |
692 | 686 | }, |
693 | 687 | { |
|
797 | 791 | "cell_type": "markdown", |
798 | 792 | "metadata": {}, |
799 | 793 | "source": [ |
800 | | - "### 4. Synthetic data generation: {CART/GC}" |
| 794 | + "### 3. Synthesizer: CART" |
801 | 795 | ] |
802 | 796 | }, |
803 | 797 | { |
|
914 | 908 | "cell_type": "markdown", |
915 | 909 | "metadata": {}, |
916 | 910 | "source": [ |
917 | | - "**UI text #5**\n", |
| 911 | + "**UI text #4**\n", |
918 | 912 | "\n", |
919 | 913 | "{n_synth_data} synthetic data points are generated using CART. \n", |
920 | 914 | "\n", |
921 | | - "The CART (Classification and Regression Trees) method generates synthetic data by learning patterns from real data through a decision tree that splits data into homogeneous groups based on feature values. It predicts averages for numerical data and assigns the most common category for categorical data, using these predictions to create new synthetic points. Then, the the synthetic data back to the original format (postprocessing)." |
| 915 | + "The CART (Classification and Regression Trees) method generates synthetic data by learning patterns from real data through a decision tree that splits data into homogeneous groups based on feature values. It predicts averages for numerical data and assigns the most common category for categorical data, using these predictions to create new synthetic points." |
922 | 916 | ] |
923 | 917 | }, |
924 | 918 | { |
925 | 919 | "cell_type": "markdown", |
926 | 920 | "metadata": {}, |
927 | 921 | "source": [ |
928 | | - "### 5. Generated synthetic data" |
| 922 | + "### [no section] Generated synthetic data" |
929 | 923 | ] |
930 | 924 | }, |
931 | 925 | { |
|
1031 | 1025 | "cell_type": "markdown", |
1032 | 1026 | "metadata": {}, |
1033 | 1027 | "source": [ |
1034 | | - "### 6. Evaluation of generated data" |
| 1028 | + "### 4. Evaluation of generated data" |
1035 | 1029 | ] |
1036 | 1030 | }, |
1037 | 1031 | { |
|
1214 | 1208 | "cell_type": "markdown", |
1215 | 1209 | "metadata": {}, |
1216 | 1210 | "source": [ |
1217 | | - "**UI text #6**\n", |
| 1211 | + "**UI text #5**\n", |
1218 | 1212 | "\n", |
1219 | | - "{n_synth_data} synthetic data points are generated using CART. The figures below display the differences in value frequency for each variable. The synthetic data is of high quality when all bars are of equal height." |
| 1213 | + "{n_synth_data} synthetic data points are generated using CART. The figures below display the value frequency for each variable. The synthetic data is of high quality when the frequencies in the real and synthetic data are approximately the same." |
1220 | 1214 | ] |
1221 | 1215 | }, |
1222 | 1216 | { |
|
1294 | 1288 | "source": [ |
1295 | 1289 | "**UI text #6**\n", |
1296 | 1290 | "\n", |
1297 | | - "The report computes the following diagnostic results for each column:\n", |
1298 | | - "- For numerical (or datetime) columns:\n", |
1299 | | - " * *Missing value similarity:* Similarity in the proportion of missing values.\n", |
1300 | | - " * *Range coverage:* Proportion of the real data's range covered by the synthetic data.\n", |
1301 | | - " * *Boundary adherence:* Fraction of synthetic values within the real data's min/max.\n", |
1302 | | - " * *Kolmogorov–Smirnov (KS) complement:* Uses the two-sample Kolmogorov–Smirnov test to compare the distributions of the two continuous columns using the empirical CDF. It returns 1 minus the KS Test D statistic, which indicates the maximum distance between the expected CDF and the observed CDF values.\n", |
1303 | | - " * *Statistic similarity:* Similarity of mean, std, and median.\n", |
1304 | | - "- For categorical (or boolean) columns:\n", |
1305 | | - " * *Missing value similarity:* Similarity in the proportion of missing values.\n", |
1306 | | - " * *Total variation (TV) complement:* Compute the complement of the total variation distance of two discrete columns.\n", |
1307 | | - " * *Category coverage:* Proportion of real categories found in synthetic data.\n", |
1308 | | - " * *Category adherence:* Fraction of synthetic values that are valid real categories.\n", |
| 1291 | + "For each column, diagnostic results are computed for the quality of the generated synthetic data. The computed metrics depend on the type of data. \n", |
1309 | 1292 | "\n", |
| 1293 | + "- For numerical (or datetime) columns the following metrics are computed:\n", |
| 1294 | + " * Missing value similarity *Infobox*: Compares whether the synthetic data has the same proportion of missing values as the real data for a given column;\n", |
| 1295 | + " * Range coverage *Infobox*: Measures whether a synthetic column covers the full range of values that are present in a real column;\n", |
| 1296 | + " * Boundary adherence *Infobox*: Measures whether a synthetic column respects the minimum and maximum values of the real column. It returns the percentage of synthetic rows that adhere to the real boundaries;\n", |
| 1297 | + " * Statistic similarity *Infobox*: Measures the similarity between a real column and a synthetic column by comparing the mean, standard deviation and median;\n", |
| 1298 | + " * Kolmogorov–Smirnov (KS) complement *Infobox*: Computes the similarity of a real and synthetic numerical column in terms of the column shapes, i.e., the marginal distribution or 1D histogram of the column.\n", |
| 1299 | + "- For categorical (or boolean) columns the following metrics are computed:\n", |
| 1300 | + " * Missing value similarity *Infobox*: Compares whether the synthetic data has the same proportion of missing values as the real data for a given column;\n", |
| 1301 | + " * Category coverage *Infobox*: Measures whether a synthetic column covers all the possible categories that are present in a real column;\n", |
| 1302 | + " * Category adherence *Infobox*: Measures whether a synthetic column adheres to the same category values as the real data;\n", |
| 1303 | + " * Total variation (TV) complement *Infobox*: Computes the similarity of a real and synthetic categorical column in terms of the column shapes, i.e., the marginal distribution or 1D histogram of the column.\n", |
1310 | 1304 | "\n", |
1311 | 1305 | "💯 All values need to be close to 1.0 " |
1312 | 1306 | ] |
|
1793 | 1787 | "cell_type": "markdown", |
1794 | 1788 | "metadata": {}, |
1795 | 1789 | "source": [ |
1796 | | - "**UI text #7**\n", |
| 1790 | + "**UI text #9**\n", |
1797 | 1791 | "\n", |
1798 | 1792 | "Do you want to learn more about synthetic data?\n", |
1799 | 1793 | "- Source code of this tool:\n", |
|
1805 | 1799 | "- [CART: synthpop resources](https://synthpop.org.uk/resources.html)\n", |
1806 | 1800 | "- [Gaussian Copula - Synthetic Data Vault](https://docs.sdv.dev/sdv)\n" |
1807 | 1801 | ] |
| 1802 | + }, |
| 1803 | + { |
| 1804 | + "cell_type": "markdown", |
| 1805 | + "metadata": {}, |
| 1806 | + "source": [] |
1808 | 1807 | } |
1809 | 1808 | ], |
1810 | 1809 | "metadata": { |
|