content/english/technical-tools/SDG.md
The synthetic data generation tool performs a series of steps:
#### Required preparations by the user:

The user should prepare the following aspects before synthesizing data:

- <span style="color:#005AA7">Dataset:</span> Should consist of categorical, numerical and/or time data.
- <span style="color:#005AA7">Method:</span> By default, the CART method is used to generate synthetic data. CART generally produces higher-quality synthetic data, but might not work well on datasets with categorical variables with 20+ categories. Use Gaussian Copula in those cases.
- <span style="color:#005AA7">Number of synthetic data points:</span> The number of synthetic data points to be generated by the tool. Due to computational constraints of browser-based synthetic data generation, the maximum is set to 5,000.
#### Performed by the tool

The following steps are performed by the tool:

##### Step 1. Data type detection:

The tool detects the type of data for each column of the attached dataset (numerical, categorical or datetime).
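
A minimal sketch of how this per-column detection could be done with pandas is shown below; it is an illustration under simplifying assumptions, not the tool's actual implementation.

```python
import pandas as pd
from pandas.api import types

def detect_column_types(df: pd.DataFrame) -> dict:
    """Classify each column as 'numerical', 'datetime' or 'categorical'."""
    detected = {}
    for col in df.columns:
        series = df[col]
        if types.is_numeric_dtype(series):
            detected[col] = "numerical"
        elif types.is_datetime64_any_dtype(series):
            detected[col] = "datetime"
        else:
            # Strings that parse as dates are treated as datetime,
            # everything else as categorical.
            try:
                pd.to_datetime(series.dropna())
                detected[col] = "datetime"
            except (ValueError, TypeError):
                detected[col] = "categorical"
    return detected
```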

##### Step 2. Handling missing data:

- <span style="color:#005AA7">Missing at Random (MAR):</span> The probability of data being missing is related to the observed data but not the missing data itself. The missingness can be predicted by other variables in the dataset. Example: students' test scores are missing, but the missingness is related to their attendance records. In this scenario missing data is imputed.
- <span style="color:#005AA7">Missing Not at Random (MNAR):</span> The probability of data being missing is related to the missing data itself. There is a systematic pattern to the missingness that is related to the unobserved data. Example: patients with more severe symptoms are less likely to report their symptoms, leading to missing data that is related to the severity of the symptoms. In this scenario missing data is imputed.
- <span style="color:#005AA7">Missing Completely at Random (MCAR):</span> The probability of data being missing is completely independent of both observed and unobserved data. There is no systematic pattern to the missingness. Example: a survey respondent accidentally skips a question due to a printing error. In this scenario missing data is removed.

More information on the concepts MCAR, MAR and MNAR can be found in the book *Flexible Imputation of Missing Data* by prof. Stef van Buuren, Utrecht University.
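
To illustrate the imputation-versus-removal logic described above, here is a minimal sketch; the missingness mechanism is passed in by hand, since distinguishing MCAR from MAR/MNAR generally requires domain knowledge, and the median/mode imputation is a simplification of a full imputation model.

```python
import pandas as pd
from pandas.api import types

def handle_missing(df: pd.DataFrame, mechanism: str) -> pd.DataFrame:
    """Impute for (assumed) MAR/MNAR, drop incomplete rows for (assumed) MCAR."""
    if mechanism == "MCAR":
        # No systematic pattern: removing incomplete rows does not bias the data.
        return df.dropna()
    # MAR / MNAR: impute instead of dropping. Here: median for numerical columns,
    # most frequent category for categorical columns.
    imputed = df.copy()
    for col in imputed.columns:
        if types.is_numeric_dtype(imputed[col]):
            imputed[col] = imputed[col].fillna(imputed[col].median())
        else:
            imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])
    return imputed
```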

##### Preprocessing:

The tool transforms all data into numerical format: `LabelEncoder` is used for categorical columns with fewer than 10 unique values, `OneHotEncoder` for categorical columns with fewer than 50 unique values, and `FrequencyEncoding` in all other cases. `StandardScaler` is used for numerical data.
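
A sketch of this encoding step with scikit-learn is given below; the thresholds follow the description above, and the frequency encoding is written out by hand since scikit-learn has no built-in `FrequencyEncoding` transformer. This is illustrative, not the tool's exact code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

def preprocess(df: pd.DataFrame, categorical_cols: list, numerical_cols: list) -> pd.DataFrame:
    """Encode categorical columns and scale numerical columns."""
    out = pd.DataFrame(index=df.index)
    for col in categorical_cols:
        n_unique = df[col].nunique()
        if n_unique < 10:
            out[col] = LabelEncoder().fit_transform(df[col].astype(str))
        elif n_unique < 50:
            ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
            encoded = pd.DataFrame(
                ohe.fit_transform(df[[col]].astype(str)),
                columns=ohe.get_feature_names_out([col]),
                index=df.index,
            )
            out = pd.concat([out, encoded], axis=1)
        else:
            # Frequency encoding: replace each category by its relative frequency.
            out[col] = df[col].map(df[col].value_counts(normalize=True))
    if numerical_cols:
        scaled = pd.DataFrame(
            StandardScaler().fit_transform(df[numerical_cols]),
            columns=numerical_cols,
            index=df.index,
        )
        out = pd.concat([out, scaled], axis=1)
    return out
```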

##### Step 3. Synthesizing:

- <span style="color:#005AA7">CART:</span> The CART (Classification and Regression Trees) method generates synthetic data by learning patterns from real data through a decision tree that splits data into homogeneous groups based on feature values. It predicts averages for numerical data and assigns the most common category for categorical data, using these predictions to create new synthetic points.
- <span style="color:#005AA7">Gaussian Copula:</span> Gaussian Copula works in two main steps: 1. The real data is transformed into a uniform distribution and the correlations between variables are modeled using a multivariate normal distribution (the Gaussian copula); 2. Synthetic data is created by sampling from this Gaussian copula and transforming the samples back to the original data distributions.
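
To make the Gaussian Copula approach concrete, below is a minimal sketch for numerical columns only; it follows the two steps described above, while the tool itself also handles encoded categorical data.

```python
import numpy as np
import pandas as pd
from scipy import stats

def gaussian_copula_synthesize(real: pd.DataFrame, n_samples: int, seed: int = 0) -> pd.DataFrame:
    """Toy Gaussian-copula synthesizer for purely numerical data."""
    rng = np.random.default_rng(seed)
    # Step 1: map each column to uniform margins via its empirical CDF (ranks),
    # then to standard-normal scores, and model the dependence with a
    # multivariate normal distribution (the Gaussian copula).
    u = real.rank(method="average") / (len(real) + 1)
    z = stats.norm.ppf(u)
    corr = np.corrcoef(z, rowvar=False)
    # Step 2: sample from the copula and map the samples back to the original
    # marginal distributions using the empirical quantiles of the real data.
    z_new = rng.multivariate_normal(np.zeros(real.shape[1]), corr, size=n_samples)
    u_new = stats.norm.cdf(z_new)
    synthetic = {
        col: np.quantile(real[col], u_new[:, i])
        for i, col in enumerate(real.columns)
    }
    return pd.DataFrame(synthetic)
```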

##### Postprocessing:

The encoded data are transformed back to their original format.
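
For instance, with the scikit-learn transformers from the preprocessing sketch, the inverse mapping could look roughly as follows; the `scaler` and `label_encoders` arguments are assumed to be the fitted objects kept from preprocessing and the names are illustrative only.

```python
import numpy as np
import pandas as pd

def postprocess(synthetic_encoded: pd.DataFrame, scaler, label_encoders: dict,
                numerical_cols: list) -> pd.DataFrame:
    """Map encoded synthetic data back to the original scales and categories."""
    out = synthetic_encoded.copy()
    # Undo the standardization of numerical columns.
    out[numerical_cols] = scaler.inverse_transform(out[numerical_cols])
    # Undo the label encoding; synthesized values are rounded and clipped
    # to valid class indices before decoding.
    for col, le in label_encoders.items():
        codes = np.clip(out[col].round().astype(int), 0, len(le.classes_) - 1)
        out[col] = le.inverse_transform(codes)
    return out
```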

##### Step 4. Evaluation:

The generated synthetic data are evaluated in various ways.

###### Step 4.1 Visualization:

Univariate and bivariate plots are created to compare the generated synthetic data with the real data. For comparing categorical variables, bar charts are plotted. For comparing a numerical and a categorical variable, a <a href="https://en.wikipedia.org/wiki/Violin_plot" target="_blank">violin plot</a> is created. For comparing numerical variables, a <a href="https://en.wikipedia.org/wiki/Local_regression" target="_blank">LOESS plot</a> is shown. For all plots, the synthetic data is of high quality when the shape of the distributions in the synthetic data matches the distributions in the real data.
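
As an illustration, the sketch below draws a split violin plot of a numerical column per category for the real versus the synthetic data; the column names in the usage comment are placeholders.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def compare_violin(real: pd.DataFrame, synthetic: pd.DataFrame, num_col: str, cat_col: str):
    """Violin plots of a numerical column per category, real vs synthetic."""
    combined = pd.concat([
        real[[num_col, cat_col]].assign(source="real"),
        synthetic[[num_col, cat_col]].assign(source="synthetic"),
    ])
    fig, ax = plt.subplots(figsize=(8, 4))
    sns.violinplot(data=combined, x=cat_col, y=num_col, hue="source", split=True, ax=ax)
    ax.set_title(f"{num_col} by {cat_col}: real vs synthetic")
    return fig

# Usage (placeholder column names):
# fig = compare_violin(real_df, synthetic_df, num_col="age", cat_col="gender")
# fig.savefig("violin_age_by_gender.png")
```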

###### Step 4.2 Diagnostic report:

For each column, diagnostic results are computed for the quality of the generated synthetic data. The computed metrics depend on the type of data.

###### Diagnostic results:

For numerical (or datetime) columns the following metrics are computed:

- <span style="color:#005AA7">Missing value similarity:</span> Compares whether the synthetic data has the same proportion of missing values as the real data for a given column;
- <span style="color:#005AA7">Range coverage:</span> Measures whether a synthetic column covers the full range of values that are present in a real column;
- <span style="color:#005AA7">Boundary adherence:</span> Measures whether a synthetic column respects the minimum and maximum values of the real column. It returns the percentage of synthetic rows that adhere to the real boundaries;
- <span style="color:#005AA7">Statistic similarity:</span> Measures the similarity between a real column and a synthetic column by comparing the mean, standard deviation and median;
- <span style="color:#005AA7">Kolmogorov–Smirnov (KS) complement:</span> Measures the similarity of the distributions of a numerical column in the synthetic and real dataset as one minus the Kolmogorov–Smirnov statistic, i.e., the maximum difference between their cumulative distribution functions (CDFs).
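
A sketch of how some of these scores can be computed directly is shown below; these are simplified re-implementations for clean numerical columns, not the exact functions of an existing metrics library.

```python
import numpy as np
from scipy import stats

def ks_complement(real_col: np.ndarray, synth_col: np.ndarray) -> float:
    """1 minus the Kolmogorov-Smirnov statistic; 1.0 means identical marginal shapes."""
    result = stats.ks_2samp(real_col, synth_col)
    return 1.0 - result.statistic

def boundary_adherence(real_col: np.ndarray, synth_col: np.ndarray) -> float:
    """Share of synthetic values that fall within the real column's min-max range."""
    lo, hi = real_col.min(), real_col.max()
    return float(np.mean((synth_col >= lo) & (synth_col <= hi)))

def range_coverage(real_col: np.ndarray, synth_col: np.ndarray) -> float:
    """Fraction of the real min-max range covered by the synthetic data (capped at 1)."""
    real_range = real_col.max() - real_col.min()
    if real_range == 0:
        return 1.0
    covered = min(synth_col.max(), real_col.max()) - max(synth_col.min(), real_col.min())
    return float(np.clip(covered / real_range, 0.0, 1.0))
```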

For categorical columns the following metrics are computed:

- <span style="color:#005AA7">Missing value similarity:</span> Compares whether the synthetic data has the same proportion of missing values as the real data for a given column;
- <span style="color:#005AA7">Category coverage:</span> Measures whether a synthetic column covers all the possible categories that are present in a real column;
- <span style="color:#005AA7">Category adherence:</span> Measures whether a synthetic column adheres to the same category values as the real data;
- <span style="color:#005AA7">Total variation (TV) complement:</span> Measures the similarity of the category frequencies of a categorical column in the synthetic and real dataset as one minus the total variation distance between the two distributions.
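
Similarly, a minimal sketch of two of the categorical scores (again simplified re-implementations, not library code):

```python
import pandas as pd

def tv_complement(real_col: pd.Series, synth_col: pd.Series) -> float:
    """1 minus the total variation distance between the category frequencies."""
    real_freq = real_col.value_counts(normalize=True)
    synth_freq = synth_col.value_counts(normalize=True)
    categories = real_freq.index.union(synth_freq.index)
    tv_distance = 0.5 * sum(
        abs(real_freq.get(c, 0.0) - synth_freq.get(c, 0.0)) for c in categories
    )
    return 1.0 - tv_distance

def category_coverage(real_col: pd.Series, synth_col: pd.Series) -> float:
    """Share of real categories that also appear in the synthetic column."""
    real_cats = set(real_col.dropna().unique())
    synth_cats = set(synth_col.dropna().unique())
    return len(real_cats & synth_cats) / len(real_cats)
```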

💯 For high-quality synthetic data, all values should be close to 1.0, and at least above 0.85.

###### Correlation matrix:

The pairwise correlations between variables in the synthetic and real data are calculated, indicating the strength and direction of their linear relationships. The correlation matrices of the generated synthetic and real data should roughly display the same patterns.
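
For example, the element-wise difference between the two correlation matrices can be inspected as follows:

```python
import pandas as pd

def compare_correlations(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Absolute element-wise difference between real and synthetic correlation matrices."""
    real_corr = real.corr(numeric_only=True)
    synth_corr = synthetic.corr(numeric_only=True)
    # Small values indicate that the synthetic data preserves the linear relationships.
    return (real_corr - synth_corr).abs()
```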

###### Efficacy metrics:

Real and synthetic datasets are compared on downstream predictive tasks, such as regression and classification. The idea is to train a predictive model on the synthetic data and evaluate its performance on the real data (see the sketch after the metric lists below).

For regression (when the target variable is numerical), the following metrics are computed:

- <span style="color:#005AA7">Mean squared error (MSE):</span> average squared difference between predicted and actual values, quantifying the accuracy of a model's predictions by penalizing larger errors more heavily;
- <span style="color:#005AA7">Mean absolute error (MAE):</span> average magnitude of the errors between predicted and actual values, providing a straightforward assessment of model accuracy without emphasizing large errors;
- <span style="color:#005AA7">R² score:</span> quantifies how well a model's predictions match the actual data by measuring the proportion of variance in the target variable explained by the model.

For classification (when the target variable is categorical), the following metrics are computed:

- <span style="color:#005AA7">Accuracy score:</span> measures the proportion of correctly predicted instances out of the total instances, providing an overall assessment of a model's performance in classification tasks;
- <span style="color:#005AA7">Weighted F1 score:</span> harmonic mean of precision and recall, calculated for each class and weighted by the class's support (number of true instances), providing a balanced performance measure for imbalanced datasets.
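
The sketch below illustrates the train-on-synthetic, test-on-real idea for a classification target; the model choice and the target column name in the usage comment are placeholders, and both datasets are assumed to be numerically encoded.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

def efficacy_classification(real: pd.DataFrame, synthetic: pd.DataFrame, target: str) -> dict:
    """Train a classifier on synthetic data and evaluate it on real data."""
    model = RandomForestClassifier(random_state=0)
    model.fit(synthetic.drop(columns=target), synthetic[target])
    predictions = model.predict(real.drop(columns=target))
    return {
        "accuracy": accuracy_score(real[target], predictions),
        "weighted_f1": f1_score(real[target], predictions, average="weighted"),
    }

# Usage with a hypothetical target column:
# scores = efficacy_classification(real_df, synthetic_df, target="defaulted")
```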

###### Privacy metrics:

The tool computes the *disclosure protection metric* for the synthetic data. This metric measures the proportion of synthetic records that are too similar (within a defined threshold) to real records, posing a disclosure risk.
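
A simplified sketch of such a nearest-neighbour check is shown below; the distance threshold is an assumption and the actual metric in the tool may be defined differently.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def disclosure_risk(real: np.ndarray, synthetic: np.ndarray, threshold: float = 0.1) -> float:
    """Proportion of synthetic records closer than `threshold` to any real record.

    Both arrays are assumed to be numerically encoded and scaled.
    A lower proportion means better disclosure protection.
    """
    nn = NearestNeighbors(n_neighbors=1).fit(real)
    distances, _ = nn.kneighbors(synthetic)
    return float(np.mean(distances[:, 0] < threshold))
```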

##### Step 5. Download:

The generated synthetic data can be downloaded as a CSV or JSON file. The evaluation of the synthetic data according to the above metrics can be downloaded as an evaluation report in PDF.

#### Documentation

More documentation about the tool and the underlying SDG methods can be found on <a href="https://github.com/NGO-Algorithm-Audit/python-synhtpop" target="_blank">Github</a>.
content/nederlands/technical-tools/BDT.md
- The source code for unsupervised bias detection by means of the HBAC algorithm is available on <a href="https://github.com/NGO-Algorithm-Audit/unsupervised-bias-detection" target="_blank">Github</a> and as a <a href="https://pypi.org/project/unsupervised-bias-detection/" target="_blank">pip package</a>: `pip install unsupervised-bias-detection`.
- The architecture for using web apps local-only is also available on <a href="https://github.com/NGO-Algorithm-Audit/local-first-web-tool" target="_blank">Github</a>.