Commit f8d00c2

committed
Content update NL EN SDG.md
1 parent 4532b79 commit f8d00c2

3 files changed

+144
-72
lines changed

content/english/technical-tools/SDG.md

Lines changed: 55 additions & 23 deletions
Original file line numberDiff line numberDiff line change
@@ -108,52 +108,84 @@ Try the tool below ⬇️
108108
The synthetic data generation tool performs a series of steps:
109109

110110
#### Required preparations by the user:
111-
The user shoulds prepare the following aspects relating to the processed data:
112-
- <span style="color:#005AA7">Dataset:</span> Categorical, numerical or time data.
113-
- <span style="color:#005AA7">Method:</span> By default, the CART method is used to generate synthetic data. CART generally produces higher quality synthetic data, but might not work well on datasets with categorical variables with 20+ categories. Use Gaussian Copula in those cases.',
111+
The user should prepare the following to synthesize data:
112+
- <span style="color:#005AA7">Dataset:</span> Should consist of categorical, numerical and/or time data.
113+
- <span style="color:#005AA7">Method:</span> By default, the CART method is used to generate synthetic data. CART generally produces higher quality synthetic data, but might not work well on datasets with categorical variables with 20+ categories. Use Gaussian Copula in those cases.
114114
- <span style="color:#005AA7">Number of synthetic data points:</span> Number of synthetic data points to be generated by the tool. Due to computational constraints of browser-based synthetic data generation, the maximum is set to 5,000.
115115

116116
#### Performed by the tool
117+
The following steps are performed by the tool:
117118

118-
##### Step 1. Detect data types:
119+
##### Step 1. Data type detection:
119120
The tool detects the type of data for each column of the attached dataset (numerical, categorical or datetime).
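A minimal sketch of how per-column type detection could work, using only the Python standard library. The function name `detect_column_type` and the parsing rules (try `float`, then ISO 8601 dates, else categorical) are assumptions for illustration and are not taken from the tool's source code.

```python
from datetime import datetime


def detect_column_type(values):
    """Classify a column as 'numerical', 'datetime' or 'categorical'.

    Hypothetical helper: empty strings and None are treated as missing
    and ignored during detection.
    """
    present = [v for v in values if v not in (None, "")]

    def all_parse(parser):
        # True when every non-missing value can be parsed by `parser`.
        try:
            for v in present:
                parser(v)
            return True
        except (TypeError, ValueError):
            return False

    if present and all_parse(float):
        return "numerical"
    if present and all_parse(datetime.fromisoformat):
        return "datetime"
    return "categorical"


# Illustrative data, not taken from the tool.
columns = {
    "age": ["34", "51", "", "27"],
    "signup": ["2024-01-05", "2024-02-11", "2024-03-02", ""],
    "city": ["Utrecht", "Den Haag", "Utrecht", "Leiden"],
}
detected = {name: detect_column_type(vals) for name, vals in columns.items()}
```

Checking numbers before dates keeps plain numeric strings from being misread as timestamps; any column that fits neither parser falls back to categorical.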
120121

121-
##### Step 2. Processing missing data:
122+
##### Step 2. Handling missing data:
122123
- <span style="color:#005AA7">Missing at Random (MAR):</span> The probability of data being missing is related to the observed data but not the missing data itself. The missingness can be predicted by other variables in the dataset. Example: Students' test scores are missing, but the missingness is related to their attendance records. In this scenario, missing data is imputed.
123124
- <span style="color:#005AA7">Missing Not at Random (MNAR):</span> The probability of data being missing is related to the missing data itself. There is a systematic pattern to the missingness that is related to the unobserved data. Example: Patients with more severe symptoms are less likely to report their symptoms, leading to missing data that is related to the severity of the symptoms. In this scenario, missing data is imputed.
124125
- <span style="color:#005AA7">Missing Completely at Random (MCAR):</span> The probability of data being missing is completely independent of both observed and unobserved data. There is no systematic pattern to the missingness. Example: A survey respondent accidentally skips a question due to a printing error. In this scenario missing data is removed.
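The handling rule described in the list above (impute for MAR and MNAR, remove for MCAR) could be sketched as follows. The text does not say which imputation method is used, so mean imputation is an assumption here, and `handle_missing` is a hypothetical helper name.

```python
from statistics import mean


def handle_missing(rows, column, mechanism):
    """Apply the MCAR/MAR/MNAR rule to one numerical column.

    Assumption: imputation uses the column mean; the actual tool may
    use a different imputation strategy.
    """
    if mechanism == "MCAR":
        # Missing completely at random: drop the incomplete rows.
        return [r for r in rows if r[column] is not None]
    # MAR or MNAR: fill the gaps instead of dropping rows.
    fill = mean(r[column] for r in rows if r[column] is not None)
    return [{**r, column: fill} if r[column] is None else r for r in rows]


rows = [{"score": 6.0}, {"score": None}, {"score": 8.0}]
dropped = handle_missing(rows, "score", "MCAR")
imputed = handle_missing(rows, "score", "MAR")
```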
125126

126-
##### Step 3. Preprocessing:
127+
More information on the concepts MCAR, MAR and MNAR can be found in the book by Prof. Stef van Buuren, Utrecht University.
128+
129+
##### Step [unnumbered] Preprocessing:
127130
Transforms all data into numerical format: `LabelEncoder` is used for categorical columns with fewer than 10 unique values, `OneHotEncoder` for categorical columns with fewer than 50 unique values, and `FrequencyEncoding` in all other cases. `StandardScaler` is used for numerical data.
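The encoder selection rule can be illustrated with a small standard-library sketch. `choose_encoder` and `standard_scale` are hypothetical helpers: the first only returns the name of the encoder that the thresholds above select, and the second reproduces the zero-mean, unit-variance scaling that `StandardScaler` performs (using the population standard deviation).

```python
from statistics import mean, pstdev


def choose_encoder(values):
    """Pick an encoder name from the number of unique categories."""
    n_unique = len(set(values))
    if n_unique < 10:
        return "LabelEncoder"
    if n_unique < 50:
        return "OneHotEncoder"
    return "FrequencyEncoding"


def standard_scale(values):
    """Scale numerical values to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    # A constant column has sigma == 0; map it to zeros to avoid dividing by zero.
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


few = choose_encoder(["red", "green", "blue"])
medium = choose_encoder([f"cat_{i}" for i in range(20)])
many = choose_encoder([f"cat_{i}" for i in range(60)])
scaled = standard_scale([1.0, 2.0, 3.0])
```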
128131

129-
##### Step 4. Synthesizing:
132+
##### Step 3. Synthesis method:
130133
- <span style="color:#005AA7">CART:</span> The CART (Classification and Regression Trees) method generates synthetic data by learning patterns from real data through a decision tree that splits data into homogeneous groups based on feature values. It predicts averages for numerical data and assigns the most common category for categorical data, using these predictions to create new synthetic points.
131134
- <span style="color:#005AA7">Gaussian Copula:</span> Gaussian Copula works in two main steps: 1. The real data is transformed into a uniform distribution. Correlations between variables are modeled using a multivariate normal distribution (the Gaussian copula); and 2. Synthetic data is created by sampling from this Gaussian copula and transforming the samples back to the original data distributions.
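The two Gaussian Copula steps can be sketched for a pair of numerical columns with the standard library alone. This is a simplified illustration under stated assumptions, not the tool's implementation: marginals are handled with ranks and empirical quantiles, only two variables are modeled, and all function names are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def normal_scores(values):
    """Step 1a: ranks -> uniform values in (0, 1) -> standard normal scores."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def pearson(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def empirical_quantile(sorted_values, u):
    """Map a uniform value back onto the real column's distribution."""
    index = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[index]


def sample_gaussian_copula(x, y, n_samples, seed=0):
    rng = random.Random(seed)
    # Step 1b: model the dependence as a correlation between normal scores.
    rho = pearson(normal_scores(x), normal_scores(y))
    residual = max(0.0, 1 - rho ** 2) ** 0.5
    xs, ys = sorted(x), sorted(y)
    synthetic = []
    for _ in range(n_samples):
        # Step 2: sample correlated normals, then transform back.
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + residual * rng.gauss(0, 1)
        synthetic.append((empirical_quantile(xs, STD_NORMAL.cdf(z1)),
                          empirical_quantile(ys, STD_NORMAL.cdf(z2))))
    return synthetic


real_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
real_y = [2.1, 6.2, 3.9, 8.1, 12.2, 9.8]
synthetic = sample_gaussian_copula(real_x, real_y, n_samples=50)
```

Because this sketch maps samples back through empirical quantiles, every synthetic value is one that occurs in the real column; a production implementation would typically fit smooth marginal distributions instead.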
132135

133-
##### Step 5. Postprocessing:
136+
##### Step [unnumbered] Postprocessing:
134137
Encoded data are transformed back to their original format.
135138

136-
##### Step 6. Evaluation:
137-
- <span style="color:#005AA7">Visualization:</span> Univariate and bivariate plots are created for the generated synthetic data. For comparing categorical variables, bar charts are plotted. For comparing a numerical and a categorical variables, violin plot are created. For comparing numercial variables, a LOESS plot is shown. For all plots holds: the synthetic data is of high quality when the shape of the distributions in the synthetic data equal the distributions in the real data.
138-
- <span style="color:#005AA7">Diagnostic report:</span> For each column, diagnostic results are computed for the quality of the generated synthetic data. The computed metrics depend on the type of data.
139+
##### Step 4. Evaluation:
140+
The generated synthetic data are evaluated in various ways.
141+
142+
###### Step 4.1 Visualization:
143+
Univariate and bivariate plots are created to compare the generated synthetic data with the real data. For comparing categorical variables, bar charts are plotted. For comparing a numerical and a categorical variable, a <a href="https://en.wikipedia.org/wiki/Violin_plot" target="_blank">violin plot</a> is created. For comparing numerical variables, a <a href="https://en.wikipedia.org/wiki/Local_regression" target="_blank">LOESS plot</a> is shown. For all plots, the synthetic data is of high quality when the shapes of the distributions in the synthetic data match those in the real data.
144+
145+
###### Step 4.2 Diagnostic report:
146+
For each column, diagnostic results are computed for the quality of the generated synthetic data. The computed metrics depend on the type of data.
139147

148+
###### Diagnostic results:
140149
For numerical (or datetime) columns the following metrics are computed:
141150

142-
- Missing value similarity: Compares whether the synthetic data has the same proportion of missing values as the real data for a given column;
143-
- Range coverage: Measures whether a synthetic column covers the full range of values that are present in a real column;
144-
- Boundary adherence: Measures whether a synthetic column respects the minimum and maximum values of the real column. It returns the percentage of synthetic rows that adhere to the real boundaries;
145-
- Statistic similarity: Measures the similarity between real column and a synthetic column by comparing the mean, standard deviation and median;
146-
- Kolmogorov–Smirnov (KS) complement: Computes the similarity of a real and synthetic numerical column in terms of the column shapes, i.e., the marginal distribution or 1D histogram of the column.
151+
- <span style="color:#005AA7">Missing value similarity:</span> Compares whether the synthetic data has the same proportion of missing values as the real data for a given column;
152+
- <span style="color:#005AA7">Range coverage:</span> Measures whether a synthetic column covers the full range of values that are present in a real column;
153+
- <span style="color:#005AA7">Boundary adherence:</span> Measures whether a synthetic column respects the minimum and maximum values of the real column. It returns the percentage of synthetic rows that adhere to the real boundaries;
154+
- <span style="color:#005AA7">Statistic similarity:</span> Measures the similarity between a real column and a synthetic column by comparing the mean, standard deviation and median;
155+
- <span style="color:#005AA7">Kolmogorov–Smirnov (KS) complement:</span> Computes one minus the maximum difference between the cumulative distribution functions (CDFs) of a numerical column in the synthetic and real dataset, so that 1.0 indicates identical distributions.
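Three of the numerical metrics above can be sketched in a few lines of standard-library Python. The function names and exact formulas are assumptions for illustration; in particular, the KS complement is computed here as one minus the largest gap between the two empirical CDFs, so that higher scores mean more similar columns.

```python
from bisect import bisect_right


def range_coverage(real, synthetic):
    """Share of the real [min, max] range that the synthetic column spans."""
    r_min, r_max = min(real), max(real)
    if r_max == r_min:
        return 1.0
    low_gap = max((min(synthetic) - r_min) / (r_max - r_min), 0.0)
    high_gap = max((r_max - max(synthetic)) / (r_max - r_min), 0.0)
    return max(1.0 - (low_gap + high_gap), 0.0)


def boundary_adherence(real, synthetic):
    """Fraction of synthetic values inside the real min/max boundaries."""
    r_min, r_max = min(real), max(real)
    return sum(r_min <= v <= r_max for v in synthetic) / len(synthetic)


def ks_complement(real, synthetic):
    """One minus the largest gap between the two empirical CDFs."""
    r, s = sorted(real), sorted(synthetic)
    ks = max(abs(bisect_right(r, x) / len(r) - bisect_right(s, x) / len(s))
             for x in r + s)
    return 1.0 - ks


# Illustrative values, not real tool output.
real = [1.0, 2.0, 3.0, 4.0, 5.0]
synthetic = [2.0, 3.0, 4.0, 6.0]
scores = {
    "range_coverage": range_coverage(real, synthetic),
    "boundary_adherence": boundary_adherence(real, synthetic),
    "ks_complement": ks_complement(real, synthetic),
}
```

In this toy example the synthetic column misses the lowest quarter of the real range and one of its four values exceeds the real maximum, so all three scores come out below 1.0.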
156+
157+
For categorical columns the following metrics are computed:
158+
159+
- <span style="color:#005AA7">Missing value similarity:</span> Compares whether the synthetic data has the same proportion of missing values as the real data for a given column;
160+
- <span style="color:#005AA7">Category coverage:</span> Measures whether a synthetic column covers all the possible categories that are present in a real column;
161+
- <span style="color:#005AA7">Category adherence:</span> Measures whether a synthetic column adheres to the same category values as the real data;
162+
- <span style="color:#005AA7">Total variation (TV) complement:</span> Computes one minus the total variation distance between the category frequencies of a categorical column in the synthetic and real dataset, where the total variation distance is half the sum of the absolute differences in category frequencies.
163+
164+
💯 For high-quality synthetic data, all values should be close to 1.0, but at least higher than 0.85.
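The categorical metrics can be sketched in the same way. The helper names are hypothetical, and the TV complement is written here as one minus the total variation distance between category frequencies, which is an assumption about the exact formula used by the tool.

```python
from collections import Counter


def category_coverage(real, synthetic):
    """Share of real categories that also occur in the synthetic column."""
    real_categories = set(real)
    return len(real_categories & set(synthetic)) / len(real_categories)


def category_adherence(real, synthetic):
    """Share of synthetic values whose category exists in the real column."""
    allowed = set(real)
    return sum(v in allowed for v in synthetic) / len(synthetic)


def tv_complement(real, synthetic):
    """One minus the total variation distance between category frequencies."""
    r, s = Counter(real), Counter(synthetic)
    tvd = 0.5 * sum(abs(r[c] / len(real) - s[c] / len(synthetic))
                    for c in set(r) | set(s))
    return 1.0 - tvd


# Illustrative values, not real tool output.
real = ["a", "a", "b", "c"]
synthetic = ["a", "b", "b", "d"]
scores = {
    "category_coverage": category_coverage(real, synthetic),
    "category_adherence": category_adherence(real, synthetic),
    "tv_complement": tv_complement(real, synthetic),
}
# Flag metrics that fall below the 0.85 guideline mentioned above.
below_guideline = sorted(name for name, value in scores.items() if value < 0.85)
```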
165+
166+
###### Correlation matrix:
167+
The pairwise correlations between variables in the synthetic and real data are calculated, indicating the strength and direction of their linear relationships. The correlation matrices of the generated synthetic data and the real data should display roughly the same patterns.
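As a rough sketch, the comparison can be reduced to computing a Pearson correlation matrix for each dataset and inspecting the largest entry-wise difference. The helper names are hypothetical, and summarizing the comparison with a single maximum gap is an illustrative choice; the tool itself displays the matrices.

```python
def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def correlation_matrix(columns):
    """Pairwise correlations for a dict of column name -> numerical values."""
    names = list(columns)
    return [[pearson(columns[a], columns[b]) for b in names] for a in names]


def max_correlation_gap(real, synthetic):
    """Largest absolute difference between the two correlation matrices."""
    real_m, syn_m = correlation_matrix(real), correlation_matrix(synthetic)
    return max(abs(r - s)
               for real_row, syn_row in zip(real_m, syn_m)
               for r, s in zip(real_row, syn_row))


# Illustrative data: the synthetic set reverses the x-y relationship.
real = {"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 8.0]}
synthetic = {"x": [1.0, 2.0, 3.0, 4.0], "y": [8.0, 6.0, 4.0, 2.0]}
gap = max_correlation_gap(real, synthetic)
```

A gap near 0 means the two matrices show the same patterns; the toy data above is a worst case, because the sign of the correlation flips.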
168+
169+
###### Efficacy metrics:
170+
Efficacy metrics compare the real and synthetic datasets on downstream predictive tasks, such as regression and classification. The idea is to train a predictive model on synthetic data and evaluate its performance on real data.
171+
172+
For regression (when the target variable is numerical), the following metrics are computed:
173+
- <span style="color:#005AA7">Mean squared error (MSE):</span> average squared difference between predicted and actual values, quantifying the accuracy of a model's predictions by penalizing larger errors more heavily;
174+
- <span style="color:#005AA7">Mean absolute error (MAE):</span> average magnitude of the errors between predicted and actual values, providing a straightforward assessment of model accuracy without emphasizing large errors;
175+
- <span style="color:#005AA7">R² score:</span> quantifies how well a model's predictions match the actual data by measuring the proportion of variance in the target variable explained by the model.
176+
177+
For classification (when the target variable is categorical), the following metrics are computed:
178+
- <span style="color:#005AA7">Accuracy score:</span> measures the proportion of correctly predicted instances out of the total instances, providing an overall assessment of a model's performance in classification tasks;
179+
- <span style="color:#005AA7">Weighted F1 score:</span> harmonic mean of precision and recall, calculated for each class and weighted by the class's support (number of true instances), providing a balanced performance measure for imbalanced datasets.
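The regression and classification metrics listed above follow standard definitions, sketched below without external libraries. The prediction lists are made-up placeholders standing in for a model that was trained on synthetic data and evaluated on real data; no model is trained in this example.

```python
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, weighted by the number of true instances per class."""
    total = 0.0
    for label in set(y_true):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        total += f1 * (tp + fn)  # tp + fn is the class support
    return total / len(y_true)


# Placeholder values: real targets vs. predictions of a hypothetical model.
real_values, predicted_values = [3.0, 5.0, 7.0], [2.0, 5.0, 9.0]
real_labels, predicted_labels = ["a", "a", "a", "b"], ["a", "a", "b", "b"]

regression_report = {
    "mse": mse(real_values, predicted_values),
    "mae": mae(real_values, predicted_values),
    "r2": r2_score(real_values, predicted_values),
}
classification_report = {
    "accuracy": accuracy(real_labels, predicted_labels),
    "weighted_f1": weighted_f1(real_labels, predicted_labels),
}
```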
147180

148-
For categorical (or boolean) columns the following metrics are computed:
181+
###### Privacy metrics:
182+
Computing the *disclosure protection metric* for synthetic data. This metric measures the proportion of synthetic records that are too similar (within a defined threshold) to real records, posing a disclosure risk.
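One way to picture such a metric is as the share of synthetic records that lie within a distance threshold of at least one real record. The sketch below assumes numerical records, Euclidean distance and a caller-supplied `threshold`; these choices and the function name `share_too_similar` are illustrative assumptions, not the tool's actual definition.

```python
from math import dist


def share_too_similar(real, synthetic, threshold):
    """Proportion of synthetic records within `threshold` of any real record.

    Assumption: records are tuples of numbers compared with Euclidean
    distance; the real metric may use a different distance or scaling.
    """
    too_close = sum(
        any(dist(s, r) <= threshold for r in real) for s in synthetic
    )
    return too_close / len(synthetic)


# Illustrative records, not real tool output.
real = [(0.0, 0.0), (10.0, 10.0)]
synthetic = [(0.1, 0.0), (5.0, 5.0), (10.0, 10.2)]
risk = share_too_similar(real, synthetic, threshold=0.5)
```

Under this reading, a lower proportion indicates better protection against disclosure.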
149183

150-
- Missing value similarity: Compares whether the synthetic data has the same proportion of missing values as the real data for a given column;
151-
- Category coverage: Measures whether a synthetic column covers all the possible categories that are present in a real column;
152-
- Category adherence: Measures whether a synthetic column adheres to the same category values as the real data;
153-
- Total variation (TV) complement: Computes the similarity of a real and synthetic categorical column in terms of the column shapes, i.e., the marginal distribution or 1D histogram of the column.
184+
##### Step 5. Download:
185+
The generated synthetic data can be downloaded as a CSV or JSON file. The evaluation of the synthetic data according to the above metrics can be downloaded as an evaluation report in PDF format.
154186

155-
💯 All values need to be close to 1.0
156-
- <span style="color:#005AA7">Download:</span> The generated synthetic data can de downloaded as csv and as json file. Evaluation of the synthetic data according to the above metrics can be downloaded as a evaluation report in pdf.
187+
#### Documentation
188+
More documentation about the tool and the underlying SDG methods can be found on <a href="https://github.com/NGO-Algorithm-Audit/python-synhtpop" target="_blank">GitHub</a>.
157189

158190
{{< container_close >}}
159191

content/nederlands/technical-tools/BDT.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -187,7 +187,7 @@ The HBAC algorithm maximizes the difference in outcome labels between clusters.
187187
- The source code of unsupervised bias detection using the HBAC algorithm is available on <a href="https://github.com/NGO-Algorithm-Audit/unsupervised-bias-detection" target="_blank">GitHub</a> and as a <a href="https://pypi.org/project/unsupervised-bias-detection/" target="_blank">pip package</a>: `pip install unsupervised-bias-detection`.
188188
[![!pypi](https://img.shields.io/pypi/v/unsupervised-bias-detection?logo=pypi\&color=blue)](https://pypi.org/project/unsupervised-bias-detection/)
189189

190-
- The architecture for using web apps local-first is also available on <a href="https://github.com/NGO-Algorithm-Audit/local-first-web-tool" target="_blank">GitHub</a>.
190+
- The architecture for using web apps local-only is also available on <a href="https://github.com/NGO-Algorithm-Audit/local-first-web-tool" target="_blank">GitHub</a>.
191191

192192
{{< container_close >}}
193193
