src/locales/en.ts
24 additions & 30 deletions
@@ -133,10 +133,7 @@ export const en = {
133
133
fieldset: {
134
134
sourceDataset: 'Input',
135
135
dataSet: 'Dataset',
136
-
dataSetTooltip: `Preprocess your data such that:
137
-
- missing values are removed or replaced;
138
-
- all columns (except your outcome label column) should have the same datatypes, e.g., numerical or categorical;
139
-
- the outcome label column is numerical`,
136
+
dataSetTooltip: `Only categorical, numerical, or time series data can be processed. Datasets may contain a maximum of 8 columns, must have a header with column names and don't require an index column`,
'The CART (Classification and Regression Trees) method generates synthetic data by learning patterns from real data through a decision tree that splits data into homogeneous groups based on feature values. It predicts averages for numerical data and assigns the most common category for categorical data, using these predictions to create new synthetic points.',
187
-
evaluationOfGeneratedDataTitle: '4. Evaluation of generated data',
183
+
'The CART (Classification and Regression Trees) method generates synthetic data by learning patterns from real data through a decision tree that splits data into homogeneous groups based on feature values. It predicts averages for numerical data and assigns the most common category for categorical data, using these predictions to create new synthetic points.\n \n {{samples}} synthetic data points are generated.',
184
+
evaluationOfGeneratedDataTitle: '4. Evaluation of generated synthetic data',
188
185
distributionsTitle: '4.1 Distributions',
189
-
diagnosticsReportTitle: '4.2. Diagnostic Report',
186
+
diagnosticsReportTitle: '4.2. Diagnostic report',
190
187
diagnosticsTitle: 'Diagnostic Results',
191
188
diagnosticsReportDescription: `For each column, diagnostic results are computed for the quality of the generated synthetic data. The computed metrics depend on the type of data.
192
189
193
-
For numerical (or datetime) columns the following metrics are computed:
190
+
For numerical or datetime columns the following metrics are computed:
194
191
- {tooltip:syntheticData.missingValueSimilarity}Missing value similarity{/tooltip}
💯 For high-quality synthetic data, all values should be close to 1.0, but at least higher than 0.85.`,
207
204
missingValueSimilarity:
208
205
'Compares whether the synthetic data has the same proportion of missing values as the real data for a given column',
209
206
rangeCoverage:
@@ -221,9 +218,9 @@ For categorical (or boolean) columns the following metrics are computed:
221
218
totalVariationComplement:
222
219
'Computes the similarity of a real and synthetic categorical column in terms of the column shapes, i.e., the marginal distribution or 1D histogram of the column.',
223
220
correlationMatrixTitle: 'Correlation matrix',
224
-
correlationMatrixDescription: `The matrices below display the pairwise correlations in the original and synthetic data. Green cells represent weak pairwise correlations, while red cells denote strong pairwise correlations. The color patterns in the two matrices should appear identical.`,
221
+
correlationMatrixDescription: `The matrices below display the pairwise correlations in the original and synthetic data. Green cells represent weak pairwise correlations, while red cells denote strong pairwise correlations. The color patterns in the two matrices should appear roughly similar.`,
225
222
efficacyMetricsTitle: 'Efficacy metrics',
226
-
efficacyMetricsDescription: `Efficacy metrics comparing real and synthetic datasets for downstream predictive tasks. The idea is to train a predictive model on synthetic data and evaluate its performance on real data. The type of metrics computed depends on the task:
223
+
efficacyMetricsDescription: `Efficacy metrics compare real and synthetic datasets for predictive tasks. The idea is to train a predictive model on synthetic data and evaluate the model's performance on real data. The type of efficacy metric depends on the task:
- {tooltip:syntheticData.weightedF1Score}Weighted F1 Score{/tooltip}`,
236
233
disclosureProtectionTitle: 'Privacy metrics',
237
-
disclosureProtectionDescription: `The disclosure protection metric measures the proportion of synthetic data points that closely resemble real data points (within a predefined threshold), posing a risk of traceability to personal data. A low 'risk\_rate' and a high 'disclosure\_protection\_rate' indicate effective protection against the unintentional exposure of personal data.`,
234
+
disclosureProtectionDescription: `The *disclosure protection metric* measures the proportion of synthetic data points that closely resemble real data points (within a predefined threshold), posing a risk of traceability to personal data. A low 'risk\_rate' and a high 'disclosure\_protection\_rate' indicate effective protection against the unintentional exposure of personal data.`,
238
235
outputDataTitle: '5. Download synthetic data and evaluation report',
239
-
outputDataDescription: 'Preview of generated synthetic data',
236
+
outputDataDescription: 'Preview of generated synthetic data:',
240
237
moreInfoTitle: '6. More information',
241
238
meanSquaredError:
242
239
'Average squared difference between predicted and actual values, quantifying the accuracy of a model’s predictions by penalizing larger errors more heavily',
@@ -250,28 +247,25 @@ For classification (when the target is categorical):
'<br>{{samples}} synthetic data points are generated using CART. The figures below display the value frequency for each variable. The synthetic data is of high quality when the frequencies are approximately the same.',
250
+
'<br>The figures below display the distribution for each variable. The synthetic data are of high quality when the distributions are roughly the same.',
254
251
bivariateText:
255
-
'The figures below display the differences in value frequency for a combination of variables. For comparing two categorical variables, bar charts are plotted. For comparing a numerical and a categorical variables, a so called [violin plot](https://en.wikipedia.org/wiki/Violin_plot) is shown. For comparing two numercial variables, a [LOESS plot](https://en.wikipedia.org/wiki/Local_regression) is created. For all plots holds: the synthetic data is of high quality when the shape of the distributions in the synthetic data equal the distributions in the real data.',
252
+
'The figures below display the differences in distributions for a combination of two variables. For comparing two categorical variables, bar charts are plotted. For comparing a numerical and a categorical variables, a so called [violin plot](https://en.wikipedia.org/wiki/Violin_plot) is shown. For comparing two numercial variables, a [LOESS plot](https://en.wikipedia.org/wiki/Local_regression) is created. For all plots holds: the synthetic data is of high quality when the shape of the distributions are roughly the same.',
256
253
moreInfo:
257
254
'Do you want to learn more about synthetic data?\n \n \n \n- [python-synthpop on Github](https://github.com/NGO-Algorithm-Audit/python-synthpop)\n- [local-first web app on Github](https://github.com/NGO-Algorithm-Audit/local-first-web-tool/tree/main)\n- [Synthetic Data: what, why and how?](https://royalsociety.org/-/media/policy/projects/privacy-enhancing-technologies/Synthetic_Data_Survey-24.pdf)\n- [Knowledge Network Synthetic Data](https://online.rijksinnovatiecommunity.nl/groups/399-kennisnetwerk-synthetischedata/welcome) (Dutch public organizations)\n- [Synthetic data portal of Dutch Executive Agency for Education](https://duo.nl/open_onderwijsdata/footer/synthetische-data.jsp) (DUO)\n- [CART: synthpop resources](https://synthpop.org.uk/resources.html)\n- [Gaussian Copula - Synthetic Data Vault](https://docs.sdv.dev/sdv)',
258
255
missingData: `For {tooltip:syntheticData.missingDataMARTooltip}Missing At Random (MAR){/tooltip} and {tooltip:syntheticData.missingDataMNARTooltip}Missing Not At Random (MNAR){/tooltip} data,
259
-
we recommend to impute the missing data. For {tooltip:syntheticData.missingDataMCARTooltip}Missing Completely At Random (MCAR){/tooltip}, we recommend to remove the missing data.`,
256
+
missing data are imputed. For {tooltip:syntheticData.missingDataMCARTooltip}Missing Completely At Random (MCAR){/tooltip}, missing data is removed.\n \n More information about the concepts MCAR, MAR en MNAR can be found in the book [Flexible Imputation of Missing Data](https://stefvanbuuren.name/fimd/sec-MCAR.html) by prof. Stef van Buuren, Utrecht University.`,
260
257
missingDataMARTooltip: `**MAR (Missing At Random)**:
261
-
- The probability of data being missing is related to the observed data but not the missing data itself.
262
-
- The missingness can be predicted by other variables in the dataset.
263
-
- Example: Students' test scores are missing, but the missingness is related to their attendance records.
264
-
- Recommendation: impute missing data.`,
258
+
- The probability of data being missing is related to the observed data but not the missing data itself. The missingness can be predicted by other variables in the dataset;
259
+
- Example: students' test scores are missing, but the missingness is related to their attendance records;
260
+
- MAR data are imputed`,
265
261
missingDataMNARTooltip: `**MNAR (Missing Not At Random)**:
266
-
- The probability of data being missing is related to the missing data itself.
267
-
- There is a systematic pattern to the missingness that is related to the unobserved data.
268
-
- Example: Patients with more severe symptoms are less likely to report their symptoms, leading to missing data that is related to the severity of the symptoms.
269
-
- Recommendation: impute missing data.`,
262
+
- The probability of data being missing is related to the missing data itself. There is a systematic pattern to the missingness that is related to the unobserved data;
263
+
- Example: patients with more severe symptoms are less likely to report their symptoms, leading to missing data that is related to the severity of the symptoms;
264
+
- MNAR data are imputed`,
270
265
missingDataMCARTooltip: `**MCAR (Missing Completely At Random)**:
271
-
- The probability of data being missing is completely independent of both observed and unobserved data.
272
-
- There is no systematic pattern to the missingness.
273
-
- Example: A survey respondent accidentally skips a question due to a printing error.
274
-
- Recommendation: remove missing data.`,
266
+
- The probability of data being missing is completely independent of both observed and unobserved data. There is no systematic pattern to the missingness;
267
+
- Example: a survey respondent accidentally skips a question due to a printing error;
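
The tooltip strings in this diff describe two diagnostic metrics, `missingValueSimilarity` and `totalVariationComplement`, only in words. The sketch below shows one way such scores could be computed. It assumes the definitions commonly used for metrics with these names (one minus the absolute difference in missing-value share, and one minus half the summed absolute difference in category shares). The helper names `missingShare` and `categoryShares` are hypothetical, and nothing here is taken from the tool's actual implementation.

```typescript
// Illustrative sketch only: assumed metric definitions, hypothetical helper names.

/** Share of missing entries (null, undefined, or NaN) in a column. */
function missingShare(column: ReadonlyArray<unknown>): number {
    if (column.length === 0) return 0;
    let missing = 0;
    for (const value of column) {
        const isNaNNumber = typeof value === 'number' && value !== value;
        if (value === null || value === undefined || isNaNNumber) missing += 1;
    }
    return missing / column.length;
}

/** 1.0 when real and synthetic columns have the same proportion of missing values. */
function missingValueSimilarity(
    real: ReadonlyArray<unknown>,
    synthetic: ReadonlyArray<unknown>
): number {
    return 1 - Math.abs(missingShare(real) - missingShare(synthetic));
}

/** Relative frequency of each category in a column (its 1D histogram). */
function categoryShares(column: ReadonlyArray<string>): Record<string, number> {
    const shares: Record<string, number> = {};
    for (const value of column) {
        shares[value] = (shares[value] || 0) + 1 / column.length;
    }
    return shares;
}

/** 1.0 when the marginal distributions of both categorical columns are identical. */
function totalVariationComplement(
    real: ReadonlyArray<string>,
    synthetic: ReadonlyArray<string>
): number {
    const realShares = categoryShares(real);
    const synthShares = categoryShares(synthetic);
    // Merge both key sets so categories present in only one column still count.
    const categories = Object.keys({ ...realShares, ...synthShares });
    let distance = 0;
    for (const category of categories) {
        distance += Math.abs((realShares[category] || 0) - (synthShares[category] || 0));
    }
    return 1 - distance / 2;
}

// Usage with made-up columns.
console.log(missingValueSimilarity([1, null, 3, 4], [1, 2, 3, 4])); // 0.75
console.log(totalVariationComplement(['a', 'a', 'b', 'b'], ['a', 'a', 'a', 'b'])); // 0.75
```

Both example scores fall below the 0.85 threshold that `diagnosticsReportDescription` gives for high-quality synthetic data, so columns like these would be flagged.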