Commit 1b7a67d

Merge pull request #94 from NGO-Algorithm-Audit/JFP_edits
NL translation SDG tool
2 parents 68d3cf5 + 9322ee8 commit 1b7a67d

2 files changed

+106
-54
lines changed

src/locales/en.ts

Lines changed: 24 additions & 30 deletions
@@ -133,10 +133,7 @@ export const en = {
133133
fieldset: {
134134
sourceDataset: 'Input',
135135
dataSet: 'Dataset',
136-
dataSetTooltip: `Preprocess your data such that:
137-
- missing values are removed or replaced;
138-
- all columns (except your outcome label column) should have the same datatypes, e.g., numerical or categorical;
139-
- the outcome label column is numerical`,
136+
dataSetTooltip: `Only categorical, numerical, or time series data can be processed. Datasets may contain a maximum of 8 columns, must have a header with column names and don't require an index column`,
140137
sdgMethod: {
141138
title: 'Method',
142139
cart: 'CART',
@@ -155,7 +152,7 @@ export const en = {
155152
},
156153
actions: {
157154
tryItOut: 'Demo dataset',
158-
runGeneration: 'Run synthetic data generation',
155+
runGeneration: 'Start synthetic data generation',
159156
analyzing: 'Analyzing...',
160157
initializing: 'Initialising...',
161158
},
@@ -183,27 +180,27 @@ export const en = {
183180
cartModelTitle: '3. Method: CART model',
184181
gaussianCopulaModelTitle: '3. Method: Gaussian Copula model',
185182
cartModelDescription:
186-
'The CART (Classification and Regression Trees) method generates synthetic data by learning patterns from real data through a decision tree that splits data into homogeneous groups based on feature values. It predicts averages for numerical data and assigns the most common category for categorical data, using these predictions to create new synthetic points.',
187-
evaluationOfGeneratedDataTitle: '4. Evaluation of generated data',
183+
'The CART (Classification and Regression Trees) method generates synthetic data by learning patterns from real data through a decision tree that splits data into homogeneous groups based on feature values. It predicts averages for numerical data and assigns the most common category for categorical data, using these predictions to create new synthetic points.\n \n {{samples}} synthetic data points are generated.',
184+
evaluationOfGeneratedDataTitle: '4. Evaluation of generated synthetic data',
188185
distributionsTitle: '4.1 Distributions',
189-
diagnosticsReportTitle: '4.2. Diagnostic Report',
186+
diagnosticsReportTitle: '4.2. Diagnostic report',
190187
diagnosticsTitle: 'Diagnostic Results',
191188
diagnosticsReportDescription: `For each column, diagnostic results are computed for the quality of the generated synthetic data. The computed metrics depend on the type of data.
192189
193-
For numerical (or datetime) columns the following metrics are computed:
190+
For numerical or datetime columns the following metrics are computed:
194191
- {tooltip:syntheticData.missingValueSimilarity}Missing value similarity{/tooltip}
195192
- {tooltip:syntheticData.rangeCoverage}Range coverage{/tooltip}
196193
- {tooltip:syntheticData.boundaryAdherenc}Boundary adherence{/tooltip}
197194
- {tooltip:syntheticData.statisticSimilarity}Statistic similarity{/tooltip}
198195
- {tooltip:syntheticData.kolmogorovSmirnovComplement}Kolmogorov–Smirnov (KS) complement{/tooltip}
199196
200-
For categorical (or boolean) columns the following metrics are computed:
197+
For categorical columns the following metrics are computed:
201198
- {tooltip:syntheticData.missingValueSimilarity}Missing value similarity{/tooltip}
202199
- {tooltip:syntheticData.categoryCoverage}Category coverage{/tooltip}
203200
- {tooltip:syntheticData.categoryAdherence}Category adherence{/tooltip}
204201
- {tooltip:syntheticData.totalVariationComplement}Total variation (TV) complement{/tooltip}
205202
206-
💯 All values need to be close to 1.0 `,
203+
💯 For high-quality synthetic data, all values should be close to 1.0, but at least higher than 0.85.`,
207204
missingValueSimilarity:
208205
'Compares whether the synthetic data has the same proportion of missing values as the real data for a given column',
209206
rangeCoverage:
@@ -221,9 +218,9 @@ For categorical (or boolean) columns the following metrics are computed:
221218
totalVariationComplement:
222219
'Computes the similarity of a real and synthetic categorical column in terms of the column shapes, i.e., the marginal distribution or 1D histogram of the column.',
223220
correlationMatrixTitle: 'Correlation matrix',
224-
correlationMatrixDescription: `The matrices below display the pairwise correlations in the original and synthetic data. Green cells represent weak pairwise correlations, while red cells denote strong pairwise correlations. The color patterns in the two matrices should appear identical.`,
221+
correlationMatrixDescription: `The matrices below display the pairwise correlations in the original and synthetic data. Green cells represent weak pairwise correlations, while red cells denote strong pairwise correlations. The color patterns in the two matrices should appear roughly similar.`,
225222
efficacyMetricsTitle: 'Efficacy metrics',
226-
efficacyMetricsDescription: `Efficacy metrics comparing real and synthetic datasets for downstream predictive tasks. The idea is to train a predictive model on synthetic data and evaluate its performance on real data. The type of metrics computed depends on the task:
223+
efficacyMetricsDescription: `Efficacy metrics compare real and synthetic datasets for predictive tasks. The idea is to train a predictive model on synthetic data and evaluate the model's performance on real data. The type of efficacy metric depends on the task:
227224
228225
For regression (when the target is numerical):
229226
- {tooltip:syntheticData.meanSquaredError}Mean squared error (MSE){/tooltip}
@@ -234,9 +231,9 @@ For classification (when the target is categorical):
234231
- {tooltip:syntheticData.accuracyScore}Accuracy Score{/tooltip}
235232
- {tooltip:syntheticData.weightedF1Score}Weighted F1 Score{/tooltip}`,
236233
disclosureProtectionTitle: 'Privacy metrics',
237-
disclosureProtectionDescription: `The disclosure protection metric measures the proportion of synthetic data points that closely resemble real data points (within a predefined threshold), posing a risk of traceability to personal data. A low 'risk\_rate' and a high 'disclosure\_protection\_rate' indicate effective protection against the unintentional exposure of personal data.`,
234+
disclosureProtectionDescription: `The *disclosure protection metric* measures the proportion of synthetic data points that closely resemble real data points (within a predefined threshold), posing a risk of traceability to personal data. A low 'risk\_rate' and a high 'disclosure\_protection\_rate' indicate effective protection against the unintentional exposure of personal data.`,
238235
outputDataTitle: '5. Download synthetic data and evaluation report',
239-
outputDataDescription: 'Preview of generated synthetic data',
236+
outputDataDescription: 'Preview of generated synthetic data:',
240237
moreInfoTitle: '6. More information',
241238
meanSquaredError:
242239
'Average squared difference between predicted and actual values, quantifying the accuracy of a model’s predictions by penalizing larger errors more heavily',
@@ -250,28 +247,25 @@ For classification (when the target is categorical):
250247
correlationDifference:
251248
'Correlation difference: {{correlationDifference}}',
252249
univariateText:
253-
'<br>{{samples}} synthetic data points are generated using CART. The figures below display the value frequency for each variable. The synthetic data is of high quality when the frequencies are approximately the same.',
250+
'<br>The figures below display the distribution for each variable. The synthetic data are of high quality when the distributions are roughly the same.',
254251
bivariateText:
255-
'The figures below display the differences in value frequency for a combination of variables. For comparing two categorical variables, bar charts are plotted. For comparing a numerical and a categorical variable, a so-called [violin plot](https://en.wikipedia.org/wiki/Violin_plot) is shown. For comparing two numerical variables, a [LOESS plot](https://en.wikipedia.org/wiki/Local_regression) is created. For all plots holds: the synthetic data is of high quality when the shape of the distributions in the synthetic data equal the distributions in the real data.',
252+
'The figures below display the differences in distributions for a combination of two variables. For comparing two categorical variables, bar charts are plotted. For comparing a numerical and a categorical variable, a so-called [violin plot](https://en.wikipedia.org/wiki/Violin_plot) is shown. For comparing two numerical variables, a [LOESS plot](https://en.wikipedia.org/wiki/Local_regression) is created. For all plots holds: the synthetic data is of high quality when the shapes of the distributions are roughly the same.',
256253
moreInfo:
257254
'Do you want to learn more about synthetic data?\n \n \n \n- [python-synthpop on Github](https://github.com/NGO-Algorithm-Audit/python-synthpop)\n- [local-first web app on Github](https://github.com/NGO-Algorithm-Audit/local-first-web-tool/tree/main)\n- [Synthetic Data: what, why and how?](https://royalsociety.org/-/media/policy/projects/privacy-enhancing-technologies/Synthetic_Data_Survey-24.pdf)\n- [Knowledge Network Synthetic Data](https://online.rijksinnovatiecommunity.nl/groups/399-kennisnetwerk-synthetischedata/welcome) (Dutch public organizations)\n- [Synthetic data portal of Dutch Executive Agency for Education](https://duo.nl/open_onderwijsdata/footer/synthetische-data.jsp) (DUO)\n- [CART: synthpop resources](https://synthpop.org.uk/resources.html)\n- [Gaussian Copula - Synthetic Data Vault](https://docs.sdv.dev/sdv)',
258255
missingData: `For {tooltip:syntheticData.missingDataMARTooltip}Missing At Random (MAR){/tooltip} and {tooltip:syntheticData.missingDataMNARTooltip}Missing Not At Random (MNAR){/tooltip} data,
259-
we recommend to impute the missing data. For {tooltip:syntheticData.missingDataMCARTooltip}Missing Completely At Random (MCAR){/tooltip}, we recommend to remove the missing data.`,
256+
missing data are imputed. For {tooltip:syntheticData.missingDataMCARTooltip}Missing Completely At Random (MCAR){/tooltip}, missing data is removed.\n \n More information about the concepts MCAR, MAR and MNAR can be found in the book [Flexible Imputation of Missing Data](https://stefvanbuuren.name/fimd/sec-MCAR.html) by prof. Stef van Buuren, Utrecht University.`,
260257
missingDataMARTooltip: `**MAR (Missing At Random)**:
261-
- The probability of data being missing is related to the observed data but not the missing data itself.
262-
- The missingness can be predicted by other variables in the dataset.
263-
- Example: Students' test scores are missing, but the missingness is related to their attendance records.
264-
- Recommendation: impute missing data.`,
258+
- The probability of data being missing is related to the observed data but not the missing data itself. The missingness can be predicted by other variables in the dataset;
259+
- Example: students' test scores are missing, but the missingness is related to their attendance records;
260+
- MAR data are imputed`,
265261
missingDataMNARTooltip: `**MNAR (Missing Not At Random)**:
266-
- The probability of data being missing is related to the missing data itself.
267-
- There is a systematic pattern to the missingness that is related to the unobserved data.
268-
- Example: Patients with more severe symptoms are less likely to report their symptoms, leading to missing data that is related to the severity of the symptoms.
269-
- Recommendation: impute missing data.`,
262+
- The probability of data being missing is related to the missing data itself. There is a systematic pattern to the missingness that is related to the unobserved data;
263+
- Example: patients with more severe symptoms are less likely to report their symptoms, leading to missing data that is related to the severity of the symptoms;
264+
- MNAR data are imputed`,
270265
missingDataMCARTooltip: `**MCAR (Missing Completely At Random)**:
271-
- The probability of data being missing is completely independent of both observed and unobserved data.
272-
- There is no systematic pattern to the missingness.
273-
- Example: A survey respondent accidentally skips a question due to a printing error.
274-
- Recommendation: remove missing data.`,
266+
- The probability of data being missing is completely independent of both observed and unobserved data. There is no systematic pattern to the missingness;
267+
- Example: a survey respondent accidentally skips a question due to a printing error;
268+
- MCAR data are removed`,
275269
},
276270

277271
biasAnalysis: {
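
Note on the diagnostic metrics described in the strings above: the sketch below illustrates how the two column-shape scores, the total variation (TV) complement and the Kolmogorov–Smirnov (KS) complement, could be computed. It assumes the textbook definitions (1 minus the total variation distance between category frequencies, and 1 minus the two-sample KS statistic). The function names `tvComplement`, `ksComplement` and `categoryFrequencies` are hypothetical and do not come from this repository; the tool's actual implementation may differ.

```typescript
// Illustrative sketch only; these helpers are hypothetical and not part of the commit.

// Relative frequency of each category in a column.
function categoryFrequencies(values: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const v of values) {
    counts[v] = (counts[v] || 0) + 1;
  }
  const freq: Record<string, number> = {};
  for (const key of Object.keys(counts)) {
    freq[key] = counts[key] / values.length;
  }
  return freq;
}

// TV complement: 1 - 0.5 * sum over categories of |p_real - p_synthetic|.
function tvComplement(real: string[], synthetic: string[]): number {
  const p = categoryFrequencies(real);
  const q = categoryFrequencies(synthetic);
  const categories = Object.keys({ ...p, ...q });
  let distance = 0;
  for (const c of categories) {
    distance += Math.abs((p[c] || 0) - (q[c] || 0));
  }
  return 1 - 0.5 * distance;
}

// KS complement: 1 - max |F_real(x) - F_synthetic(x)| over all observed x,
// where F is the empirical cumulative distribution function.
function ksComplement(real: number[], synthetic: number[]): number {
  if (real.length === 0 || synthetic.length === 0) {
    throw new Error('Both columns must contain at least one value');
  }
  const a = real.slice().sort((x, y) => x - y);
  const b = synthetic.slice().sort((x, y) => x - y);
  let i = 0;
  let j = 0;
  let maxGap = 0;
  while (i < a.length && j < b.length) {
    const x = Math.min(a[i], b[j]);
    // Advance both pointers past every value <= x, then compare the two CDFs at x.
    while (i < a.length && a[i] <= x) i++;
    while (j < b.length && b[j] <= x) j++;
    maxGap = Math.max(maxGap, Math.abs(i / a.length - j / b.length));
  }
  return 1 - maxGap;
}

// Example: the real column is 50% 'a', the synthetic column is 75% 'a'.
const tv = tvComplement(['a', 'a', 'b', 'b'], ['a', 'a', 'a', 'b']); // 0.75
const ks = ksComplement([1, 2, 3, 4], [3, 4, 5, 6]); // 0.5
console.log(tv, ks);
```

Both example scores fall below the 0.85 threshold that the updated `diagnosticsReportDescription` string names as the minimum for high-quality synthetic data.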
