Skip to content

Commit d3bb22a

Browse files
committed
Update readme and added author details
1 parent b09d3fe commit d3bb22a

File tree

4 files changed

+852
-207
lines changed

4 files changed

+852
-207
lines changed

README.md

Lines changed: 86 additions & 188 deletions
Original file line numberDiff line numberDiff line change
@@ -1,14 +1,21 @@
1-
![image](https://github.com/NGO-Algorithm-Audit/python-synthpop/blob/main/images/Header.png)
1+
![image](https://raw.githubusercontent.com/NGO-Algorithm-Audit/python-synthpop/b09d3fe93ac21406199810e39e2a844dc1faefd0/images/Header.png)
22

33
# python-synthpop
44

55
Python implementation of the R package [synthpop](https://cran.r-project.org/web/packages/synthpop/index.html).
66

7-
With this library synthetic tabular data can be produced. Synthetic data refers to artificially generated data that mimics real-world data in structure and statistical properties but does not directly originate from actual events or individuals. It supports processing numerical and categorical data using sequential modelling techniques. Artificial data are generated by drawing from conditional distributions fitted to the original data using parametric (e.g., Gaussian copula) or classification and regression trees (CART) models.
7+
```python-synthpop``` is an open-source library for synthetic data generation (SDG). The library includes robust implementations of Classification and Regression Trees (CART) and Gaussian Copula (GC) synthesizers, equipping users with an open-source python library to generate high-quality, privacy-preserving synthetic data.
88

9-
This Python library is a reimplementation of the R package `synthpop`. Synthetic data can be generated using the `.generate()` method after fitting the a synntesizer to the original data with the `.fit()` method. The process can be largely automated using default settings or customized through user-defined settings. Optional parameters can be used to influence the disclosure risk and the analytical quality of the synthetic data.
9+
Synthetic data is generated in six steps:
1010

11-
☁️ [Web app](https://local-first-bias-detection.s3.eu-central-1.amazonaws.com/synthetic-data.html) – a demo of synthetic data generation using `python-synthpop` through [WebAssembly](https://github.com/NGO-Algorithm-Audit/local-first-web-tool)
11+
1. **Detect data types**: detect numerical, categorial and/or datetime data;
12+
2. **Process missing data**: process missing data: remove or impute missing values;
13+
3. **Preprocessing**: transforms data into numerical format;
14+
4. **Synthesizer**: fit CART or GC;
15+
5. **Postprocessing**: map synthetic data back to its original structure and domain;
16+
6. **Evaluation metrics**: determine quality of synthetic data, e.g., similarity, utility and privacy metrics.
17+
18+
☁️ [Web app](https://algorithmaudit.eu/technical-tools/sdg/#web-app) – a demo of synthetic data generation using `python-synthpop` in the browser using [WebAssembly](https://github.com/NGO-Algorithm-Audit/local-first-web-tool).
1219

1320
# Installation
1421

@@ -29,200 +36,91 @@ python setup.py install
2936

3037
# Examples
3138

32-
#### Adult dataset
33-
We will use the US adult census dataset, which is a freely available open dataset extracted from the US census bureau database. The dataset is initially designed for a binary classification problem and the task is to predict whether a person earns over $50,000 a year. The dataset is a mixture of discrete and continuous features, including age, working status (workclass), education, marital status, race, sex, relationship and hours worked per week.
39+
#### Social Diagnosis 2011 dataset
40+
We will use the Social Diagnosis 2011 dataset as an example, which is a comprehensive survey conducted in Poland. This dataset includes a wide range of variables related to the social and economic conditions of Polish households and individuals. It covers aspects such as income, employment, education, health, and overall quality of life.
3441

3542
```
36-
In [1]: from datasets.adult import df
43+
In [1]: import pandas as pd
3744
38-
In [2]: df.head()
45+
In [2]: df = pd.read_csv('../datasets/socialdiagnosis/data/SocialDiagnosis2011.csv', sep=';')
46+
df.head()
3947
Out[2]:
40-
age workclass fnlwgt education educational-num marital-status occupation relationship race gender capital-gain capital-loss hours-per-week native-country income
41-
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
42-
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
43-
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
44-
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
45-
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K
48+
sex age marital income ls smoke
49+
0 FEMALE 57 MARRIED 800.0 PLEASED NO
50+
1 MALE 20 SINGLE 350.0 MOSTLY SATISFIED NO
51+
2 FEMALE 18 SINGLE NaN PLEASED NO
52+
3 FEMALE 78 WIDOWED 900.0 MIXED NO
53+
4 FEMALE 54 MARRIED 1500.0 MOSTLY SATISFIED YES
54+
4655
```
4756

4857
### python-synthpop
4958

50-
Use default parameters for the Adult dataset:
59+
Using default parameters the six steps are applied on the Social Diagnosis example tot generate synthetic data. See also [link](./example_notebooks/00_readme.ipynb).
5160

5261
```
53-
In [1]: from python-synthpop import Synthpop
54-
55-
In [2]: from datasets.adult import df, dtypes
56-
57-
In [3]: spop = Synthpop()
58-
59-
In [4]: spop.fit(df, dtypes)
60-
train_age
61-
train_workclass
62-
train_fnlwgt
63-
train_education
64-
train_educational-num
65-
train_marital-status
66-
train_occupation
67-
train_relationship
68-
train_race
69-
train_gender
70-
train_capital-gain
71-
train_capital-loss
72-
train_hours-per-week
73-
train_native-country
74-
train_income
75-
76-
In [5]: synth_df = spop.generate(len(df))
77-
generate_age
78-
generate_workclass
79-
generate_fnlwgt
80-
generate_education
81-
generate_educational-num
82-
generate_marital-status
83-
generate_occupation
84-
generate_relationship
85-
generate_race
86-
generate_gender
87-
generate_capital-gain
88-
generate_capital-loss
89-
generate_hours-per-week
90-
generate_native-country
91-
generate_income
92-
93-
In [6]: synth_df.head()
94-
Out[6]:
95-
age workclass fnlwgt education educational-num marital-status occupation relationship race gender capital-gain capital-loss hours-per-week native-country income
96-
0 21 ? 213055 11th 7 Never-married ? Not-in-family Other Female 0 0 30 United-States <=50K
97-
1 23 Private 150683 HS-grad 9 Never-married Adm-clerical Not-in-family White Female 0 0 40 United-States <=50K
98-
2 61 Private 191417 10th 6 Widowed Sales Not-in-family Black Female 0 0 32 United-States <=50K
99-
3 50 Private 190762 HS-grad 9 Divorced Sales Not-in-family White Male 0 0 60 United-States <=50K
100-
4 42 Local-gov 255675 HS-grad 9 Married-civ-spouse Other-service Husband Black Male 0 0 40 United-States <=50K
101-
102-
In [7]: spop.method
103-
Out[7]:
104-
age sample
105-
workclass cart
106-
fnlwgt cart
107-
education cart
108-
educational-num cart
109-
marital-status cart
110-
occupation cart
111-
relationship cart
112-
race cart
113-
gender cart
114-
capital-gain cart
115-
capital-loss cart
116-
hours-per-week cart
117-
native-country cart
118-
income cart
119-
dtype: object
120-
121-
In [8]: spop.visit_sequence
122-
Out[8]:
123-
age 0
124-
workclass 1
125-
fnlwgt 2
126-
education 3
127-
educational-num 4
128-
marital-status 5
129-
occupation 6
130-
relationship 7
131-
race 8
132-
gender 9
133-
capital-gain 10
134-
capital-loss 11
135-
hours-per-week 12
136-
native-country 13
137-
income 14
138-
dtype: int64
139-
140-
In [9]: spop.predictor_matrix
141-
Out[9]:
142-
age workclass fnlwgt education educational-num marital-status occupation relationship race gender capital-gain capital-loss hours-per-week native-country income
143-
age 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
144-
workclass 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
145-
fnlwgt 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
146-
education 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
147-
educational-num 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
148-
marital-status 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
149-
occupation 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
150-
relationship 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
151-
race 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
152-
gender 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0
153-
capital-gain 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0
154-
capital-loss 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
155-
hours-per-week 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0
156-
native-country 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0
157-
income 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
158-
```
62+
In [1]: from synthpop import MissingDataHandler, DataProcessor, CARTMethod
63+
64+
In [2]: # 1. Initiate metadata
65+
metadata = MissingDataHandler()
66+
67+
# 1.1 Detect data types
68+
column_dtypes = metadata.get_column_dtypes(df)
69+
print("Column Data Types:", column_dtypes)
70+
71+
Column Data Types: {'sex': 'categorical', 'age': 'numerical', 'marital': 'categorical', 'income': 'numerical', 'ls': 'categorical', 'smoke': 'categorical'}
72+
73+
In [3]: # 2. Missing data
74+
print(df.isnull().sum())
75+
76+
sex 0
77+
age 0
78+
marital 9
79+
income 683
80+
ls 8
81+
smoke 10
82+
dtype: int64
83+
84+
In [4]: # 2.1 Detect type of missingness
85+
missingness_dict = metadata.detect_missingness(df)
86+
print("Detected missingness yype:", missingness_dict)
87+
88+
Detected missingness type: {'marital': 'MAR', 'income': 'MAR', 'ls': 'MAR', 'smoke': 'MAR'}
89+
90+
91+
In [5]: # 2.2 Impute missing values
92+
df_imputed = metadata.apply_imputation(df, missingness_dict)
93+
94+
print(df_imputed.isnull().sum())
95+
96+
sex 0
97+
age 0
98+
marital 0
99+
income 0
100+
ls 0
101+
smoke 0
102+
dtype: int64
103+
104+
105+
In [6]: # 3. Instantiate the DataProcessor with column types
106+
processor = DataProcessor(column_dtypes)
107+
108+
# 3.1 Preprocess the data: transforms raw data into a numerical format
109+
processed_data = processor.preprocess(df)
110+
print("Processed Data:")
111+
display(processed_data.head())
112+
113+
Processed Data:
114+
sex age marital income ls smoke
115+
0 0 0.503625 3 -0.480608 4 0
116+
1 1 -1.495187 4 -0.834521 3 0
117+
2 0 -1.603231 4 NaN 4 0
118+
3 0 1.638086 5 -0.401961 1 0
119+
4 0 0.341559 3 0.069923 3 1
120+
121+
In [7]: # 4. Fit the CART method
122+
cart = CARTMethod(metadata, smoothing=True, proper=True, minibucket=5, random_state=42)
123+
cart.fit(processed_data)
159124
160-
### Define the visit sequence for the Adult dataset:
161125
162-
```
163-
In [1]: from python-synthpop import Synthpop
164-
165-
In [2]: from datasets.adult import df, dtypes
166-
167-
In [3]: spop = Synthpop(visit_sequence=[0, 1, 5, 3, 2])
168-
169-
In [4]: spop.fit(df, dtypes)
170-
train_age
171-
train_workclass
172-
train_marital-status
173-
train_education
174-
train_fnlwgt
175-
176-
In [5]: synth_df = spop.generate(len(df))
177-
generate_age
178-
generate_workclass
179-
generate_marital-status
180-
generate_education
181-
generate_fnlwgt
182-
183-
In [6]: synth_df.head()
184-
Out[6]:
185-
age workclass fnlwgt education marital-status
186-
0 57 Self-emp-not-inc 327901 Prof-school Married-civ-spouse
187-
1 24 Private 34568 Assoc-voc Never-married
188-
2 50 Private 256861 HS-grad Married-civ-spouse
189-
3 28 Private 186239 Some-college Never-married
190-
4 38 Private 216129 Bachelors Divorced
191-
192-
In [7]: spop.method
193-
Out[7]:
194-
age sample
195-
workclass cart
196-
fnlwgt cart
197-
education cart
198-
educational-num cart
199-
marital-status cart
200-
occupation cart
201-
relationship cart
202-
race cart
203-
gender cart
204-
capital-gain cart
205-
capital-loss cart
206-
hours-per-week cart
207-
native-country cart
208-
income cart
209-
dtype: object
210-
211-
In [8]: spop.visit_sequence
212-
Out[8]:
213-
age 0
214-
workclass 1
215-
fnlwgt 4
216-
education 3
217-
marital-status 2
218-
dtype: int64
219-
220-
In [9]: spop.predictor_matrix
221-
Out[9]:
222-
age workclass fnlwgt education marital-status
223-
age 0 0 0 0 0
224-
workclass 1 0 0 0 0
225-
fnlwgt 1 1 0 1 1
226-
education 1 1 0 0 1
227-
marital-status 1 1 0 0 0
228126
```

0 commit comments

Comments
 (0)