Commit a4b3f66

Merge pull request #1 from NGO-Algorithm-Audit/rl/get-ready-for-v1
Rl/get ready for v1
2 parents 56706a2 + 02fb065 commit a4b3f66

20 files changed: +38125 -9 lines changed

.github/workflows/publish.yml

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+name: Publish to PyPI
+
+on:
+  release:
+    types: [created]
+
+jobs:
+  deploy:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+        with:
+          fetch-depth: 0 # Required for setuptools_scm to determine version
+
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: "3.x"
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install build twine
+
+      - name: Build and test package
+        run: |
+          pip install -e .[dev]
+          pytest
+          python -m build
+
+      - name: Publish to PyPI
+        env:
+          TWINE_USERNAME: __token__
+          TWINE_PASSWORD: ${{ secrets.PYPI_API_TOKEN }}
+        run: |
+          twine check dist/*
+          twine upload dist/*

.github/workflows/test.yml

Lines changed: 26 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,26 @@
1+
name: Run Tests
2+
3+
on:
4+
pull_request:
5+
branches: [main]
6+
7+
jobs:
8+
test:
9+
runs-on: ubuntu-latest
10+
11+
steps:
12+
- uses: actions/checkout@v3
13+
14+
- name: Set up Python
15+
uses: actions/setup-python@v4
16+
with:
17+
python-version: "3.10"
18+
19+
- name: Install dependencies
20+
run: |
21+
python -m pip install --upgrade pip
22+
pip install -r requirements.txt
23+
24+
- name: Run tests
25+
run: |
26+
PYTHONPATH=$PYTHONPATH:$(pwd) pytest tests/
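
Note on the `Run tests` step: prefixing `pytest` with `PYTHONPATH=$PYTHONPATH:$(pwd)` makes the repository root importable, so the tests can import the package without installing it. A rough Python equivalent is sketched below. The helper name `add_to_sys_path` is hypothetical and not part of this commit, and `Path.cwd()` stands in for the repository root.

```python
import sys
from pathlib import Path


def add_to_sys_path(directory):
    """Put directory at the front of sys.path once; return True if it was added."""
    entry = str(Path(directory).resolve())
    if entry in sys.path:
        return False
    sys.path.insert(0, entry)
    return True


# In a conftest.py at the repository root, Path(__file__).parent would be
# used here; Path.cwd() stands in so the snippet runs on its own.
repo_root = Path.cwd()
add_to_sys_path(repo_root)
```

Installing the package in editable mode, as the publish workflow does with `pip install -e .[dev]`, would be another way to reach the same result.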

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@ venv.bak/
 dist/
 build/
 *.whl
+pyvenv.cfg
 
 # PyInstaller
 # Usually these files are written by a python script from a template

README.md

Lines changed: 231 additions & 2 deletions
@@ -1,2 +1,231 @@
-# py-synthpop
-Python library for R synthpop
+# Synthpop
+
+Python implementation of the R package synthpop.
+
+The R implementation of synthpop is a tool for producing synthetic versions of microdata containing confidential information so that they are safe to be released to users for exploratory analysis. The key objective of generating synthetic data is to replace sensitive original values with synthetic ones causing minimal distortion of the statistical information contained in the dataset. Variables, which can be categorical or continuous, are synthesised one-by-one using sequential modelling. Replacements are generated by drawing from conditional distributions fitted to the original data using parametric or classification and regression trees models.
+
+This is a reimplementation in Python which allows synthetic data to be generated via the method .generate() after the algorithm had been fit to the original data via the method .fit(). The process can be largely automated, if default settings are used, or with methods defined by the user. Optional parameters can be used to influence the disclosure risk and the analytical quality of the synthetic data.
+
+Development status and roadmap
+This project is in Alpha status and the roadmap can be found here.
+
+# Installation
+
+Pip
+
+```
+pip install py-synthpop
+```
+
+Source
+
+```
+git clone <url>
+cd synthpop
+pip install -r requirements.txt
+python setup.py install
+```
+
+# Examples
+
+Adult dataset
+We will use the US adult census dataset, which is a freely available open dataset extracted from the US census bureau database. The dataset is initially designed for a binary classification problem and the task is to predict whether a person earns over $50,000 a year. The dataset is a mixture of discrete and continuous features, including age, working status (workclass), education, marital status, race, sex, relationship and hours worked per week.
+
+```
+In [1]: from datasets.adult import df
+
+In [2]: df.head()
+Out[2]:
+   age         workclass  fnlwgt  education  educational-num      marital-status         occupation   relationship   race  gender  capital-gain  capital-loss  hours-per-week  native-country  income
+0   39         State-gov   77516  Bachelors               13       Never-married       Adm-clerical  Not-in-family  White    Male          2174             0              40   United-States   <=50K
+1   50  Self-emp-not-inc   83311  Bachelors               13  Married-civ-spouse    Exec-managerial        Husband  White    Male             0             0              13   United-States   <=50K
+2   38           Private  215646    HS-grad                9            Divorced  Handlers-cleaners  Not-in-family  White    Male             0             0              40   United-States   <=50K
+3   53           Private  234721       11th                7  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male             0             0              40   United-States   <=50K
+4   28           Private  338409  Bachelors               13  Married-civ-spouse     Prof-specialty           Wife  Black  Female             0             0              40            Cuba   <=50K
+```
+
+### synthpop
+
+Use default parameters for the Adult dataset:
+
+```
+In [1]: from synthpop import Synthpop
+
+In [2]: from datasets.adult import df, dtypes
+
+In [3]: spop = Synthpop()
+
+In [4]: spop.fit(df, dtypes)
+train_age
+train_workclass
+train_fnlwgt
+train_education
+train_educational-num
+train_marital-status
+train_occupation
+train_relationship
+train_race
+train_gender
+train_capital-gain
+train_capital-loss
+train_hours-per-week
+train_native-country
+train_income
+
+In [5]: synth_df = spop.generate(len(df))
+generate_age
+generate_workclass
+generate_fnlwgt
+generate_education
+generate_educational-num
+generate_marital-status
+generate_occupation
+generate_relationship
+generate_race
+generate_gender
+generate_capital-gain
+generate_capital-loss
+generate_hours-per-week
+generate_native-country
+generate_income
+
+In [6]: synth_df.head()
+Out[6]:
+   age  workclass  fnlwgt  education  educational-num      marital-status     occupation   relationship   race  gender  capital-gain  capital-loss  hours-per-week  native-country  income
+0   21          ?  213055       11th                7       Never-married              ?  Not-in-family  Other  Female             0             0              30   United-States   <=50K
+1   23    Private  150683    HS-grad                9       Never-married   Adm-clerical  Not-in-family  White  Female             0             0              40   United-States   <=50K
+2   61    Private  191417       10th                6             Widowed          Sales  Not-in-family  Black  Female             0             0              32   United-States   <=50K
+3   50    Private  190762    HS-grad                9            Divorced          Sales  Not-in-family  White    Male             0             0              60   United-States   <=50K
+4   42  Local-gov  255675    HS-grad                9  Married-civ-spouse  Other-service        Husband  Black    Male             0             0              40   United-States   <=50K
+
+In [7]: spop.method
+Out[7]:
+age                sample
+workclass            cart
+fnlwgt               cart
+education            cart
+educational-num      cart
+marital-status       cart
+occupation           cart
+relationship         cart
+race                 cart
+gender               cart
+capital-gain         cart
+capital-loss         cart
+hours-per-week       cart
+native-country       cart
+income               cart
+dtype: object
+
+In [8]: spop.visit_sequence
+Out[8]:
+age                 0
+workclass           1
+fnlwgt              2
+education           3
+educational-num     4
+marital-status      5
+occupation          6
+relationship        7
+race                8
+gender              9
+capital-gain       10
+capital-loss       11
+hours-per-week     12
+native-country     13
+income             14
+dtype: int64
+
+In [9]: spop.predictor_matrix
+Out[9]:
+                 age  workclass  fnlwgt  education  educational-num  marital-status  occupation  relationship  race  gender  capital-gain  capital-loss  hours-per-week  native-country  income
+age                0          0       0          0                0               0           0             0     0       0             0             0               0               0       0
+workclass          1          0       0          0                0               0           0             0     0       0             0             0               0               0       0
+fnlwgt             1          1       0          0                0               0           0             0     0       0             0             0               0               0       0
+education          1          1       1          0                0               0           0             0     0       0             0             0               0               0       0
+educational-num    1          1       1          1                0               0           0             0     0       0             0             0               0               0       0
+marital-status     1          1       1          1                1               0           0             0     0       0             0             0               0               0       0
+occupation         1          1       1          1                1               1           0             0     0       0             0             0               0               0       0
+relationship       1          1       1          1                1               1           1             0     0       0             0             0               0               0       0
+race               1          1       1          1                1               1           1             1     0       0             0             0               0               0       0
+gender             1          1       1          1                1               1           1             1     1       0             0             0               0               0       0
+capital-gain       1          1       1          1                1               1           1             1     1       1             0             0               0               0       0
+capital-loss       1          1       1          1                1               1           1             1     1       1             1             0               0               0       0
+hours-per-week     1          1       1          1                1               1           1             1     1       1             1             1               0               0       0
+native-country     1          1       1          1                1               1           1             1     1       1             1             1               1               0       0
+income             1          1       1          1                1               1           1             1     1       1             1             1               1               1       0
+```
+
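
Note on the README text above: it describes variables being synthesised one by one, each drawn conditionally on the variables visited before it. The sketch below illustrates that idea with the standard library only. It is not the package's implementation: `fit` and `generate` are hypothetical free functions, and a lookup table of observed values stands in for the CART models.

```python
import random
from collections import defaultdict


def fit(rows, visit_sequence):
    """Learn one sampler per column, in visit order.

    Each column keeps its observed values grouped by the values of the
    columns visited before it. The first column has no predictors, so it
    is resampled from all observed values (the "sample" method). The
    lookup table is only a stand-in for a CART model.
    """
    models = []
    for position, column in enumerate(visit_sequence):
        predictors = visit_sequence[:position]
        table = defaultdict(list)
        for row in rows:
            key = tuple(row[p] for p in predictors)
            table[key].append(row[column])
        models.append((column, predictors, dict(table)))
    return models


def generate(models, n, seed=None):
    """Draw n synthetic rows, filling one column at a time."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n):
        row = {}
        for column, predictors, table in models:
            key = tuple(row[p] for p in predictors)
            row[column] = rng.choice(table[key])
        synthetic.append(row)
    return synthetic


# Toy input: three columns taken from the first rows of df.head() in the README.
rows = [
    {"age": 39, "workclass": "State-gov", "income": "<=50K"},
    {"age": 50, "workclass": "Self-emp-not-inc", "income": "<=50K"},
    {"age": 38, "workclass": "Private", "income": "<=50K"},
    {"age": 53, "workclass": "Private", "income": "<=50K"},
]
models = fit(rows, ["age", "workclass", "income"])
synthetic = generate(models, n=5, seed=0)
```

Because this stand-in conditions on exact matches of all earlier columns, every synthetic row replays an observed combination. A tree model pools records that fall in the same leaf, so it can produce combinations that never occurred together in the original data.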
+### Define the visit sequence for the Adult dataset:
+
+```
+In [1]: from synthpop import Synthpop
+
+In [2]: from datasets.adult import df, dtypes
+
+In [3]: spop = Synthpop(visit_sequence=[0, 1, 5, 3, 2])
+
+In [4]: spop.fit(df, dtypes)
+train_age
+train_workclass
+train_marital-status
+train_education
+train_fnlwgt
+
+In [5]: synth_df = spop.generate(len(df))
+generate_age
+generate_workclass
+generate_marital-status
+generate_education
+generate_fnlwgt
+
+In [6]: synth_df.head()
+Out[6]:
+   age         workclass  fnlwgt     education      marital-status
+0   57  Self-emp-not-inc  327901   Prof-school  Married-civ-spouse
+1   24           Private   34568     Assoc-voc       Never-married
+2   50           Private  256861       HS-grad  Married-civ-spouse
+3   28           Private  186239  Some-college       Never-married
+4   38           Private  216129     Bachelors            Divorced
+
+In [7]: spop.method
+Out[7]:
+age                sample
+workclass            cart
+fnlwgt               cart
+education            cart
+educational-num      cart
+marital-status       cart
+occupation           cart
+relationship         cart
+race                 cart
+gender               cart
+capital-gain         cart
+capital-loss         cart
+hours-per-week       cart
+native-country       cart
+income               cart
+dtype: object
+
+In [8]: spop.visit_sequence
+Out[8]:
+age               0
+workclass         1
+fnlwgt            4
+education         3
+marital-status    2
+dtype: int64
+
+In [9]: spop.predictor_matrix
+Out[9]:
+                age  workclass  fnlwgt  education  marital-status
+age               0          0       0          0               0
+workclass         1          0       0          0               0
+fnlwgt            1          1       0          1               1
+education         1          1       0          0               1
+marital-status    1          1       0          0               0
+```
+
+# License
+
+This project is being developed at Hazy Limited and is released under MIT license.
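
Note on the visit-sequence example in README.md: the printed `predictor_matrix` is consistent with a simple rule, namely that a column may use as predictors only the columns visited before it. The sketch below reproduces the README's 5x5 matrix from `visit_sequence=[0, 1, 5, 3, 2]` under that assumed rule. The function `predictor_matrix` here is a hypothetical stand-alone helper inferred from the printed output, not code from the package.

```python
def predictor_matrix(columns, visit_sequence):
    """Return {target: {predictor: 0 or 1}} for the visited columns.

    Assumed rule: a predictor is enabled (1) only if it is visited
    earlier in the sequence than the target column.
    """
    visited = [columns[i] for i in visit_sequence]
    order = {name: position for position, name in enumerate(visited)}
    # Rows and columns keep the original column order, as in the README output.
    kept = [name for name in columns if name in order]
    return {
        target: {
            predictor: int(order[predictor] < order[target])
            for predictor in kept
        }
        for target in kept
    }


# First six Adult columns, in the order shown by df.head() in the README.
columns = [
    "age", "workclass", "fnlwgt", "education",
    "educational-num", "marital-status",
]
matrix = predictor_matrix(columns, visit_sequence=[0, 1, 5, 3, 2])

for target, predictors in matrix.items():
    print(target, list(predictors.values()))
```

Under this rule `fnlwgt`, visited last, depends on all four other visited columns, while `educational-num` (index 4) is never visited and drops out of the matrix.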

datasets/__init__.py

Whitespace-only changes.

datasets/adult/__init__.py

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+import pandas as pd
+from pathlib import Path
+import json
+
+
+data_folder = Path(__file__).resolve().parent / "data"
+dtypes_path = (data_folder / "dtypes.json")
+csv_path = str(data_folder / "adult.data")
+
+
+with dtypes_path.open('r') as f:
+    dtypes = json.load(f)
+columns = list(dtypes.keys())
+df = pd.read_csv(csv_path, header=0).astype(dtypes)
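
Note on `datasets/adult/__init__.py`: the module takes both the column list and the column types from `dtypes.json`, so the key order of that file defines the column order. The pandas-free sketch below shows the same pattern on inline data. The JSON content, the CSV rows, and the `CASTS` table are hypothetical stand-ins, since neither `dtypes.json` nor `adult.data` appears in this part of the diff.

```python
import csv
import io
import json

# Hypothetical stand-in for data/dtypes.json (the real file is not shown here).
DTYPES_JSON = '{"age": "int", "workclass": "category", "fnlwgt": "int"}'
# Hypothetical stand-in for the start of data/adult.data, with a header row.
CSV_TEXT = "age,workclass,fnlwgt\n39,State-gov,77516\n50,Self-emp-not-inc,83311\n"

# Python-side casts for the dtype names used above (an assumption of this sketch).
CASTS = {"int": int, "float": float, "category": str}

dtypes = json.loads(DTYPES_JSON)
columns = list(dtypes)  # json keeps key order, so this is the column order

reader = csv.DictReader(io.StringIO(CSV_TEXT))
records = [
    {name: CASTS[dtypes[name]](row[name]) for name in columns}
    for row in reader
]
```

In the committed module, `DataFrame.astype(dtypes)` plays the role of the `CASTS` lookup and applies one dtype per column name.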
