Commit e00e898

Merge branch 'main' of https://github.com/NGO-Algorithm-Audit/py-synthpop into rl/add-types-inference
2 parents 5385269 + 570e79c commit e00e898

4 files changed: +19 -22 lines changed

README.md

Lines changed: 17 additions & 20 deletions
@@ -1,34 +1,35 @@
1-
# Synthpop
1+
![image](https://github.com/NGO-Algorithm-Audit/python-synthpop/blob/main/images/Header.png)
22

3-
Python implementation of the R package synthpop.
3+
# python-synthpop
44

5-
The R implementation of synthpop is a tool for producing synthetic versions of microdata containing confidential information so that they are safe to be released to users for exploratory analysis. The key objective of generating synthetic data is to replace sensitive original values with synthetic ones causing minimal distortion of the statistical information contained in the dataset. Variables, which can be categorical or continuous, are synthesised one-by-one using sequential modelling. Replacements are generated by drawing from conditional distributions fitted to the original data using parametric or classification and regression trees models.
5+
Python implementation of the R package [synthpop](https://cran.r-project.org/web/packages/synthpop/index.html).
66

7-
This is a reimplementation in Python which allows synthetic data to be generated via the method .generate() after the algorithm had been fit to the original data via the method .fit(). The process can be largely automated, if default settings are used, or with methods defined by the user. Optional parameters can be used to influence the disclosure risk and the analytical quality of the synthetic data.
7+
With this library synthetic tabular data can be produced. Synthetic data refers to artificially generated data that mimics real-world data in structure and statistical properties but does not directly originate from actual events or individuals. It supports processing numerical and categorical data using sequential modelling techniques. Artificial data are generated by drawing from conditional distributions fitted to the original data using parametric (e.g., Gaussian copula) or classification and regression trees (CART) models.
88

9-
Development status and roadmap
10-
This project is in Alpha status and the roadmap can be found here.
9+
This Python library is a reimplementation of the R package `synthpop`. Synthetic data can be generated using the `.generate()` method after fitting a synthesizer to the original data with the `.fit()` method. The process can be largely automated using default settings or customized through user-defined settings. Optional parameters can be used to influence the disclosure risk and the analytical quality of the synthetic data.
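The fit-then-generate flow and the idea of drawing each variable from a conditional distribution can be illustrated with a small standard-library sketch. This is not the `python-synthpop` API: the class `ToySequentialSynthesizer`, its arguments, and the sample rows are all hypothetical. In place of CART or parametric models it conditions each column only on the previous column in the visit sequence and resamples observed values.

```python
import random
from collections import defaultdict


class ToySequentialSynthesizer:
    """Hypothetical illustration of sequential synthesis (not the library's API)."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def fit(self, rows, columns):
        # rows: list of dicts; columns: the visit sequence to synthesise in order
        self.columns = list(columns)
        first = self.columns[0]
        # The first column is later drawn from its observed marginal distribution
        self.marginal = [row[first] for row in rows]
        # Each later column stores observed values grouped by the previous column
        self.conditional = {}
        for prev, col in zip(self.columns, self.columns[1:]):
            groups = defaultdict(list)
            for row in rows:
                groups[row[prev]].append(row[col])
            self.conditional[col] = dict(groups)
        return self

    def generate(self, k):
        synthetic = []
        for _ in range(k):
            record = {self.columns[0]: self.rng.choice(self.marginal)}
            for prev, col in zip(self.columns, self.columns[1:]):
                # Draw from the values seen alongside the already-synthesised value
                donors = self.conditional[col][record[prev]]
                record[col] = self.rng.choice(donors)
            synthetic.append(record)
        return synthetic


# Made-up rows for illustration only
original = [
    {"sex": "Female", "income": "<=50K"},
    {"sex": "Female", "income": "<=50K"},
    {"sex": "Male", "income": ">50K"},
    {"sex": "Male", "income": ">50K"},
]
synth = ToySequentialSynthesizer(seed=1).fit(original, ["sex", "income"]).generate(5)
for record in synth:
    print(record)
```

Because every synthetic value is resampled from observed values, this toy version carries obvious disclosure risk; the library's CART and parametric methods, and its optional parameters, exist to manage that trade-off.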
10+
11+
☁️ [Web app](https://local-first-bias-detection.s3.eu-central-1.amazonaws.com/synthetic-data.html) – a demo of synthetic data generation using `python-synthpop` through [WebAssembly](https://github.com/NGO-Algorithm-Audit/local-first-web-tool)
1112

1213
# Installation
1314

14-
Pip
15+
#### Pip
1516

1617
```
17-
pip install py-synthpop
18+
pip install python-synthpop
1819
```
1920

20-
Source
21+
#### Source
2122

2223
```
23-
git clone <url>
24-
cd synthpop
24+
git clone https://github.com/NGO-Algorithm-Audit/python-synthpop.git
25+
cd python-synthpop
2526
pip install -r requirements.txt
2627
python setup.py install
2728
```
2829

2930
# Examples
3031

31-
Adult dataset
32+
#### Adult dataset
3233
We will use the US adult census dataset, which is a freely available open dataset extracted from the US census bureau database. The dataset is initially designed for a binary classification problem and the task is to predict whether a person earns over $50,000 a year. The dataset is a mixture of discrete and continuous features, including age, working status (workclass), education, marital status, race, sex, relationship and hours worked per week.
3334

3435
```
@@ -44,12 +45,12 @@ Out[2]:
4445
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K
4546
```
4647

47-
### synthpop
48+
### python-synthpop
4849

4950
Use default parameters for the Adult dataset:
5051

5152
```
52-
In [1]: from synthpop import Synthpop
53+
In [1]: from python-synthpop import Synthpop
5354
5455
In [2]: from datasets.adult import df, dtypes
5556
@@ -159,7 +160,7 @@ income 1 1 1 1 1
159160
### Define the visit sequence for the Adult dataset:
160161

161162
```
162-
In [1]: from synthpop import Synthpop
163+
In [1]: from python-synthpop import Synthpop
163164
164165
In [2]: from datasets.adult import df, dtypes
165166
@@ -224,8 +225,4 @@ workclass 1 0 0 0 0
224225
fnlwgt 1 1 0 1 1
225226
education 1 1 0 0 1
226227
marital-status 1 1 0 0 0
227-
```
228-
229-
# License
230-
231-
This project is being developed at Hazy Limited and is released under MIT license.
228+
```
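A note on the import lines changed in the hunks above: a hyphen is not valid in a module name inside a Python `import` statement, so `from python-synthpop import Synthpop` would fail to parse. The following minimal check uses only the built-in `compile()`, which parses a statement without executing it. It assumes the importable package keeps the name `synthpop` shown on the removed lines; the pip distribution name can still be `python-synthpop`. The helper `is_valid_statement` is a hypothetical name for illustration.

```python
def is_valid_statement(statement):
    """Return True if the statement parses as Python; nothing is imported or run."""
    try:
        compile(statement, "<readme-example>", "exec")
        return True
    except SyntaxError:
        return False


hyphenated_ok = is_valid_statement("from python-synthpop import Synthpop")
plain_ok = is_valid_statement("from synthpop import Synthpop")
print("hyphenated import parses:", hyphenated_ok)
print("plain import parses:", plain_ok)
```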

images/Header.png

98.8 KB

requirements.txt

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
11
numpy>=1.20.0
22
pandas>=1.3.0
33
scikit-learn>=1.0.0
4-
pytest>=7.0.0
4+
pytest>=7.0.0

setup.py

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@
55

66
setup(
77
name="python-synthpop",
8-
version="0.1.0",
8+
version="0.0.1",
99
author="Algorithm Audit",
1010
description="Python implementation of the R package synthpop for generating synthetic data",
1111
long_description=long_description,
