1 | | -# py-synthpop |
2 | | -Python library for R synthpop |
| 1 | +# Synthpop |
| 2 | + |
| 3 | +Python implementation of the R package synthpop. |
| 4 | + |
| 5 | +The R implementation of synthpop is a tool for producing synthetic versions of microdata containing confidential information so that they are safe to release to users for exploratory analysis. The key objective of generating synthetic data is to replace sensitive original values with synthetic ones while causing minimal distortion of the statistical information contained in the dataset. Variables, which can be categorical or continuous, are synthesised one by one using sequential modelling. Replacements are generated by drawing from conditional distributions fitted to the original data using parametric models or classification and regression tree (CART) models. |
| 6 | + |
| 7 | +This is a reimplementation in Python which allows synthetic data to be generated via the method .generate() after the algorithm has been fit to the original data via the method .fit(). The process can be largely automated using default settings, or customised with methods defined by the user. Optional parameters can be used to influence the disclosure risk and the analytical quality of the synthetic data. |
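The sequential idea described above can be illustrated with a small, self-contained sketch. It is not the library's implementation: the helper names `fit_sequential` and `generate_sequential` are hypothetical, and instead of a CART model each column is drawn from the observed values that share the same value of the immediately preceding column. The real method conditions on all previously synthesised variables.

```python
import random
from collections import defaultdict


def fit_sequential(rows, columns):
    """Fit one toy conditional model per column, in visit order.

    The first column keeps its observed values (marginal sampling). Each later
    column stores its observed values grouped by the value of the previous
    column, a crude stand-in for the leaves of a CART model.
    """
    models = {}
    for i, col in enumerate(columns):
        if i == 0:
            models[col] = [row[col] for row in rows]
        else:
            prev = columns[i - 1]
            groups = defaultdict(list)
            for row in rows:
                groups[row[prev]].append(row[col])
            models[col] = dict(groups)
    return models


def generate_sequential(models, columns, k, seed=0):
    """Draw k synthetic rows, one column at a time, in the same order."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(k):
        row = {}
        for i, col in enumerate(columns):
            if i == 0:
                # First variable: sample with replacement from observed values.
                row[col] = rng.choice(models[col])
            else:
                # Later variables: sample conditional on the value just drawn.
                row[col] = rng.choice(models[col][row[columns[i - 1]]])
        synthetic.append(row)
    return synthetic


# Toy data loosely based on the first rows of the Adult dataset.
original = [
    {"age": 39, "workclass": "State-gov", "education": "Bachelors"},
    {"age": 50, "workclass": "Self-emp-not-inc", "education": "Bachelors"},
    {"age": 38, "workclass": "Private", "education": "HS-grad"},
    {"age": 53, "workclass": "Private", "education": "11th"},
    {"age": 28, "workclass": "Private", "education": "Bachelors"},
]
columns = ["age", "workclass", "education"]
models = fit_sequential(original, columns)
synthetic = generate_sequential(models, columns, k=4)
```

The split into a fitting step and a generation step mirrors the .fit() and .generate() methods, but everything else here is a simplification chosen to keep the example short.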
| 8 | + |
| 9 | +Development status and roadmap |
| 10 | +This project is in Alpha status and the roadmap can be found here. |
| 11 | + |
| 12 | +# Installation |
| 13 | + |
| 14 | +Pip |
| 15 | + |
| 16 | +``` |
| 17 | +pip install py-synthpop |
| 18 | +``` |
| 19 | + |
| 20 | +Source |
| 21 | + |
| 22 | +``` |
| 23 | +git clone <url> |
| 24 | +cd synthpop |
| 25 | +pip install -r requirements.txt |
| 26 | +python setup.py install |
| 27 | +``` |
| 28 | + |
| 29 | +# Examples |
| 30 | + |
| 31 | +Adult dataset |
| 32 | +We will use the US Adult census dataset, a freely available open dataset extracted from the US Census Bureau database. The dataset was originally designed for a binary classification problem in which the task is to predict whether a person earns over $50,000 a year. It contains a mixture of discrete and continuous features, including age, working status (workclass), education, marital status, race, sex, relationship and hours worked per week. |
| 33 | + |
| 34 | +``` |
| 35 | +In [1]: from datasets.adult import df |
| 36 | + |
| 37 | +In [2]: df.head() |
| 38 | +Out[2]: |
| 39 | + age workclass fnlwgt education educational-num marital-status occupation relationship race gender capital-gain capital-loss hours-per-week native-country income |
| 40 | +0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K |
| 41 | +1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K |
| 42 | +2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K |
| 43 | +3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K |
| 44 | +4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K |
| 45 | +``` |
| 46 | + |
| 47 | +### synthpop |
| 48 | + |
| 49 | +Use default parameters for the Adult dataset: |
| 50 | + |
| 51 | +``` |
| 52 | +In [1]: from synthpop import Synthpop |
| 53 | + |
| 54 | +In [2]: from datasets.adult import df, dtypes |
| 55 | + |
| 56 | +In [3]: spop = Synthpop() |
| 57 | + |
| 58 | +In [4]: spop.fit(df, dtypes) |
| 59 | +train_age |
| 60 | +train_workclass |
| 61 | +train_fnlwgt |
| 62 | +train_education |
| 63 | +train_educational-num |
| 64 | +train_marital-status |
| 65 | +train_occupation |
| 66 | +train_relationship |
| 67 | +train_race |
| 68 | +train_gender |
| 69 | +train_capital-gain |
| 70 | +train_capital-loss |
| 71 | +train_hours-per-week |
| 72 | +train_native-country |
| 73 | +train_income |
| 74 | + |
| 75 | +In [5]: synth_df = spop.generate(len(df)) |
| 76 | +generate_age |
| 77 | +generate_workclass |
| 78 | +generate_fnlwgt |
| 79 | +generate_education |
| 80 | +generate_educational-num |
| 81 | +generate_marital-status |
| 82 | +generate_occupation |
| 83 | +generate_relationship |
| 84 | +generate_race |
| 85 | +generate_gender |
| 86 | +generate_capital-gain |
| 87 | +generate_capital-loss |
| 88 | +generate_hours-per-week |
| 89 | +generate_native-country |
| 90 | +generate_income |
| 91 | + |
| 92 | +In [6]: synth_df.head() |
| 93 | +Out[6]: |
| 94 | + age workclass fnlwgt education educational-num marital-status occupation relationship race gender capital-gain capital-loss hours-per-week native-country income |
| 95 | +0 21 ? 213055 11th 7 Never-married ? Not-in-family Other Female 0 0 30 United-States <=50K |
| 96 | +1 23 Private 150683 HS-grad 9 Never-married Adm-clerical Not-in-family White Female 0 0 40 United-States <=50K |
| 97 | +2 61 Private 191417 10th 6 Widowed Sales Not-in-family Black Female 0 0 32 United-States <=50K |
| 98 | +3 50 Private 190762 HS-grad 9 Divorced Sales Not-in-family White Male 0 0 60 United-States <=50K |
| 99 | +4 42 Local-gov 255675 HS-grad 9 Married-civ-spouse Other-service Husband Black Male 0 0 40 United-States <=50K |
| 100 | + |
| 101 | +In [7]: spop.method |
| 102 | +Out[7]: |
| 103 | +age sample |
| 104 | +workclass cart |
| 105 | +fnlwgt cart |
| 106 | +education cart |
| 107 | +educational-num cart |
| 108 | +marital-status cart |
| 109 | +occupation cart |
| 110 | +relationship cart |
| 111 | +race cart |
| 112 | +gender cart |
| 113 | +capital-gain cart |
| 114 | +capital-loss cart |
| 115 | +hours-per-week cart |
| 116 | +native-country cart |
| 117 | +income cart |
| 118 | +dtype: object |
| 119 | + |
| 120 | +In [8]: spop.visit_sequence |
| 121 | +Out[8]: |
| 122 | +age 0 |
| 123 | +workclass 1 |
| 124 | +fnlwgt 2 |
| 125 | +education 3 |
| 126 | +educational-num 4 |
| 127 | +marital-status 5 |
| 128 | +occupation 6 |
| 129 | +relationship 7 |
| 130 | +race 8 |
| 131 | +gender 9 |
| 132 | +capital-gain 10 |
| 133 | +capital-loss 11 |
| 134 | +hours-per-week 12 |
| 135 | +native-country 13 |
| 136 | +income 14 |
| 137 | +dtype: int64 |
| 138 | + |
| 139 | +In [9]: spop.predictor_matrix |
| 140 | +Out[9]: |
| 141 | + age workclass fnlwgt education educational-num marital-status occupation relationship race gender capital-gain capital-loss hours-per-week native-country income |
| 142 | +age 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 |
| 143 | +workclass 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 |
| 144 | +fnlwgt 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 |
| 145 | +education 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 |
| 146 | +educational-num 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 |
| 147 | +marital-status 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 |
| 148 | +occupation 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 |
| 149 | +relationship 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 |
| 150 | +race 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 |
| 151 | +gender 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 |
| 152 | +capital-gain 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 |
| 153 | +capital-loss 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 |
| 154 | +hours-per-week 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 |
| 155 | +native-country 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 |
| 156 | +income 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 |
| 157 | +``` |
| 158 | + |
| 159 | +### Define the visit sequence for the Adult dataset: |
| 160 | + |
| 161 | +``` |
| 162 | +In [1]: from synthpop import Synthpop |
| 163 | + |
| 164 | +In [2]: from datasets.adult import df, dtypes |
| 165 | + |
| 166 | +In [3]: spop = Synthpop(visit_sequence=[0, 1, 5, 3, 2]) |
| 167 | + |
| 168 | +In [4]: spop.fit(df, dtypes) |
| 169 | +train_age |
| 170 | +train_workclass |
| 171 | +train_marital-status |
| 172 | +train_education |
| 173 | +train_fnlwgt |
| 174 | + |
| 175 | +In [5]: synth_df = spop.generate(len(df)) |
| 176 | +generate_age |
| 177 | +generate_workclass |
| 178 | +generate_marital-status |
| 179 | +generate_education |
| 180 | +generate_fnlwgt |
| 181 | + |
| 182 | +In [6]: synth_df.head() |
| 183 | +Out[6]: |
| 184 | + age workclass fnlwgt education marital-status |
| 185 | +0 57 Self-emp-not-inc 327901 Prof-school Married-civ-spouse |
| 186 | +1 24 Private 34568 Assoc-voc Never-married |
| 187 | +2 50 Private 256861 HS-grad Married-civ-spouse |
| 188 | +3 28 Private 186239 Some-college Never-married |
| 189 | +4 38 Private 216129 Bachelors Divorced |
| 190 | + |
| 191 | +In [7]: spop.method |
| 192 | +Out[7]: |
| 193 | +age sample |
| 194 | +workclass cart |
| 195 | +fnlwgt cart |
| 196 | +education cart |
| 197 | +educational-num cart |
| 198 | +marital-status cart |
| 199 | +occupation cart |
| 200 | +relationship cart |
| 201 | +race cart |
| 202 | +gender cart |
| 203 | +capital-gain cart |
| 204 | +capital-loss cart |
| 205 | +hours-per-week cart |
| 206 | +native-country cart |
| 207 | +income cart |
| 208 | +dtype: object |
| 209 | + |
| 210 | +In [8]: spop.visit_sequence |
| 211 | +Out[8]: |
| 212 | +age 0 |
| 213 | +workclass 1 |
| 214 | +fnlwgt 4 |
| 215 | +education 3 |
| 216 | +marital-status 2 |
| 217 | +dtype: int64 |
| 218 | + |
| 219 | +In [9]: spop.predictor_matrix |
| 220 | +Out[9]: |
| 221 | + age workclass fnlwgt education marital-status |
| 222 | +age 0 0 0 0 0 |
| 223 | +workclass 1 0 0 0 0 |
| 224 | +fnlwgt 1 1 0 1 1 |
| 225 | +education 1 1 0 0 1 |
| 226 | +marital-status 1 1 0 0 0 |
| 227 | +``` |
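The relationship between the visit sequence and the predictor matrix in the output above can be reproduced with a short sketch. The function `build_predictor_matrix` is a hypothetical helper written for illustration, not part of the library's API. It assumes that each variable is predicted by every variable visited before it, which is consistent with the matrices printed in both examples.

```python
def build_predictor_matrix(columns, visit_sequence):
    """Return {target: {predictor: 0 or 1}} for the visited columns.

    visit_sequence holds column indices in the order they are synthesised.
    A predictor is marked 1 when it is visited before the target.
    """
    visited = [columns[i] for i in visit_sequence]
    # Rows and columns of the matrix keep the original column order.
    kept = [c for c in columns if c in visited]
    position = {c: p for p, c in enumerate(visited)}
    return {
        target: {pred: int(position[pred] < position[target]) for pred in kept}
        for target in kept
    }


adult_columns = [
    "age", "workclass", "fnlwgt", "education",
    "educational-num", "marital-status",
]
matrix = build_predictor_matrix(adult_columns, [0, 1, 5, 3, 2])
for target, row in matrix.items():
    print(target, list(row.values()))
```

With the visit sequence [0, 1, 5, 3, 2], fnlwgt is visited last and is therefore predicted by all four other variables, while marital-status is visited third and is predicted only by age and workclass.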
| 228 | + |
| 229 | +# License |
| 230 | + |
| 231 | +This project is being developed at Hazy Limited and is released under the MIT license. |