## Restaurant Demo

A PyDP version of [Google's Java Differential Privacy Library - Restaurant Example](https://github.com/google/differential-privacy/tree/master/examples/java).

Imagine a fictional restaurant owner named Alice who would like to share
business statistics with her visitors. Alice knows when visitors enter the
restaurant and how much time and money they spend there. To ensure that
visitors' privacy is preserved, Alice decides to use a differential privacy
library, in this case PyDP.

Alice wants to share this information with potential clients. The demo covers four scenarios:

* <b>Count visits by hour of the day:</b> Count how many visitors enter the restaurant at every hour of a particular day.
* <b>Count visits by day of the week:</b> Count how many visitors enter the restaurant on each day of the week.
* <b>Sum-up revenue per day of the week:</b> Calculate the sum of the restaurant revenue per weekday.
* <b>Sum-up revenue per day of the week with preaggregation.</b>

A notebook implementation of the same demo can be found [here](https://github.com/OpenMined/PyDP/blob/dev/examples/Tutorial_2-restaurant_demo/restaurant_demo.ipynb).

## To Run the Demo
Install PyDP from PyPI using pip:

`pip install python-dp`
or, if you have a separate `pip3` for Python 3.x, `pip3 install python-dp`

Navigate to the `PyDP/examples/restaurant_demo` folder and run `python restaurant.py`.

The output displays private and non-private results for:
* Count visits by hour of day
* Count visits by day of week
* Sum-up revenue per day of the week
* Sum-up revenue per day of the week with preaggregation

A non-private result is the raw value, whereas a private result is the anonymized value generated using the PyDP library.

## Count visits by hour of day

In this example, Alice wants to share this information with potential clients in
order to let them know the busiest times in the restaurant. For this, you will
count how many visitors enter the restaurant at every hour of a particular day.
For simplicity, assume that a visitor comes to the restaurant at most once a
day. In other words, a visitor is present at most once in the whole dataset.

Visit data for a single day is stored in the `day_data.csv` file. It includes
the visitor’s ID, the visit duration (in minutes), and the money spent at the
restaurant.

The image below illustrates the results. The orange (right) bars represent the
counts without anonymization, while the blue (left) bars correspond to the private
(or *anonymized*) counts. You can see that the private values differ slightly
from the actual ones but the overall trend is preserved. For example, you can
clearly see that the restaurant is busier during lunch and dinner time.

Note that Differential Privacy involves adding *random noise* to the actual
data, so your results will most likely be slightly different.
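
To make the effect of the random noise concrete, here is a minimal sketch in plain Python that does not use PyDP. It assumes a Laplace mechanism with a noise scale of `1 / epsilon`, which fits the one-visit-per-visitor assumption above. The entry hours and the helper names `laplace_noise` and `noisy_counts_by_hour` are made up for illustration, and PyDP's own noise generation may differ in its details.

```python
import random
from collections import Counter


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_counts_by_hour(entry_hours, epsilon, rng):
    """Return {hour: (raw_count, noisy_count)} for a list of entry hours."""
    raw = Counter(entry_hours)
    # Each visitor appears once, so one visitor changes one count by at most 1.
    scale = 1.0 / epsilon
    return {
        hour: (count, count + laplace_noise(scale, rng))
        for hour, count in sorted(raw.items())
    }


# Hypothetical entry hours (24-hour clock), one entry per visitor.
entry_hours = [9, 12, 12, 13, 13, 13, 19, 19, 20]
rng = random.Random()  # unseeded, so the private values differ between runs
for hour, (raw, noisy) in noisy_counts_by_hour(entry_hours, 1.0, rng).items():
    print(f"{hour:02d}:00  raw={raw}  private={noisy:.2f}")
```

Running the sketch twice prints the same raw counts but different private counts, which is why your results will not match the chart exactly.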

## Partitions and contributions

Let's say that the resulting aggregated data is split into *partitions*. The bar
charts for the private and non-private counts each have 11 partitions, one for
each entry hour.

More generally, a single partition represents a subset of aggregated data
corresponding to a given value of the aggregation criterion. Graphically, a
single partition is represented as a bar on the aggregated bar chart.

A visitor *contributes* to a given partition if their data matches
the partition criterion. For example, if a visitor enters between 8 AM and 9 AM,
they *contribute* to the *8 AM partition*.

Recall that in the example above, a visitor can enter the restaurant only
once per day. This implies three *contribution bounds*:

* *Maximum partitions contributed*: to how many partitions can a visitor
  contribute? In our example, a visitor can contribute to at most one partition. In
  other words, there is at most one time slot when a visitor with a given ID
  can enter the restaurant.
* *Maximum contributed value*: what is the maximum value that can be
  contributed by a visitor to a partition? In our example, you count the number
  of visits, so the maximum contributed value is simply *1*.
* *Maximum contributions per partition*: how many times can a visitor
  contribute to a partition? In our example, a visitor can contribute to a
  partition at most once. In other words, a visitor can enter the restaurant
  only once at a given hour.

Why is this important? Differential Privacy adjusts the amount of noise to mask
the contributions of each visitor. More contributions require more noise.
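
The relationship between contribution bounds and noise can be sketched as follows, assuming a Laplace mechanism whose scale is the L1 sensitivity divided by epsilon. The function `laplace_scale` is a hypothetical helper and not part of PyDP. It also assumes one contribution per partition, so the third bound is folded into `max_value`.

```python
def laplace_scale(max_partitions: int, max_value: float, epsilon: float) -> float:
    """Noise scale for a Laplace mechanism: L1 sensitivity divided by epsilon.

    One visitor can change at most `max_partitions` results, each by at most
    `max_value`, so the L1 sensitivity is their product.
    """
    return max_partitions * max_value / epsilon


# Hourly counts: one partition per visitor, each contribution worth 1.
print(laplace_scale(max_partitions=1, max_value=1, epsilon=1.0))  # 1.0
# If a visitor could appear in 7 partitions, the noise scale grows 7x.
print(laplace_scale(max_partitions=7, max_value=1, epsilon=1.0))  # 7.0
```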

## Count visits by day of week

The previous example made some over-simplifying assumptions. Now, let’s have a
look at the use case where visitors can contribute to multiple partitions.

Imagine Alice decides to let visitors know which days are the busiest at her
restaurant. For this, she calculates how many people visit the restaurant on each
day of the week. For simplicity, let’s assume a visitor enters the restaurant at
most once a day but possibly multiple times a week.

Visit data for a week is stored in the `week_data.csv` file.

The results are illustrated in the image below.

As you can see, the private values differ slightly from the actual ones but the
overall trend is preserved.

Now, let’s take a closer look at the technical details. Speaking in terms of
*partitions* and *contributions*, the resulting bar chart has 7 partitions: one
for each day of the week. A visitor may enter the restaurant once a day and
hence contribute to a partition at most once. A visitor may enter the restaurant
several times a week and hence contribute to up to 7 partitions. The code below
uses `Count` to calculate the differentially private count of visits for a
single day.

```
import math

from pydp.algorithms.laplacian import Count

# Number of days a visitor may contribute to is limited to 3. All exceeding
# visits will be discarded.
COUNT_MAX_CONTRIBUTED_DAYS = 3
LN_3 = math.log(3)

# Excerpt from a method of the demo class, which supplies `self` and `epsilon`.
day_visits = bound_visits_per_week(self._day_visits, COUNT_MAX_CONTRIBUTED_DAYS)

# Construct DP Count, using the default epsilon if none is given as an argument.
if not epsilon:
    x = Count(
        epsilon=self._epsilon,
        l0_sensitivity=COUNT_MAX_CONTRIBUTED_DAYS,
        dtype="int",
    )
else:
    x = Count(
        epsilon=epsilon, l0_sensitivity=COUNT_MAX_CONTRIBUTED_DAYS, dtype="int"
    )
```

### Bounding the number of contributed partitions

The parameter `COUNT_MAX_CONTRIBUTED_DAYS` defines the maximum number of
partitions a visitor may contribute to. You might notice that the value of
`COUNT_MAX_CONTRIBUTED_DAYS` in our example is 3 instead of 7. Why is that?
Differential Privacy adds some amount of random noise to hide the contributions of
an individual. The more contributions an individual has, the larger the noise
is. This affects the utility of the data. In order to preserve the data utility,
the example estimates how many times a week a person visits a
restaurant on average and assumes that the value is around 3, instead of scaling
the noise by a factor of 7.

The input data is pre-processed to discard all exceeding visits using `bound_visits_per_week(self._day_visits, COUNT_MAX_CONTRIBUTED_DAYS)`. It is important to keep in mind that the library allows you to specify the maximum number of contributions, but doesn't validate that it is respected.
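
As an illustration of what such pre-processing might look like, here is a plain-Python sketch. It is not the demo's `bound_visits_per_week` helper. The function name `bound_partitions_per_visitor`, the `(visitor_id, day)` tuple layout, and the choice to sample the kept days at random are all assumptions made for this example.

```python
import random
from collections import defaultdict


def bound_partitions_per_visitor(visits, max_days, rng=None):
    """Keep at most `max_days` visits per visitor, chosen at random.

    `visits` is a list of (visitor_id, day) pairs with at most one visit per
    visitor per day.
    """
    rng = rng or random.Random()
    by_visitor = defaultdict(list)
    for visitor_id, day in visits:
        by_visitor[visitor_id].append(day)

    bounded = []
    for visitor_id, days in by_visitor.items():
        # Sample rather than truncate so no weekday is systematically dropped.
        kept = days if len(days) <= max_days else rng.sample(days, max_days)
        bounded.extend((visitor_id, day) for day in kept)
    return bounded


# Visitor 1 comes every day of the week; visitor 2 comes on days 1 and 5.
visits = [(1, d) for d in range(1, 8)] + [(2, 1), (2, 5)]
bounded = bound_partitions_per_visitor(visits, max_days=3)
print(len(bounded))  # 5: three visits kept for visitor 1, two for visitor 2
```

Because the library does not check the bound itself, a step like this has to run before the data is passed to `Count`.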

## Sum-up revenue per day of the week

The previous example demonstrates how the contributed partitions are bounded.
This example demonstrates how individual contributions are clamped. Imagine Alice
decides to calculate the sum of the restaurant revenue per weekday in a
differentially private way. For this, she needs to sum up the visitors' daily
spending at the restaurant. For simplicity, let’s assume a visitor enters the
restaurant at most once a day but possibly multiple times a week.

Visit data for a week is stored in the `week_data.csv` file.

The results are illustrated in the image below.

`BoundedSum` is used to calculate the differentially private sum of the visitors' spending for a single day.

```
import math

from pydp.algorithms.laplacian import BoundedSum

# Cap the number of visiting days at 4 per visitor (any visits above that are not taken into account)
SUM_MAX_CONTRIBUTED_DAYS = 4

# Expected minimum amount of money (in Euros) to be spent by a visitor per a single visit
MIN_EUROS_SPENT = 0

# Expected maximum amount of money (in Euros) to be spent by a visitor per a single visit
MAX_EUROS_SPENT_1 = 50
MAX_EUROS_SPENT_2 = 65

LN_3 = math.log(3)

# Excerpt from a method of the demo class, which supplies `self` and `epsilon`.
# Use the default epsilon value if it is not given as an argument
if not epsilon:
    x = BoundedSum(
        epsilon=self._epsilon,
        lower_bound=MIN_EUROS_SPENT,
        upper_bound=MAX_EUROS_SPENT_1,
        l0_sensitivity=SUM_MAX_CONTRIBUTED_DAYS,
    )
else:
    x = BoundedSum(
        epsilon=epsilon,
        lower_bound=MIN_EUROS_SPENT,
        upper_bound=MAX_EUROS_SPENT_1,
        l0_sensitivity=SUM_MAX_CONTRIBUTED_DAYS,
    )
# Calculate DP result.
result = x.result()
```

### Sum-up revenue per day of the week with preaggregation

The usage of `SUM_MAX_CONTRIBUTED_DAYS` in `BoundedSum` is similar to the usage
of `COUNT_MAX_CONTRIBUTED_DAYS` in `Count`, which is explained in the previous example. This section focuses on
the *lower* and *upper* bounds. The parameters `MIN_EUROS_SPENT` and `MAX_EUROS_SPENT_1` of
`BoundedSum` define the *contribution caps*. Every input value is
automatically clamped to the specified bounds. This is needed for calculating
the sensitivity of the aggregation, and for scaling the noise that is added to
the sum accordingly.

**Choosing bounds**.

The min and max bounds affect the utility of the sum in two potentially
opposing ways: through the added noise and through clamping. On the one
hand, the added noise is proportional to the maximum of the absolute values of
the bounds. Thus, the closer the bounds are to zero, the less noise is added. On
the other hand, setting the min and max bounds close to zero may mean that
the input values are clamped more aggressively, which can decrease utility as
well.
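
The trade-off can be seen in a small plain-Python sketch. It assumes a Laplace mechanism whose scale is the number of contributed days times the largest absolute bound, divided by epsilon. The spend values and the helper `clamped_sum_and_scale` are hypothetical, and PyDP's internal computation may differ.

```python
def clamp(value, lower, upper):
    """Restrict `value` to the interval [lower, upper]."""
    return max(lower, min(upper, value))


def clamped_sum_and_scale(spends, lower, upper, max_days, epsilon):
    """Return (sum of clamped values, Laplace noise scale) for the given bounds."""
    total = sum(clamp(s, lower, upper) for s in spends)
    # A visitor affects at most `max_days` sums, each by at most the largest
    # absolute bound, so the noise scale grows with both.
    scale = max_days * max(abs(lower), abs(upper)) / epsilon
    return total, scale


spends = [12, 30, 48, 55, 90]  # hypothetical spend per visit, in Euros
for upper in (20, 50, 100):
    total, scale = clamped_sum_and_scale(spends, 0, upper, max_days=4, epsilon=1.0)
    print(f"upper={upper}: clamped sum={total} (true {sum(spends)}), noise scale={scale}")
```

A low upper bound keeps the noise small but cuts a large part of the true sum, while a high upper bound preserves the sum at the cost of more noise.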

With preaggregation, the upper bound of `BoundedSum` (`MAX_EUROS_SPENT_2`) is set to 65 to reflect the approximate
maximum cumulative amount a visitor may spend on breakfast, lunch, and dinner.

Visit data for a week is stored in the `week_data.csv` file.

The results are illustrated in the image below.

```
MAX_EUROS_SPENT_2 = 65
...
if not epsilon:
    x = BoundedSum(
        epsilon=self._epsilon,
        lower_bound=MIN_EUROS_SPENT,
        upper_bound=MAX_EUROS_SPENT_2,
        l0_sensitivity=SUM_MAX_CONTRIBUTED_DAYS,
    )
else:
    x = BoundedSum(
        epsilon=epsilon,
        lower_bound=MIN_EUROS_SPENT,
        upper_bound=MAX_EUROS_SPENT_2,
        l0_sensitivity=SUM_MAX_CONTRIBUTED_DAYS,
    )
```
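
The preaggregation step itself is not shown above. The following plain-Python sketch assumes that preaggregation means summing a visitor's spending across all of their visits on the same day, as the "cumulative amount" wording suggests. The function `preaggregate_daily_spend` and the sample visits are hypothetical.

```python
from collections import defaultdict


def preaggregate_daily_spend(visits):
    """Sum each visitor's spend per day so they contribute one value per day.

    `visits` is a list of (visitor_id, day, euros_spent) tuples.
    """
    totals = defaultdict(float)
    for visitor_id, day, euros in visits:
        totals[(visitor_id, day)] += euros
    return [(visitor_id, day, total) for (visitor_id, day), total in totals.items()]


# Hypothetical data: visitor 1 has breakfast, lunch, and dinner on day 1.
visits = [(1, 1, 8.0), (1, 1, 20.0), (1, 1, 45.0), (2, 1, 15.0)]
daily = preaggregate_daily_spend(visits)
print(daily)  # [(1, 1, 73.0), (2, 1, 15.0)]

# Clamping the daily totals to the upper bound used with preaggregation.
MAX_EUROS_SPENT_2 = 65
clamped = [min(total, MAX_EUROS_SPENT_2) for _, _, total in daily]
print(clamped)  # [65, 15.0]
```

After this step each visitor contributes a single value per day, which is why the upper bound has to cover a whole day's spending rather than a single visit.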