Commit 5fc9079

Add Readme for Restaurant Demo #313 Type Documentation (#324)
* Create s1.txt
* Add files via upload
* Create README.md
* Delete s1.txt
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* Fixing the broken link: an introduction to PyDP and the carrots demo link were broken. It was fixed here.
* fixing broken link: fixing broken link for Error Handling in PyDP
* update Changes raised on 01/22/2021
* update Changes raised on 01/22/2021
* Update README.md
* Update README.md
* syncing with original repo for restaurant demo
* changes as per review comment on 1/25/2021
* changes as review comment on 1/25/2021
* Update README.md
1 parent b1f3779 commit 5fc9079

File tree

6 files changed

+250
-1
lines changed

examples/README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -11,4 +11,4 @@ This is the first example using the PyDP, about animals using the library to agg
1111

1212
### Example
1313

14-
* [Error Handling in PyDP](Sample_code/error_handing.py)
14+
* [Error Handling in PyDP](Sample_code/error_handing.py)
Lines changed: 249 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,249 @@
1+
## Restaurant Demo
2+
3+
A PyDP version of [Google's Java Differential Privacy Library - Restaurant Example](https://github.com/google/differential-privacy/tree/master/examples/java).
4+
5+
Imagine a fictional restaurant owner named Alice who would like to share
6+
business statistics with her visitors. Alice knows when visitors enter the
7+
restaurant and how much time and money they spend there. To ensure that
8+
visitors' privacy is preserved, Alice decides to use the Differential Privacy
9+
library, in this case the PyDP library.
10+
11+
Alice wants to share this information with potential clients. The demo covers four scenarios:
12+
13+
14+
* <b>Count visits by hour of the day:</b> Count how many visitors enter the restaurant at every hour of a particular day.
15+
* <b>Count visits by day of the week:</b> Count how many visitors enter the restaurant each day in a week.
16+
* <b>Sum-up revenue per day of the week:</b> Calculate the sum of the restaurant revenue per weekday.
17+
* <b>Sum-up revenue per day of the week with preaggregation.</b>
18+
19+
20+
A notebook implementation of this demo is available [here](https://github.com/OpenMined/PyDP/blob/dev/examples/Tutorial_2-restaurant_demo/restaurant_demo.ipynb).
21+
22+
## To Run the Demo
23+
Install PyDP from PyPI using pip:
24+
25+
`pip install python-dp`
26+
or, if pip for Python 3.x is installed separately as pip3, use `pip3 install python-dp`.
27+
28+
Navigate to the `PyDP/examples/restaurant_demo` folder and run `python restaurant.py`.
29+
30+
The output displays private and non-private results for:
31+
* Count visits by hour of day
32+
* Count visits by day of week
33+
* Sum-up revenue per day of the week
34+
* Sum-up revenue per day of the week with preaggregation
35+
36+
The non-private count is the raw count, whereas the private count is the anonymized count generated with the PyDP library.
37+
38+
## Count visits by hour of day
39+
40+
In this example Alice wants to share this information with potential clients in
41+
order to let them know the busiest times in the restaurant. For this, you will
42+
count how many visitors enter the restaurant at every hour of a particular day.
43+
For simplicity, assume that a visitor comes to the restaurant at most once a
44+
day. In other words, a visitor is present at most once in the whole dataset.
45+
46+
Visit data for a single day is stored in the `day_data.csv` file. It includes
47+
the visitor’s ID, the visit duration (in minutes), and the money spent at the
48+
restaurant.
49+
50+
The image below illustrates the results. The orange (right) bars represent the
51+
counts without anonymization while blue (left) bars correspond to the private
52+
(or *anonymized*) counts. You can see that the private values slightly differ
53+
from the actual ones but the overall trend is preserved. For example, you can
54+
clearly see that the restaurant is busier during lunch and dinner time.
55+
56+
![Daily counts](img/counts_per_hour.png)
57+
58+
Note that Differential Privacy involves adding *random noise* to the actual
59+
data, so your results will most likely be slightly different.
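
To see why results vary between runs, here is a minimal sketch of the textbook Laplace mechanism for a count with sensitivity 1. It is not PyDP's implementation (the library has its own noise generation); the function names `laplace_noise` and `noisy_count` are made up for this illustration.

```python
import math
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> int:
    """Add Laplace noise with scale 1/epsilon (sensitivity 1), then round."""
    return round(true_count + laplace_noise(1.0 / epsilon, rng))


rng = random.Random()  # unseeded: every run prints different values
epsilon = math.log(3)  # the same value the demo stores in LN_3
print([noisy_count(100, epsilon, rng) for _ in range(5)])  # values scattered around 100
```

Each call draws fresh noise, so two runs over the same data give slightly different private counts while staying close to the true value.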
60+
61+
## Partitions and contributions
62+
63+
Let's say that the resulting aggregated data is split into *partitions*. The bar
64+
chart has 11 partitions for both the private and non-private counts, one for
65+
each entry hour.
66+
67+
More generally, a single partition represents a subset of aggregated data
68+
corresponding to a given value of the aggregation criterion. Graphically, a
69+
single partition is represented as a bar on the aggregated bar chart.
70+
71+
Now a visitor *contributes* to a given partition if their data matches
72+
the partition criterion. For example, if a visitor enters between 8 AM and 9 AM,
73+
they *contribute* to the *8 AM partition*.
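
As a rough illustration, mapping visits to hourly partitions could look like the sketch below. The `(visitor_id, entry_time)` row layout and the `hour_partition` helper are assumptions made for this example, not the layout of `day_data.csv`.

```python
from collections import Counter
from datetime import datetime


def hour_partition(entry_time: str) -> int:
    """Return the hour-of-day partition key for a 24-hour time such as '08:35'."""
    return datetime.strptime(entry_time, "%H:%M").hour


# Hypothetical (visitor_id, entry_time) rows.
visits = [(1, "08:05"), (2, "08:50"), (3, "12:15"), (4, "19:30")]

# Each visit contributes 1 to the partition of its entry hour.
counts = Counter(hour_partition(entry_time) for _, entry_time in visits)
print(sorted(counts.items()))  # [(8, 2), (12, 1), (19, 1)]
```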
74+
75+
Recall that in the example above, a visitor can enter the restaurant only
76+
once per day. This implies three *contribution bounds*:
77+
78+
* *Maximum partitions contributed*: to how many partitions can a visitor
79+
contribute? In our example, a visitor can contribute to at most one partition. In
80+
other words, there is at most one time-slot when a visitor with a given id
81+
can enter the restaurant.
82+
* *Maximum contributed value*: what is the maximum value that can be
83+
contributed by a visitor to a partition? In our example, you have to count the number
84+
of visits, so the maximum contributed value is simply *1*.
85+
* *Maximum contributions per partition*: how many times can a visitor
86+
contribute to a partition? In our example, a visitor can contribute to a
87+
partition at most once. In other words, a visitor can enter the restaurant
88+
only once at a given hour.
89+
90+
Why is this important? Differential Privacy adjusts the amount of noise to mask
91+
contributions of each visitor. More contributions require more noise.
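
The relationship between the three bounds and the noise can be sketched for the textbook Laplace mechanism, where the noise scale is the L1 sensitivity divided by epsilon. The helper `laplace_scale` is hypothetical, and the library's actual calibration may differ in detail.

```python
import math


def laplace_scale(max_partitions, max_contributions_per_partition, max_value, epsilon):
    """Noise scale of a textbook Laplace mechanism: L1 sensitivity / epsilon."""
    l1_sensitivity = max_partitions * max_contributions_per_partition * max_value
    return l1_sensitivity / epsilon


epsilon = math.log(3)
hourly = laplace_scale(1, 1, 1, epsilon)  # one partition, one visit, value 1
weekly = laplace_scale(7, 1, 1, epsilon)  # a visitor may appear on all 7 days
print(hourly, weekly)  # the weekly scale is 7 times the hourly one
```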
92+
93+
## Count visits by day of week
94+
95+
The previous example made some over-simplifying assumptions. Now, let’s have a
96+
look at the use-case where visitors can contribute to multiple partitions.
97+
98+
Imagine Alice decides to let visitors know which days are the busiest at her
99+
restaurant. For this, she calculates how many people visit the restaurant every
100+
day of the week. For simplicity, let’s assume a visitor enters the restaurant at
101+
most once a day but multiple times a week.
102+
103+
Visit data for a week is stored in the `week_data.csv` file.
104+
105+
The results are illustrated in the image below.
106+
107+
![Counts per week day](img/counts_per_day.png)
108+
109+
As you can see, the private values slightly differ from the actual ones but the
110+
overall trend is preserved.
111+
112+
Now, let’s take a closer look at the technical details. Speaking in terms of
113+
*partitions* and *contributions*, the resulting bar chart has 7 partitions: one
114+
for each day of the week. A visitor may enter the restaurant once a day and
115+
hence contribute to a partition at most once. A visitor may enter the restaurant
116+
several times a week and hence contribute to up to 7 partitions. The code below
117+
uses `Count` to calculate the differentially private count of visits for a
118+
single day.
119+
120+
```
import math

from pydp.algorithms.laplacian import Count

# Excerpt from a method: `self` and `epsilon` come from the enclosing scope.

# Number of days a visitor may contribute to is limited to 3. All exceeding
# visits will be discarded.
COUNT_MAX_CONTRIBUTED_DAYS = 3
LN_3 = math.log(3)
day_visits = bound_visits_per_week(self._day_visits, COUNT_MAX_CONTRIBUTED_DAYS)

# Construct DP Count. Use the default epsilon if none is given as an argument.
if not epsilon:
    x = Count(
        epsilon=self._epsilon,
        l0_sensitivity=COUNT_MAX_CONTRIBUTED_DAYS,
        dtype="int",
    )
else:
    x = Count(
        epsilon=epsilon, l0_sensitivity=COUNT_MAX_CONTRIBUTED_DAYS, dtype="int"
    )
```
138+
139+
### Bounding the number of contributed partitions
140+
141+
The parameter `COUNT_MAX_CONTRIBUTED_DAYS` defines the maximum number of
142+
partitions a visitor may contribute to. You might notice that the value of
143+
`COUNT_MAX_CONTRIBUTED_DAYS` in our example is 3 instead of 7. Why is that?
144+
Differential Privacy adds some amount of random noise to hide contributions of
145+
an individual. The more contributions an individual has, the larger the noise
146+
is. This affects the utility of the data. In order to preserve the data utility,
147+
you have to estimate how many times a week a person may visit a
148+
restaurant on average. This example assumes a value of around 3 instead of scaling
149+
the noise by a factor of 7.
150+
151+
The input data can be pre-processed with `bound_visits_per_week(self._day_visits, COUNT_MAX_CONTRIBUTED_DAYS)`, which discards all visits beyond the limit. Keep in mind that the library lets you specify the maximum number of contributions but does not validate that the input respects it.
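
Because the bound has to be enforced before the data reaches the library, a pre-processing step is needed. The sketch below shows one way to do it and is not the demo's `bound_visits_per_week`; the name `bound_visits_sketch` and the `(visitor_id, day)` row layout are assumptions for illustration.

```python
import random
from collections import defaultdict


def bound_visits_sketch(visits, max_days, seed=None):
    """Keep at most `max_days` visits per visitor.

    `visits` is a list of (visitor_id, day) tuples with at most one visit
    per visitor per day. Rows are shuffled first so the kept days are a
    random subset rather than always the earliest ones.
    """
    rng = random.Random(seed)
    shuffled = list(visits)
    rng.shuffle(shuffled)

    kept = []
    days_used = defaultdict(int)
    for visitor_id, day in shuffled:
        if days_used[visitor_id] < max_days:
            days_used[visitor_id] += 1
            kept.append((visitor_id, day))
    return kept


# Visitor 1 comes on five days, visitor 2 on two days.
week = [(1, d) for d in range(1, 6)] + [(2, 1), (2, 4)]
bounded = bound_visits_sketch(week, max_days=3, seed=0)
print(len(bounded))  # 5: three visits kept for visitor 1, two for visitor 2
```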
152+
153+
## Sum-up revenue per day of the week
154+
155+
The previous example demonstrates how the contributed partitions are bounded.
156+
Now, let's look at how individual contributions are clamped. Imagine Alice
157+
decides to calculate the sum of the restaurant revenue per week day in a
158+
differentially private way. For this, she needs to sum up the visitor's daily
159+
spending at the restaurant. For simplicity, let’s assume a visitor enters the
160+
restaurant at most once a day but multiple times a week.
161+
162+
Visit data for a week is stored in the `week_data.csv` file.
163+
164+
The results are illustrated in the image below.
165+
166+
![Daily sums](img/sums_per_day.png)
167+
168+
`BoundedSum` is used to calculate the differentially private sum of visitors' spending for a single day.
169+
170+
```
import math

from pydp.algorithms.laplacian import BoundedSum

# Excerpt from a method: `self` and `epsilon` come from the enclosing scope.

# Cap the number of visiting days at 4 per visitor (any visits above the cap are ignored)
SUM_MAX_CONTRIBUTED_DAYS = 4

# Expected minimum amount of money (in Euros) spent by a visitor in a single visit
MIN_EUROS_SPENT = 0

# Expected maximum amount of money (in Euros) spent by a visitor in a single visit
MAX_EUROS_SPENT_1 = 50
MAX_EUROS_SPENT_2 = 65

LN_3 = math.log(3)

# Use the default epsilon value if it is not given as an argument
if not epsilon:
    x = BoundedSum(
        self._epsilon,
        MIN_EUROS_SPENT,
        MAX_EUROS_SPENT_1,
        l0_sensitivity=SUM_MAX_CONTRIBUTED_DAYS,
    )
else:
    x = BoundedSum(
        epsilon,
        MIN_EUROS_SPENT,
        MAX_EUROS_SPENT_1,
        l0_sensitivity=SUM_MAX_CONTRIBUTED_DAYS,
    )

# Calculate DP result.
result = x.result()
```
202+
203+
## Sum-up revenue per day of the week with preaggregation
204+
205+
The usage of `SUM_MAX_CONTRIBUTED_DAYS` in `BoundedSum` is similar to its usage
206+
in `Count` (`COUNT_MAX_CONTRIBUTED_DAYS`), which is explained in the previous example. This section focuses on
207+
the *lower* and *upper* bounds. The parameters `MIN_EUROS_SPENT` and `MAX_EUROS_SPENT_1` of
208+
`BoundedSum` define the *contribution caps*. Every input value will be
209+
automatically clamped to the specified bounds. This is needed for calculating
210+
the sensitivity of the aggregation, and to scale the noise that will be added to
211+
the sum accordingly.
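
Clamping itself is simple. A minimal sketch with hypothetical spending values (the library does this internally, so the `clamp` helper below is only for illustration):

```python
def clamp(value: float, lower: float, upper: float) -> float:
    """Force `value` into the closed interval [lower, upper]."""
    return max(lower, min(value, upper))


# Hypothetical amounts spent in Euros; bounds match MIN_EUROS_SPENT and MAX_EUROS_SPENT_1.
spend = [12.5, 48.0, 73.0]
clamped = [clamp(value, 0.0, 50.0) for value in spend]
print(clamped)  # [12.5, 48.0, 50.0]
```

The 73 Euro visit counts as 50 Euros in the sum, so no single visit can move the result by more than the upper bound.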
212+
213+
**Choosing bounds**.
214+
215+
The min and max bounds affect the utility of the sum in two potentially
216+
opposing ways: reducing the added noise, and preserving the utility. On the one
217+
hand, the added noise is proportional to the maximum of the absolute values of
218+
the bounds. Thus, the closer the bounds are to zero, the less noise is added. On
219+
the other hand, setting the min and max bound close to zero may mean that
220+
the input values are clamped more aggressively, which can decrease utility as
221+
well.
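
The trade-off can be made concrete with a small sketch. It assumes, as stated above, a noise scale proportional to the largest absolute bound (here also multiplied by the number of contributed partitions and divided by epsilon, as in a textbook Laplace mechanism). The helpers and the spending values are hypothetical.

```python
import math


def noise_scale(lower, upper, max_partitions, epsilon):
    """Textbook Laplace scale for a bounded sum: partitions * max(|bounds|) / epsilon."""
    return max_partitions * max(abs(lower), abs(upper)) / epsilon


def clamping_loss(values, lower, upper):
    """Total amount removed or added by clamping `values` to [lower, upper]."""
    return sum(abs(v - max(lower, min(v, upper))) for v in values)


spend = [8.0, 15.0, 22.0, 35.0, 60.0, 90.0]  # hypothetical amounts in Euros
for upper in (20, 50, 100):
    scale = noise_scale(0, upper, max_partitions=4, epsilon=math.log(3))
    loss = clamping_loss(spend, 0, upper)
    print(f"upper={upper}: noise scale {scale:.1f}, clamping loss {loss:.1f}")
```

A tighter upper bound lowers the noise scale but raises the clamping loss, and a looser bound does the opposite.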
222+
223+
The `upper` bound of `BoundedSum` is set to 65 (`MAX_EUROS_SPENT_2`) to reflect the approximate
224+
maximum cumulative amount a visitor may spend on breakfast, lunch, and dinner.
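
One plausible reading of preaggregation is that all of a visitor's spending within a day is summed into a single contribution before it is clamped. The sketch below works under that assumption; the `preaggregate` helper and the `(visitor_id, day, euros)` row layout are hypothetical.

```python
from collections import defaultdict


def preaggregate(rows):
    """Sum the spending of each visitor per day into a single contribution."""
    totals = defaultdict(float)
    for visitor_id, day, euros in rows:
        totals[(visitor_id, day)] += euros
    return dict(totals)


# Visitor 1 has breakfast, lunch, and dinner on day 1; visitor 2 has one meal.
rows = [(1, 1, 10.0), (1, 1, 25.0), (1, 1, 40.0), (2, 1, 30.0)]
totals = preaggregate(rows)
print(totals)  # {(1, 1): 75.0, (2, 1): 30.0}

# Each daily total is then clamped to the upper bound of 65.
print([min(total, 65.0) for total in totals.values()])  # [65.0, 30.0]
```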
225+
226+
Visit data for a week is stored in the `week_data.csv` file.
227+
228+
The results are illustrated in the image below.
229+
230+
![Daily sums with preaggregation](img/sum_per_day_w_preaggregation.png)
231+
232+
```
MAX_EUROS_SPENT_2 = 65
...
if not epsilon:
    x = BoundedSum(
        self._epsilon,
        MIN_EUROS_SPENT,
        MAX_EUROS_SPENT_2,
        l0_sensitivity=SUM_MAX_CONTRIBUTED_DAYS,
    )
else:
    x = BoundedSum(
        epsilon,
        MIN_EUROS_SPENT,
        MAX_EUROS_SPENT_2,
        l0_sensitivity=SUM_MAX_CONTRIBUTED_DAYS,
    )
```