Commit 84c33e0

Merge pull request #426 from diffix/edon/docs
Document count_histogram
2 parents ded1940 + 4631205 commit 84c33e0

1 file changed (+60 -2 lines changed)

docs/analyst_guide.md

Lines changed: 60 additions & 2 deletions
@@ -19,6 +19,7 @@ mechanisms that Diffix uses to protect personal data.
 - [Suppress bin](#suppress-bin)
 - [Supported functions](#supported-functions)
 - [Aggregates](#aggregates)
+- [diffix.count_histogram(aid, bin_size)](#diffixcount_histogramaid-bin_size)
 - [Numeric generalization functions](#numeric-generalization-functions)
 - [diffix.floor_by(col, K)](#diffixfloor_bycol-k)
 - [diffix.round_by(col, K)](#diffixround_bycol-k)
@@ -29,6 +30,7 @@ mechanisms that Diffix uses to protect personal data.
 - [Type casts](#type-casts)
 - [Utility functions](#utility-functions)
 - [diffix.is_suppress_bin(*)](#diffixis_suppress_bin)
+- [diffix.unnest_histogram(histogram)](#diffixunnest_histogramhistogram)
 
 # Access levels
 
@@ -197,11 +199,48 @@ The following versions of aggregates are supported:
 - `count(distinct col)` - counts distinct values of the given column.
 - `sum(col)` - sums values in the given column.
 - `avg(col)` - calculates the average of the given column.
+- `diffix.count_histogram(aid, bin_size=1)` - computes a histogram that describes the distribution of rows among entities.
+See [below](#diffixcount_histogramaid-bin_size) for details.
 
 Results of these aggregates are anonymized by applying noise as described in the specification.
 
-Each of the `count(...)`, `sum(...)`, `avg(...)` has an accompanying aggregate, which returns the approximate magnitude of noise added during anonymization (in terms of its standard deviation).
-These are: `diffix.count_noise(...)`, `diffix.sum_noise(...)`, `diffix.avg_noise(...)` respectively.
+Each of the `count(...)`, `sum(...)`, `avg(...)` has an accompanying aggregate,
+which returns the approximate magnitude of noise added during anonymization (in terms of its standard deviation).
+These are: `diffix.count_noise(...)`, `diffix.sum_noise(...)`, `diffix.avg_noise(...)`, respectively.
+
+### diffix.count_histogram(aid, bin_size)
+
+Returns a 2-dimensional array of shape `bigint[][2]`, where each entry is a pair of `[row_count, num_entities]`.
+The `row_count` represents the number of rows contributed by `num_entities` distinct protected entities.
+
+**Example:**
+
+```
+SELECT diffix.count_histogram(account)
+FROM transactions;
+
+count_histogram
+--------------------------------
+{{NULL,7},{1,15},{2,13},{4,6}}
+(1 row)
+```
+
+The result of the above query can be interpreted as:
+15 accounts have made a single transaction (a `row_count` of 1), 13 accounts have made 2 transactions (a `row_count` of 2),
+6 accounts have made 4 transactions, and 7 accounts have made some other number of transactions (identified by the `NULL` count).
+
+The reported `num_entities` is a noisy value; the `row_count` itself is not. Bins with insufficient `num_entities` are merged into
+a suppress bin of shape `{NULL, num_entities}`, where `num_entities` is also noisy. The suppress bin may itself be suppressed.
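
As a rough client-side illustration of the interpretation above, the following Python sketch splits a histogram into its visible bins and its suppress bin. The helper name `summarize_histogram` is hypothetical and not part of Diffix, and the input is assumed to have already been converted into a list of `[row_count, num_entities]` pairs with `None` standing in for `NULL`. Because `num_entities` is noisy, the totals are estimates rather than exact figures.

```python
# Hypothetical client-side helper; it is not part of the Diffix extension.
def summarize_histogram(histogram):
    """Total up [row_count, num_entities] pairs, keeping the suppress bin apart."""
    suppressed_entities = 0
    visible_entities = 0
    visible_rows = 0
    for row_count, num_entities in histogram:
        if row_count is None:
            # Suppress bin: the row count of these entities is not reported.
            suppressed_entities += num_entities
        else:
            visible_entities += num_entities
            visible_rows += row_count * num_entities
    return {
        "visible_entities": visible_entities,
        "visible_rows": visible_rows,
        "suppressed_entities": suppressed_entities,
    }


# The histogram from the example query above.
example = [[None, 7], [1, 15], [2, 13], [4, 6]]
summary = summarize_histogram(example)
print(summary)
```

For the example histogram this accounts for 34 entities and 65 rows in the visible bins, with 7 entities left in the suppress bin.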
+
+The optional `bin_size` parameter allows generalizing the bins' `row_count` to minimize suppression.
+It acts identically to the `diffix.floor_by()` function.
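
To make the effect of `bin_size` concrete, here is a minimal Python sketch of the binning step only. It assumes that flooring means rounding down to the nearest multiple of `bin_size`; the function names and the sample data are hypothetical, and the noise and suppression that Diffix applies to `num_entities` are deliberately left out.

```python
import math
from collections import Counter


def floor_by(value, k):
    """Round value down to the nearest multiple of k (assumed floor_by semantics)."""
    return math.floor(value / k) * k


def raw_count_histogram(rows_per_entity, bin_size=1):
    """Count entities per generalized row count, without noise or suppression."""
    bins = Counter(floor_by(count, bin_size) for count in rows_per_entity.values())
    return sorted([row_count, num_entities] for row_count, num_entities in bins.items())


# Hypothetical per-account transaction counts.
rows_per_account = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 7}

print(raw_count_histogram(rows_per_account))              # one bin per distinct count
print(raw_count_histogram(rows_per_account, bin_size=5))  # fewer, wider bins
```

With `bin_size=5` the six distinct row counts collapse into two bins, which is why a larger `bin_size` leaves fewer bins below the suppression threshold.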
+
+The histogram array can be unwrapped to a set of pairs by using [diffix.unnest_histogram()](#diffixunnest_histogramhistogram).
+
+**Restrictions:** The `aid` parameter must be a reference to a column tagged as an AID (identifier of a protected entity).
+
+In untrusted mode, `bin_size` is restricted to a money-style number:
+1, 2, or 5 preceded by or followed by zeros ⟨... 0.1, 0.2, 0.5, 1, 2, 5, 10, ...⟩.
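
The money-style rule can be read as "a positive number whose only non-zero digit is 1, 2, or 5". The Python sketch below checks a candidate `bin_size` under that reading. The function `is_money_style` is a hypothetical helper for validating values before sending a query, and the extension's own check may differ in detail (for example, in how it treats floating-point rounding).

```python
from decimal import Decimal


def is_money_style(value):
    """Return True if value is 1, 2, or 5 times a power of ten (assumed reading of the rule)."""
    number = Decimal(str(value))
    if number <= 0:
        return False
    # normalize() strips trailing zeros, so 200 becomes 2E+2 and 0.50 becomes 0.5;
    # a money-style number is then left with a single significant digit.
    digits = number.normalize().as_tuple().digits
    return digits in ((1,), (2,), (5,))


for candidate in [0.1, 0.5, 2, 10, 3, 15, 0.25]:
    print(candidate, is_money_style(candidate))
```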
 
 ## Numeric generalization functions
 
@@ -265,3 +304,22 @@ GROUP BY 1
 ### diffix.is_suppress_bin(*)
 
 Aggregate that returns `true` only for the suppress bin, `false` otherwise.
+
+### diffix.unnest_histogram(histogram)
+
+Unnests a 2-dimensional array into a result set of 1-dimensional arrays.
+
+**Example:**
+
+```
+SELECT diffix.unnest_histogram(diffix.count_histogram(account)) AS bins
+FROM transactions;
+
+bins
+----------
+{NULL,7}
+{1,15}
+{2,13}
+{4,6}
+(4 rows)
+```
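
When the un-nested form is not used and a client receives the histogram as array text, as printed in the `count_histogram` example, the same pairs can be recovered on the client. The following Python sketch assumes the value arrives as a string such as `{{NULL,7},{1,15}}`; many database drivers already return parsed lists, in which case this step is unnecessary. The name `parse_histogram_text` is hypothetical.

```python
import re

# Matches one inner pair such as {NULL,7} or {1,15}.
PAIR_PATTERN = re.compile(r"\{(NULL|-?\d+),(-?\d+)\}")


def parse_histogram_text(text):
    """Turn array text like '{{NULL,7},{1,15}}' into [[None, 7], [1, 15]]."""
    return [
        [None if row_count == "NULL" else int(row_count), int(num_entities)]
        for row_count, num_entities in PAIR_PATTERN.findall(text)
    ]


bins = parse_histogram_text("{{NULL,7},{1,15},{2,13},{4,6}}")
for row_count, num_entities in bins:
    print(row_count, num_entities)
```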
