Commit 58e38a3

feat: add crypto_hash_agg function
1 parent 160bde0 commit 58e38a3

2 files changed: +352 -6 lines changed

README.md

Lines changed: 111 additions & 6 deletions
@@ -60,6 +60,35 @@ Computes a cryptographic hash of the input value using the specified algorithm.

 **Note:** Different data types with the same value will produce different hashes (e.g., `42::INTEGER` vs `42::BIGINT` vs `'42'::VARCHAR`).

+### crypto_hash_agg()
+
+**Syntax:**
+```sql
+crypto_hash_agg(algorithm, value ORDER BY sort_expression) → BLOB
+```
+
+An aggregate function that computes a cryptographic hash over multiple rows of data. This is useful for creating checksums of entire datasets, detecting changes in groups of records, or generating deterministic identifiers for sets of values.
+
+**Parameters:**
+- `algorithm` (VARCHAR): The hash algorithm name (same algorithms as `crypto_hash`)
+- `value`: The column/expression to hash (supports same data types as `crypto_hash`)
+- `ORDER BY`: **Required** - ensures deterministic ordering of values before hashing
+
+**Returns:** BLOB containing the raw hash bytes, or NULL for empty result sets
+
+**Important Notes:**
+- The `ORDER BY` clause is **mandatory** because hash aggregation is order-dependent
+- Values are hashed sequentially in the order specified by `ORDER BY`
+- For `VARCHAR` and `BLOB` types, each value's length is hashed before its content (same as list hashing)
+- The function produces the same hash as `crypto_hash()` would produce for an equivalent list
+- Empty result sets return `NULL`
+
+**Use Cases:**
+- **Dataset Checksums**: Verify data integrity across tables or partitions
+- **Change Detection**: Detect if any values in a group have changed
+- **Merkle-like Hashing**: Create hierarchical hashes of grouped data
+- **Deterministic IDs**: Generate stable identifiers for sets of values
+
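The notes above say that values are hashed sequentially in `ORDER BY` order and that variable-length values have their length hashed before their content. The following Python sketch models that behaviour with the standard `hashlib` module. It is a conceptual illustration only: the function name `hash_sequence` is hypothetical, and the 8-byte little-endian length layout is an assumption (the README describes an 8-byte length in native byte order), so the digests are not claimed to match the extension's output.

```python
import hashlib
import struct


def hash_sequence(values, algorithm="sha256", length_prefix=True):
    """Hash byte strings one after another, in the order given.

    Returns None for an empty sequence, mirroring the documented NULL result.
    """
    values = list(values)
    if not values:
        return None
    h = hashlib.new(algorithm)
    for v in values:
        if length_prefix:
            # Assumption: 8-byte little-endian length written before the content.
            h.update(struct.pack("<Q", len(v)))
        h.update(v)
    return h.digest()


# Without a length prefix, element boundaries are lost: both inputs feed b"abc".
assert hash_sequence([b"ab", b"c"], length_prefix=False) == \
    hash_sequence([b"a", b"bc"], length_prefix=False)

# With the prefix, the two sequences hash differently.
assert hash_sequence([b"ab", b"c"]) != hash_sequence([b"a", b"bc"])

# The digest depends on order, so unordered input has to be sorted first.
assert hash_sequence([b"ab", b"c"]) != hash_sequence([b"c", b"ab"])
assert hash_sequence(sorted([b"c", b"ab"])) == hash_sequence(sorted([b"ab", b"c"]))
```

The last two assertions show why a fixed ordering is needed: the same set of values yields a stable digest only once the order is pinned down.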
 ### Supported Hash Algorithms

 | Algorithm | Output Size | Description |
@@ -127,6 +156,25 @@ SELECT octet_length(crypto_hash('sha2-256', 'test'));
 -- Handle NULL values
 SELECT crypto_hash('sha2-256', NULL::VARCHAR) IS NULL;
 -- true
+
+-- Aggregate hash over multiple rows (requires ORDER BY)
+SELECT lower(to_hex(crypto_hash_agg('sha2-256', email ORDER BY email)))
+FROM users;
+-- Produces a single hash representing all email values in order
+
+-- Aggregate hash with grouping
+SELECT
+    department,
+    lower(to_hex(crypto_hash_agg('sha2-256', employee_id ORDER BY employee_id))) as dept_hash
+FROM employees
+GROUP BY department;
+-- Produces a hash for each department's employee IDs
+
+-- Verify aggregate produces same hash as list
+SELECT crypto_hash_agg('sha2-256', value ORDER BY value) =
+       crypto_hash('sha2-256', [1, 2, 3, 4, 5]::INTEGER[])
+FROM (VALUES (1), (2), (3), (4), (5)) t(value);
+-- true (aggregate hash matches list hash)
 ```

 ## HMAC Functions
@@ -218,21 +266,57 @@ SELECT
 FROM users;
 ```

+### Dataset Integrity Verification
+```sql
+-- Create a checksum for an entire table partition
+SELECT
+    partition_date,
+    lower(to_hex(crypto_hash_agg('blake3', transaction_id ORDER BY transaction_id))) AS partition_checksum
+FROM transactions
+GROUP BY partition_date;
+
+-- Detect changes in a dataset by comparing checksums
+WITH current_hash AS (
+    SELECT crypto_hash_agg('sha2-256', data ORDER BY id) AS hash
+    FROM critical_table
+)
+SELECT hash = '\x<expected_hash_value>'::BLOB AS data_unchanged
+FROM current_hash;
+```
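To make the per-partition checksum idea concrete, here is a small Python model of it. This is a sketch under stated assumptions, not the extension's implementation: `partition_checksums` and `changed_partitions` are hypothetical helper names, SHA-256 from `hashlib` stands in for BLAKE3 (which the standard library does not provide), and transaction IDs are assumed to be integers encoded as 8 little-endian bytes.

```python
import hashlib
from collections import defaultdict


def partition_checksums(rows):
    """Compute one hex checksum per partition key.

    rows: iterable of (partition_key, transaction_id) pairs in any order.
    """
    groups = defaultdict(list)
    for key, txn_id in rows:
        groups[key].append(txn_id)

    checksums = {}
    for key, ids in groups.items():
        h = hashlib.sha256()
        # Sort inside each group so the digest does not depend on arrival order.
        for txn_id in sorted(ids):
            # Assumption: fixed-width 8-byte little-endian integer encoding.
            h.update(txn_id.to_bytes(8, "little", signed=True))
        checksums[key] = h.hexdigest()
    return checksums


def changed_partitions(before, after):
    """Return the partition keys whose checksum differs between two snapshots."""
    keys = before.keys() | after.keys()
    return sorted(k for k in keys if before.get(k) != after.get(k))


snapshot_a = [("2024-01-01", 1), ("2024-01-01", 2), ("2024-01-02", 3)]
# Same rows in a different order, plus one new transaction on 2024-01-02.
snapshot_b = [("2024-01-01", 2), ("2024-01-01", 1), ("2024-01-02", 3), ("2024-01-02", 4)]

print(changed_partitions(partition_checksums(snapshot_a),
                         partition_checksums(snapshot_b)))
```

Comparing checksums per partition narrows a change down to the affected group without comparing individual rows.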
+
+### Merkle-Style Hierarchical Hashing
+```sql
+-- Create hierarchical hashes for efficient change detection
+-- Level 1: Hash individual user transactions
+WITH user_hashes AS (
+    SELECT
+        user_id,
+        crypto_hash_agg('sha2-256', transaction_id ORDER BY timestamp) AS user_hash
+    FROM transactions
+    GROUP BY user_id
+)
+-- Level 2: Hash all user hashes to get global hash
+SELECT
+    lower(to_hex(crypto_hash_agg('sha2-256', user_hash ORDER BY user_id))) AS global_hash
+FROM user_hashes;
+```
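The two-level query above can be modelled in a few lines of Python to show how a change propagates upward. The helper names `agg_hash` and `merkle_style` are hypothetical, transaction IDs are represented as byte strings for simplicity, and the 8-byte little-endian length prefix is an assumption, so this sketch illustrates the structure and does not reproduce the extension's digests.

```python
import hashlib
import struct


def agg_hash(blobs):
    """Hash byte strings in the order given, each preceded by its length."""
    h = hashlib.sha256()
    for b in blobs:
        # Assumption: 8-byte little-endian length prefix per element.
        h.update(struct.pack("<Q", len(b)))
        h.update(b)
    return h.digest()


def merkle_style(transactions):
    """Build per-user hashes (level 1) and one global hash (level 2).

    transactions: dict mapping user_id to that user's transaction IDs
    (bytes), already in a fixed, deterministic order.
    """
    user_hashes = {user: agg_hash(txns) for user, txns in transactions.items()}
    # Level 2 hashes the level-1 digests, ordered by user_id.
    global_hash = agg_hash(user_hashes[user] for user in sorted(user_hashes))
    return user_hashes, global_hash


base = {1: [b"t1", b"t2"], 2: [b"t3"]}
edited = {1: [b"t1", b"t2"], 2: [b"t3x"]}  # one transaction of user 2 differs

base_users, base_global = merkle_style(base)
edited_users, edited_global = merkle_style(edited)

assert base_users[1] == edited_users[1]   # untouched user keeps its hash
assert base_users[2] != edited_users[2]   # changed user gets a new hash
assert base_global != edited_global       # and the change reaches the top level
```

One caveat applies to the ordering: if the sort column is not unique within a group, as a timestamp may not be, tied rows have no defined relative order, so adding a unique tie-breaker column is one way to keep the result deterministic.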
+
 ## Important Notes

-1. **Output Format**: Both `crypto_hash()` and `crypto_hmac()` return raw binary data as `BLOB`. Use `to_hex()` to convert to hexadecimal strings, or `lower(to_hex(...))` for lowercase hex.
+1. **Output Format**: `crypto_hash()`, `crypto_hash_agg()`, and `crypto_hmac()` all return raw binary data as `BLOB`. Use `to_hex()` to convert to hexadecimal strings, or `lower(to_hex(...))` for lowercase hex.

 2. **Type Sensitivity**: The hash is computed on the binary representation of the data type. The same numeric value with different types will produce different hashes:
    ```sql
    SELECT crypto_hash('sha2-256', 42::INTEGER) != crypto_hash('sha2-256', 42::BIGINT);
    -- true (different hashes)
    ```

-3. **NULL Handling**: Both functions return `NULL` if the input value is `NULL`.
+3. **NULL Handling**: `crypto_hash()` and `crypto_hmac()` return `NULL` if the input value is `NULL`. `crypto_hash_agg()` returns `NULL` for empty result sets.

-4. **List Hashing with Length Encoding**:
-   - For fixed-length types (integers, floats, dates, etc.) in lists, only the raw binary data is hashed
-   - For variable-length types (`VARCHAR` and `BLOB`) in lists, each element is hashed as: `[8-byte length][content]`
+4. **List and Aggregate Hashing with Length Encoding**:
+   - Applies to both `crypto_hash()` when hashing lists and `crypto_hash_agg()` when aggregating values
+   - For fixed-length types (integers, floats, dates, etc.), only the raw binary data is hashed
+   - For variable-length types (`VARCHAR` and `BLOB`), each element is hashed as: `[8-byte length][content]`
    - The length is encoded as a 64-bit unsigned integer (uint64_t) in native byte order
    - This prevents boundary collisions, where `['ab', 'c']` would otherwise hash the same as `['a', 'bc']`
    - Example:
@@ -245,6 +329,11 @@ FROM users;
      -- 9a8acca1b6c6c0befd3fbc756aed625da998c998f7252e738c4ef061906b9b21

      -- Different hashes prove length encoding prevents collisions
+
+     -- Same applies to aggregate function
+     SELECT lower(to_hex(crypto_hash_agg('sha2-256', data ORDER BY data)))
+     FROM (VALUES ('ab'), ('c')) t(data);
+     -- Produces different hash than ['a', 'bc']
      ```

 5. **Security Considerations**:
@@ -253,7 +342,23 @@ FROM users;
    - For HMAC operations, use a strong, randomly generated secret key
    - Blake3 HMAC requires exactly a 32-byte key

-6. **Algorithm Availability**: MD4 is deprecated and may be disabled in modern OpenSSL builds.
+6. **Aggregate Function Requirements**:
+   - `crypto_hash_agg()` **requires** an `ORDER BY` clause to ensure deterministic results
+   - Without `ORDER BY`, the function will raise an error
+   - The aggregate produces the same hash as `crypto_hash()` would for an equivalent ordered list
+   - Example:
+     ```sql
+     -- This works - produces same hash as list [1,2,3,4,5]
+     SELECT crypto_hash_agg('sha2-256', value ORDER BY value)
+     FROM (VALUES (5), (2), (1), (4), (3)) t(value);
+
+     -- This fails - ORDER BY is required
+     SELECT crypto_hash_agg('sha2-256', value)
+     FROM (VALUES (1), (2)) t(value);
+     -- Error: Hash aggregation requires a distinct total ordering
+     ```
+
+7. **Algorithm Availability**: MD4 is deprecated and may be disabled in modern OpenSSL builds.

 ## Comparison with Built-in DuckDB Functions
