Closed

Ci #1

40 commits
1a962cc
feat: added benchmark script and related fixes
carlosganzerla Jul 28, 2025
41da8f9
tests: added large set test
carlosganzerla Jul 30, 2025
470f313
feat: added GIN index support
carlosganzerla Jul 31, 2025
61f3f55
feat: added GiST index support
carlosganzerla Aug 12, 2025
dc46b8d
test: readded large set test without intarray dep
carlosganzerla Aug 29, 2025
2a0b2ec
feat: added GiST to benchmark.py
carlosganzerla Sep 1, 2025
98da30f
fix: better penalty implementation
carlosganzerla Sep 1, 2025
84b7603
feat: added determinsitic data to benchmark
carlosganzerla Sep 4, 2025
061ce9a
fix: fixed benchmark Postgres params table
carlosganzerla Sep 11, 2025
6cb6cfb
feat: added extra operators and hash index
carlosganzerla Sep 15, 2025
a2db0b8
feat: ANALYZE and indexable operators selfuncs
carlosganzerla Sep 23, 2025
c96eae6
docs: added README.md
carlosganzerla Oct 10, 2025
d457e70
CI test 1
carlosganzerla Oct 10, 2025
bb9be28
CI 2
carlosganzerla Oct 10, 2025
d56549b
CI 3
carlosganzerla Oct 10, 2025
de2f1f7
CI 4
carlosganzerla Oct 10, 2025
8f00edc
CI 4
carlosganzerla Oct 10, 2025
28789f4
CI 6
carlosganzerla Oct 10, 2025
1ad530c
CI 7
carlosganzerla Oct 10, 2025
083e0fa
CI 8
carlosganzerla Oct 10, 2025
4bc2144
CI 8
carlosganzerla Oct 10, 2025
6d8f5a6
YET AGAIN
carlosganzerla Oct 10, 2025
0b6e083
YET AGAIN
carlosganzerla Oct 10, 2025
550f094
YET AGAIN
carlosganzerla Oct 10, 2025
0e46442
YH
carlosganzerla Oct 10, 2025
75003a3
YH
carlosganzerla Oct 10, 2025
ce99777
YH
carlosganzerla Oct 10, 2025
229fd1a
YH
carlosganzerla Oct 10, 2025
1a4120e
caralho
carlosganzerla Oct 10, 2025
e5d6909
caralho agora vai
carlosganzerla Oct 10, 2025
95b8d6b
AGAIN
carlosganzerla Oct 10, 2025
74cb1fa
AGAINss
carlosganzerla Oct 10, 2025
e9bc133
OH MAGAWD
carlosganzerla Oct 10, 2025
a4393fd
OH MAGAWD
carlosganzerla Oct 10, 2025
998a745
OH MAGAWD
carlosganzerla Oct 10, 2025
45d23f9
OH MAGAWD
carlosganzerla Oct 10, 2025
c1be4b6
yet again
carlosganzerla Oct 10, 2025
aed7c91
merda
carlosganzerla Oct 10, 2025
75d6ec7
merda
carlosganzerla Oct 10, 2025
c5e7fc8
cartalohoo
carlosganzerla Oct 10, 2025
1 change: 1 addition & 0 deletions .gitattributes
@@ -0,0 +1 @@
benchmark/*.data filter=lfs diff=lfs merge=lfs -text
71 changes: 71 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,71 @@
name: build
on:
  workflow_dispatch:
  push:
    branches:
      - master
    tags:
      - "!*" # Do not execute on tags
    paths:
      - "**/*.c"
      - "**/*.h"
      - "**/*.sql"
      - data/**
      - expected/**
      - .clangd-format
      - .github/workflows/**
      - Makefile
      - "*.control"
  pull_request:
    paths:
      - "**/*.c"
      - "**/*.h"
      - "**/*.sql"
      - data/**
      - expected/**
      - .clangd-format
      - .github/workflows/**
      - Makefile
      - "*.control"
    branches:
      - "**"
env:
  PGPORT: 5432
  PGUSER: postgres
  PGDATABASE: postgres
  PGPASSWORD: postgres
  PGHOST: localhost

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: Update repositories
        run: sudo apt -y install wget ca-certificates &&
          wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add - &&
          sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt/ `lsb_release -cs`-pgdg main" >> /etc/apt/sources.list.d/pgdg.list' &&
          sudo apt-get --purge remove postgresql &&
          sudo apt update

      - name: Install postgres
        run: sudo apt install -y postgresql-17 libpq-dev clang-format postgresql-server-dev-17

      - name: Start Postgres
        run: sudo systemctl start postgresql && sleep 5

      - name: Set password
        run: sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'postgres';"

      - name: Read Postgres version
        run: sudo -u postgres psql -c "SELECT version();"

      - name: Install
        run: sudo make install

      - name: Format check
        run: find . -iname '*.h' -o -iname '*.c' | xargs clang-format --dry-run --Werror

      - name: Run tests
        run: make installcheck || (cat regression.diffs && exit -1)
11 changes: 7 additions & 4 deletions Makefile
@@ -1,14 +1,17 @@
 MODULE_big = pg_set
 OBJS = \
-    hash_set.o \
-    pg_set_io.o \
-    pg_set_op.o
+    hash_set.o \
+    pg_set_io.o \
+    pg_set_op.o \
+    pg_set_gin.o \
+    pg_set_gist.o \
+    pg_set_analyze.o \
 
 EXTENSION = pg_set
 
 DATA = pg_set--1.0.sql
 
-PG_CPPFLAGS = -std=c11 -Wextra -Wpedantic -O0
+PG_CPPFLAGS = -std=c11 -Wextra -Wpedantic
 
 REGRESS = pg_set_test

242 changes: 242 additions & 0 deletions README.md
@@ -0,0 +1,242 @@
# `pg_set`

This extension adds integer sets to PostgreSQL. It provides a new data type
`pg_set` that can store a set of integers (`int4`) efficiently, along with
various functions and operators to manipulate these sets.

## Features

- Written in C
- Efficient storage of integer sets using a mask (it's smaller than `int4[]` for
small enough sets)
- Array-compatible text representation and efficient casting to and from array
- Mathematical set properties and operations: union, intersection, difference,
containment, etc.
- Index support for GIN, GiST and Hash
- Statistic collection and array-like selectivity support for all indexable
operators
- Null elements are not supported. They are annoying in arrays and don't really
make sense for sets.

## Motivation

Originally I wrote this extension to learn more about the low-level details of
Postgres types, but I ended up adding more and more features so I decided to
publish it. The case that motivated me to write this extension was one where we
stored sets of IDs that referenced an external table (see `quasi_monotonic`
in the benchmark). We felt that creating a table was overkill, so we needed to
store sets of values. We also had a requirement that values could not be
duplicated and the order didn't matter, so what we needed was a set, not an
array. Traditionally, PostgreSQL has the following options:

- Use a plain `int4[]` array and ensure uniqueness at the application level or
with functions (e.g.
[`intarray`](https://www.postgresql.org/docs/current/intarray.html)).
- Use [`hstore`](https://www.postgresql.org/docs/current/hstore.html) extension
to store sets of integers as keys
- Use the built-in [`jsonb`](https://www.postgresql.org/docs/current/datatype-json.html)
type to store sets of integers as keys

Back then we used `intarray` with constraints, as it's pretty straightforward,
but I thought that it'd be nice to have a _set_ type to make it natural.
Since I didn't find one, I decided to write this extension to learn a little
bit more about PostgreSQL internals.

## Installation

Clone the repository and run:

```bash
make
sudo make install
```

Then, in your database:

```sql
CREATE EXTENSION pg_set;
```

## Usage

### Basic usage

```sql
-- Creating sets
-- Array-compatible representation
SELECT '{1,2,3}'::pg_set;

-- No duplication
SELECT '{1,1,1,1,1}'::pg_set;

-- With int4 args
SELECT pg_set_create(1,2,3);

CREATE TABLE reference (
id int4 GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
external_ids pg_set NOT NULL
);

INSERT INTO reference (external_ids) VALUES
('{1,2,3}'),
('{3,4,5,6}'),
('{7,8,9}'),
('{1,3,5,7,9}');

-- contains 1 and 3
SELECT * FROM reference WHERE external_ids @> '{1,3}';

-- is contained by the set
SELECT * FROM reference WHERE external_ids <@ '{1,2,3,4,7,8,9}';

-- contains 4
SELECT * FROM reference WHERE external_ids @> 4;

-- Same as above
SELECT * FROM reference WHERE 4 <@ external_ids;

-- contains 1 or 3
SELECT * FROM reference WHERE external_ids && '{1,3}';

-- Equality
SELECT * FROM reference WHERE external_ids = '{1,2,3}';

-- Inequality
SELECT * FROM reference WHERE external_ids <> '{1,2,3}';

-- Count of elements in the set
SELECT * FROM reference WHERE pg_set_count(external_ids) > 3;
```

### Set operations

```sql
-- Union
UPDATE
reference
SET
external_ids = external_ids + '{4}'
RETURNING *;

-- Add element
UPDATE
reference
SET
external_ids = external_ids + 5
RETURNING *;

-- Works on both sides
UPDATE
reference
SET
external_ids = 5 + external_ids
RETURNING *;

-- Can also remove an element
UPDATE
reference
SET
external_ids = external_ids - 5
RETURNING *;

-- Intersection
UPDATE
reference
SET
external_ids = external_ids * '{3,4,5}'
WHERE
external_ids && '{3,4,5}'
RETURNING *;

-- Difference
UPDATE
reference
SET
external_ids = external_ids - '{3,4,5}'
WHERE
external_ids && '{3,4,5}'
RETURNING *;
```

## `ANALYZE` support

Supports almost the same statistics collection as arrays: the default stats plus
`most_common_elems`, `most_common_elem_freqs` and `elem_count_histogram`.
`correlation` and `histogram_bounds` are not supported, as sets don't support the
less-than operation. Selectivity functions work for the `@>` (both set and
integer cases), `&&` and `=` operators (the Postgres default works well for `=`),
which are all the indexable operators so far.

## Index support

### GiST

The GiST index supports the `@>`, `&&` and `=` operators through the
`gist_pg_set_ops` operator class. The implementation uses an RD-tree data
structure with built-in lossy compression. It approximates sets as a bit mask
and also stores the minimum and maximum set elements to speed up overlap
queries at the expense of index size. It has an optional `masklen` parameter,
which is the mask length in bits. It goes from 16 bits (2 bytes) to 16064 bits
(2016 bytes). The default is 16 bytes. A higher `masklen` increases precision
at the expense of index size, so the two must be balanced for optimal
performance.

```sql
CREATE INDEX ON reference USING gist (external_ids);
CREATE INDEX ON reference USING gist (external_ids gist_pg_set_ops (masklen=2048));
```

#### Exclusion constraints

You can also define exclusion constraints using GiST. This makes it great to
use in conjunction with `btree_gist`:

```sql
CREATE EXTENSION btree_gist;

CREATE TABLE room_booking (
id int4 GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
room_id int NOT NULL,
booked_slots pg_set NOT NULL,
EXCLUDE USING gist (room_id WITH =, booked_slots WITH &&)
);

INSERT INTO room_booking (room_id, booked_slots) VALUES
(1, '{1,2,3}'),
(1, '{5,6,7}'),
(2, '{1,2,3}');

INSERT INTO room_booking (room_id, booked_slots) VALUES
(1, '{3,9}'); -- Fails, overlaps with first entry
```

### GIN

Supports the same operators as GiST. It works the same way as an `int4[]` GIN
index, as GIN is a tree of elements. It's implemented in the `gin_pg_set_ops`
operator class.

```sql
CREATE INDEX ON reference USING gin (external_ids);
```

### Hash

Supports only `=`, as usual. It's implemented in the `hash_pg_set_ops` operator
class.

```sql
CREATE INDEX ON reference USING hash (external_ids);
```

## Benchmark

See [`pg_set_benchmark`](https://github.com/carlosganzerla/pg_set_benchmark).

## Future work

Currently, this extension supports only `int4`. It would be nice to add support
for `int2` and `int8`, but that requires a lot of work, as we need to redefine
basically all functions at the SQL level and find a way to reuse the internals
without compromising performance. The same applies to `float4` and `float8`,
and to pretty much any fixed-length, sortable type.