45 commits
8d21e87
Make model::Type::Destructor take const pointer
p-senichenkov Sep 21, 2025
cb09ffd
Implement base model classes
p-senichenkov Sep 21, 2025
e1c1758
Add PAC-related names and descriptions
p-senichenkov Sep 21, 2025
7e46bbf
Implement base class for PAC verifiers
p-senichenkov Sep 21, 2025
2e31485
Implement Domain PAC verifier
p-senichenkov Sep 23, 2025
6dfe4ac
Implement Domain PAC verifier tests
p-senichenkov Sep 21, 2025
cbc57d9
Implement Domain PAC bindings
p-senichenkov Oct 10, 2025
5636a72
Add Domain PAC verifier Python examples
p-senichenkov Oct 10, 2025
b12245a
Edit PAC verifier to make it suitable for FD and UCC PACs
p-senichenkov Nov 29, 2025
129907c
Add separate "MakeTuples" function
p-senichenkov Nov 29, 2025
8aee84e
Add Tuples typedef
p-senichenkov Nov 29, 2025
b91a844
Fix formatting in bind_pac.cpp
p-senichenkov Nov 29, 2025
f2c1090
Remove PACHighlight base class
p-senichenkov Nov 30, 2025
aee3cc6
Add relational schema to PAC
p-senichenkov Dec 5, 2025
6f264e9
Save rel schema in Domain PAC verifier
p-senichenkov Dec 5, 2025
350ea76
Bind __eq__ for Domain PAC
p-senichenkov Dec 5, 2025
4728350
Fix PAC creation in Domain PAC verifier
p-senichenkov Dec 5, 2025
dca5516
Don't try to set eps and delta on PAC null pointer
p-senichenkov Dec 10, 2025
3d36ca5
Fix comparer in Parallelepiped domain
p-senichenkov Dec 19, 2025
77fa6f9
Normalize includes in bindings, edit Python descriptions
p-senichenkov Oct 10, 2025
71f73bb
Edit PAC-related names and descriptions
p-senichenkov Dec 27, 2025
bda7cb7
Normalize includes in Domain PAC verifier, use another algo
p-senichenkov Jan 26, 2026
ebd1ebc
Normalize includes in PAC verififer, use another algo
p-senichenkov Jan 26, 2026
308642a
Edit tests
p-senichenkov Dec 27, 2025
cb4e8cb
Normalize includes in base model classes, remove comparer
p-senichenkov Dec 27, 2025
bbb84d5
Fix bindigns tests
p-senichenkov Jan 28, 2026
8f42dc2
Some minor changes to PAC verifier (related to min_delta)
p-senichenkov Feb 1, 2026
dac4e49
Do not try to take prev(end) if end == begin in Domain PAC verifier
p-senichenkov Feb 1, 2026
3fc4341
Actualize examples basic-1 and basic-2
p-senichenkov Feb 1, 2026
b0d00ad
Apply review suggestions (examples)
p-senichenkov Feb 7, 2026
4777f39
Apply review suggestions (base model classes)
p-senichenkov Feb 7, 2026
00466ad
Apply review suggestions (PAC verifier)
p-senichenkov Feb 7, 2026
41f12d8
Apply review suggestions (Domain PAC verifier)
p-senichenkov Feb 7, 2026
f43b2c2
Remove unneded names, edit descriptions
p-senichenkov Feb 7, 2026
63f0963
Edit Domain PAC tests
p-senichenkov Feb 7, 2026
859b647
Edit Domain PAC bindings
p-senichenkov Feb 7, 2026
53293e6
Actualize examples basic-3, basic-4, advanced
p-senichenkov Feb 8, 2026
f83c8ed
Fix formatting in PAC verifier
p-senichenkov Feb 8, 2026
5e41dfd
Fix formatting in bindings
p-senichenkov Feb 8, 2026
bf35fc4
Update snapshots for examples tests
p-senichenkov Feb 8, 2026
fa17fc4
Fix formatting in PAC verifier
p-senichenkov Feb 8, 2026
cca8191
Fix formatting in tests
p-senichenkov Feb 8, 2026
279d71f
Do not use epsilon end delta in Domain PAC __hash__
p-senichenkov Feb 14, 2026
a0fd9d6
Fix examples
p-senichenkov Feb 28, 2026
ab7f699
Migrate to target-based CMake
p-senichenkov Mar 11, 2026
129 changes: 129 additions & 0 deletions examples/advanced/verifying_pac/verifying_domain_pac_custom_domain.py
@@ -0,0 +1,129 @@
'''Example 1 (advanced): Custom domain'''

from tabulate import tabulate
from csv import reader
from math import sqrt

import desbordante

RED = '\033[31m'
GREEN = '\033[32m'
BLUE = '\033[34m'
CYAN = '\033[36m'
BOLD = '\033[1;37m'
ENDC = '\033[0m'

USER_PREFERENCES = 'examples/datasets/verifying_pac/user_preferences.csv'


def csv_to_str(filename: str) -> str:
    with open(filename, newline='') as table:
        rows = list(reader(table, delimiter=','))
    headers = rows[0]
    rows = rows[1:]
    return tabulate(rows, headers=headers)


print(
f'''This example illustrates the usage of Domain Probabilistic Approximate Constraints (Domain PACs).
A Domain PAC on a column set X and domain D, with given ε and δ, means that {BOLD}Pr(x ∈ D±ε) ≥ δ{ENDC}.
For more information, see "Checks and Balances: Monitoring Data Quality Problems in Network
Traffic Databases" by Flip Korn et al. (Proceedings of the 29th VLDB Conference, Berlin, 2003).
If you have not read the basic Domain PAC examples yet (see the {CYAN}examples/basic/verifying_pac/{ENDC}
directory), it is recommended to start there.

Assume we have a dataset of user preferences, where each user\'s interest in several topics is
encoded as values in [0, 1], where 0 is "not interested at all" and 1 is "very interested":
{BOLD}{csv_to_str(USER_PREFERENCES)}{ENDC}

We need to estimate whether this group of users will be interested in the original Domain PAC paper
("Checks and Balances: ...").
To do this, we represent each user profile as a vector in a multi-dimensional topic space:
  ^ Topic 2
  |
  |   user
  |  x
  | /
  |/      Topic 1
 -+------->
  |

A "perfect" target reader might have the profile: {BLUE}(0.9, 0.4, 0.05){ENDC}.
This corresponds to:
* high interest in Databases;
* moderate interest in Networks;
* low interest in Machine Learning.
Our goal is to measure how close real users are to this ideal profile.

We use cosine distance, which measures the angle between two vectors rather than their absolute
length. This is useful because we care about interest proportions, not total magnitude.
{BOLD}dist(x, y) = 1 - cos(angle between x and y) = 1 - (x, y)/(|x| * |y|){ENDC},
where (x, y) is the dot product of x and y.
''')


def cosine_dist(x: list[float], y: list[float]) -> float:
    dot_product = 0
    x_length = 0
    y_length = 0
    for i in range(len(x)):
        dot_product += x[i] * y[i]
        x_length += x[i] * x[i]
        y_length += y[i] * y[i]
    x_length = sqrt(x_length)
    y_length = sqrt(y_length)
    return 1 - dot_product / (x_length * y_length)


print(f'''A custom domain is defined by two parameters:
1. Distance function -- takes a value tuple and returns the distance to the domain.
2. Domain name (optional) -- used for readable output.
In this example:
* distance function: {BLUE}dist(x, (0.9, 0.4, 0.05)){ENDC};
* domain name: {BLUE}"(0.9, 0.4, 0.05)"{ENDC}.
This effectively defines the domain as "users close to the ideal profile".
''')

PERFECT_USER = [0.9, 0.4, 0.05]


# The argument is always a list of strings, so convert it to floats first
def dist_from_domain(x: list[str]) -> float:
    x_f = [float(x_i) for x_i in x]
    return cosine_dist(x_f, PERFECT_USER)


domain = desbordante.pac.domains.CustomDomain(dist_from_domain,
                                              "(0.9, 0.4, 0.05)")

print(f'We run the Domain PAC verifier with domain={BLUE}{domain}{ENDC}.')
algo = desbordante.pac_verification.algorithms.DomainPACVerifier()
algo.load_data(table=(USER_PREFERENCES, ',', True),
               domain=domain,
               column_indices=[0, 1, 2])
algo.execute()
pac_1 = algo.get_pac()
print(f'''Algorithm result:
{GREEN}{pac_1}{ENDC}
Now we lower the required probability threshold: min_delta={BLUE}0.6{ENDC}.''')

algo.execute(min_delta=0.6)
pac_2 = algo.get_pac()

print(f'''Algorithm result:
{GREEN}{pac_2}{ENDC}
Interpretation:
* With a larger ε ({BLUE}{pac_1.epsilon:.3f}{ENDC}), nearly all users show some level of interest;
* With a very small ε ({BLUE}{pac_2.epsilon:.3f}{ENDC}), only {BLUE}{pac_2.delta * 100.0:.0f}%{ENDC} of users closely match the ideal reader.

You can check outliers to identify which users are closer to or farther from the ideal profile.
For an introduction to outliers, see {CYAN}examples/basic/verifying_pac/verifying_domain_pac1.py{ENDC}.

You now know how to use Domain PACs with built-in domains as well as custom ones.
Try applying them to your own data and see what insights you can uncover.'''
)

# C++ note: the custom domain is called "Untyped domain" in the C++ code, because it erases type
# information by converting all values to strings. If you use the C++ library, it is recommended to
# implement the IDomain interface or derive from MetricBasedDomain (if your domain is based on
# coordinate-wise metrics). See the Parallelepiped and Ball implementations for examples.
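The cosine distance used in this example can be checked on its own, without Desbordante. The sketch below is an illustration and not part of the PR: the zero-vector guard and the helper `fraction_within` are assumptions introduced here, and the sample profiles are made up rather than read from user_preferences.csv.

```python
from math import sqrt


def cosine_dist(x: list[float], y: list[float]) -> float:
    """Return 1 - cos(angle between x and y)."""
    dot_product = sum(a * b for a, b in zip(x, y))
    x_length = sqrt(sum(a * a for a in x))
    y_length = sqrt(sum(b * b for b in y))
    if x_length == 0 or y_length == 0:
        # Assumption: reject zero vectors instead of dividing by zero
        raise ValueError('cosine distance is undefined for a zero vector')
    return 1 - dot_product / (x_length * y_length)


def fraction_within(profiles: list[list[float]], target: list[float],
                    eps: float) -> float:
    """Hypothetical helper: share of profiles whose distance to target is at most eps."""
    close = sum(1 for p in profiles if cosine_dist(p, target) <= eps)
    return close / len(profiles)


PERFECT_USER = [0.9, 0.4, 0.05]
# Made-up profiles: an exact match, a scaled copy and an ML-only reader
sample_profiles = [[0.9, 0.4, 0.05], [1.8, 0.8, 0.1], [0.0, 0.0, 1.0]]

for profile in sample_profiles:
    print(profile, round(cosine_dist(profile, PERFECT_USER), 3))
print(fraction_within(sample_profiles, PERFECT_USER, 0.1))
```

The scaled copy ends up at a distance of (numerically) zero, which is the property the example relies on: only the proportions between topics matter, not the overall magnitude.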
169 changes: 169 additions & 0 deletions examples/basic/verifying_pac/verifying_domain_pac1.py
@@ -0,0 +1,169 @@
'''Example 1: 1D segment, highlights'''

from tabulate import tabulate
from csv import reader

import desbordante

RED = '\033[31m'
YELLOW = '\033[33m'
BOLD_YELLOW = '\033[1;33m'
GREEN = '\033[32m'
BLUE = '\033[34m'
CYAN = '\033[36m'
BOLD = '\033[1;37m'
ENDC = '\033[0m'

ENGINE_TEMPS_BAD = 'examples/datasets/verifying_pac/engine_temps_bad.csv'
ENGINE_TEMPS_GOOD = 'examples/datasets/verifying_pac/engine_temps_good.csv'


def read_column(filename: str, col_num: int) -> tuple[str, list[str]]:
    with open(filename, newline='') as table:
        rows = list(reader(table, delimiter=','))
    header = rows[0][col_num]
    values = [row[col_num] for row in rows[1:]]
    return header, values


def column_to_str(filename: str, col_num: int) -> str:
    header, values = read_column(filename, col_num)
    values_str = ', '.join(values)
    return f'{BOLD}{header}: [{values_str}]{ENDC}'


def display_columns_diff(filename_old: str, col_num_old: int,
                         filename_new: str, col_num_new: int) -> str:
    _, values_old = read_column(filename_old, col_num_old)
    header, values_new = read_column(filename_new, col_num_new)
    values = []
    for i in range(len(values_new)):
        value = values_new[i]
        if values_old[i] != value:
            # Highlight modified values, then switch back to the regular bold style
            value = f'{BOLD_YELLOW}{value}{BOLD}'
        values.append(value)
    values_str = ', '.join(values)
    return f'{BOLD}{header}: [{values_str}]{ENDC}'


print(
f'''This example illustrates the usage of Domain Probabilistic Approximate Constraints (PACs).
A Domain PAC on a column set X and domain D, with given ε and δ, means that Pr(x ∈ D±ε) ≥ δ.
For more information, see "Checks and Balances: Monitoring Data Quality Problems in Network
Traffic Databases" by Flip Korn et al. (Proceedings of the 29th VLDB Conference, Berlin, 2003).

This is the first example in the "Basic Domain PAC verification" series. Others can be found in
the {CYAN}examples/basic/verifying_pac/{ENDC} directory.
''')

print(
f'''Suppose we are working on a new engine model. Its operating temperature range is {BLUE}[85, 95]{ENDC}°C.
The engine is made of high-strength metal, so short-term temperature deviations are acceptable and
will not cause immediate damage. In other words, the engine operates properly when Pr(t ∈ [85, 95]±ε) ≥ δ.
Based on engineering analysis, the acceptable limits are: ε = {BLUE}5{ENDC}, δ = {BLUE}0.9{ENDC}.

In terms of Domain PACs, the following constraint should hold: {BLUE}Pr(x ∈ [85, 95]±5) ≥ 0.9{ENDC}.
''')

print(
'The following table contains readings from the engine temperature sensor:'
)
# Values are printed on one line for brevity; the original table has a single column
print(f'{column_to_str(ENGINE_TEMPS_BAD, 0)}')
print()

print(
'We now use the Domain PAC verifier to determine whether the engine is operating safely.'
)

print(
f'''First, we need to define the domain. Available options are:
* {BLUE}Parallelepiped{ENDC} -- a closed n-dimensional parallelepiped
* {BLUE}Ball{ENDC} -- a closed n-dimensional ball
* {BLUE}CustomDomain{ENDC} -- a domain with a user-defined metric
A segment is simply a one-dimensional parallelepiped, so we use the {BLUE}Parallelepiped{ENDC} domain here.''')
# Parallelepiped has a special constructor for segments.
# Note the quotes: these strings will be converted to values once the table is loaded.
segment = desbordante.pac.domains.Parallelepiped('85', '95')

print(
f'''We run the algorithm with the following options: domain={BLUE}{segment}{ENDC}. All other parameters use default
values: min_epsilon={BLUE}0{ENDC}, max_epsilon={BLUE}∞{ENDC}, min_delta={BLUE}0.9{ENDC}, delta_steps={BLUE}100{ENDC}.
''')

algo = desbordante.pac_verification.algorithms.DomainPACVerifier()
# Note that the domain must be set in `load_data`, not in `execute`
algo.load_data(table=(ENGINE_TEMPS_BAD, ',', True),
               column_indices=[0],
               domain=segment)
algo.execute()

print(f'Algorithm result: {YELLOW}{algo.get_pac()}{ENDC}.\n')
print(
f'''This result is not directly informative for our goal. Since both ε and δ exceed the required values,
we cannot determine whether the constraint holds for ε={BLUE}5{ENDC} and δ={BLUE}0.9{ENDC}.

Let\'s run the algorithm with min_epsilon={BLUE}5{ENDC} and max_epsilon={BLUE}5{ENDC}. This will give us the exact δ
for which the PAC with ε={BLUE}5{ENDC} holds.
''')

# Note that when min_epsilon or max_epsilon is specified, the default min_delta becomes 0
algo.execute(min_epsilon=5, max_epsilon=5)

print(f'Algorithm result: {RED}{algo.get_pac()}{ENDC}.\n')
print(
f'''Also, let\'s run the algorithm with max_epsilon={BLUE}0{ENDC} and min_delta={BLUE}0.9{ENDC} to check which ε
is needed to satisfy δ={BLUE}0.9{ENDC}. With these parameters, the algorithm enters a special mode and returns
the pair (ε, min_delta), so that we can validate the PAC with the given δ.
''')

# Actually, the algorithm enters this mode whenever max_epsilon is less than the epsilon needed to
# satisfy min_delta.
algo.execute(max_epsilon=0, min_delta=0.9)

pac = algo.get_pac()
print(f'Algorithm result: {RED}{pac}{ENDC}.\n')
print(
f'''Here the algorithm reports δ={BLUE}{pac.delta}{ENDC}, which is greater than {BLUE}0.9{ENDC}: reaching δ={BLUE}0.9{ENDC} requires
ε={BLUE}{pac.epsilon}{ENDC}, and with this ε the PAC ({BLUE}{pac.epsilon}{ENDC}, {BLUE}{pac.delta}{ENDC}) holds. So δ={BLUE}0.9{ENDC} also requires ε={BLUE}{pac.epsilon}{ENDC}.
''')

print(
    'We can see that the desired PAC doesn\'t hold, so the engine can blow up!\n')

print(
f'''Let\'s look at the values violating the PAC. The Domain PAC verifier can detect values between eps_1
and eps_2, i.e. values that lie in D±eps_2 \\ D±eps_1. Such values are called outliers (or highlights).
Let\'s find outliers for different values of eps_1 and eps_2:''')

value_ranges = [(0, 1), (1, 2), (2, 3), (3, 5), (5, 7), (7, 10)]
highlights_table = [(f'{BLUE}{v_range[0]}{ENDC}', f'{BLUE}{v_range[1]}{ENDC}',
                     str(algo.get_highlights(*v_range)))
                    for v_range in value_ranges]
print(tabulate(highlights_table, headers=('eps_1', 'eps_2', 'outliers')))
print()

print('''We can see two problems:
1. The engine operated at low temperatures for an extended period, slightly below 80°C.
2. The peak temperature was too high, but this occurred only once.\n''')

print('''The second version of the engine has:
1. A pre-heating system to prevent operation at low temperatures.
2. An emergency cooling system to limit peak temperatures.
The updated sensor readings (modified values highlighted) are:''')
print(f'{display_columns_diff(ENGINE_TEMPS_BAD, 0, ENGINE_TEMPS_GOOD, 0)}')
print()

print(f'''We run the Domain PAC verifier again.''')
algo = desbordante.pac_verification.algorithms.DomainPACVerifier()
algo.load_data(table=(ENGINE_TEMPS_GOOD, ',', True),
               column_indices=[0],
               domain=segment)
algo.execute()

print(f'''Algorithm result: {GREEN}{algo.get_pac()}{ENDC}.

The desired PAC now holds, which means the improved engine operates within acceptable limits.

It is recommended to continue with the second example ({CYAN}examples/basic/verifying_pac/verifying_domain_pac2.py{ENDC}),
which demonstrates more advanced usage of the Parallelepiped domain.''')
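To make the quantities in this example concrete, the definition Pr(x ∈ D±ε) ≥ δ can be evaluated by hand for a one-dimensional segment. The sketch below is not Desbordante's algorithm. It assumes that δ is the plain fraction of values within distance ε of the segment and that outliers between eps_1 and eps_2 satisfy eps_1 < dist ≤ eps_2. All function names and the readings are hypothetical and are not taken from engine_temps_bad.csv.

```python
from math import ceil


def dist_to_segment(x: float, lo: float, hi: float) -> float:
    """Distance from x to the closed segment [lo, hi] (0 inside the segment)."""
    if x < lo:
        return lo - x
    if x > hi:
        return x - hi
    return 0.0


def empirical_delta(values: list[float], lo: float, hi: float,
                    eps: float) -> float:
    """Fraction of values that lie in [lo, hi]±eps."""
    inside = sum(1 for v in values if dist_to_segment(v, lo, hi) <= eps)
    return inside / len(values)


def min_epsilon_for_delta(values: list[float], lo: float, hi: float,
                          delta: float) -> float:
    """Smallest eps such that at least a delta fraction of values is covered."""
    dists = sorted(dist_to_segment(v, lo, hi) for v in values)
    # Small tolerance guards against floating-point noise in delta * len(dists)
    needed = max(1, ceil(delta * len(dists) - 1e-12))
    return dists[needed - 1]


def outliers_between(values: list[float], lo: float, hi: float,
                     eps_1: float, eps_2: float) -> list[float]:
    """Values in D±eps_2 but not in D±eps_1 (assumed boundary convention)."""
    return [v for v in values
            if eps_1 < dist_to_segment(v, lo, hi) <= eps_2]


# Made-up sensor readings for illustration only
readings = [79.0, 79.5, 86.0, 90.0, 92.0, 94.0, 95.0, 88.0, 91.0, 104.0]

delta = empirical_delta(readings, 85, 95, eps=5)
print(f'delta for eps=5: {delta}, PAC (5, 0.9) holds: {delta >= 0.9}')
print(f'eps needed for delta=0.9: {min_epsilon_for_delta(readings, 85, 95, 0.9)}')
print(f'outliers in (5, 7]: {outliers_between(readings, 85, 95, 5, 7)}')
```

With these readings only 7 of 10 values fall within [80, 100], so the constraint with ε=5 and δ=0.9 fails, mirroring the situation described for the first engine.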