Skip to content

Commit 88d6b32

Browse files
committed
Actualize examples basic-1 and basic-2
1 parent 2d31789 commit 88d6b32

File tree

3 files changed

+153
-90
lines changed

3 files changed

+153
-90
lines changed

examples/basic/verifying_pac/verifying_domain_pac1.py

Lines changed: 92 additions & 54 deletions
Original file line numberDiff line numberDiff line change
@@ -6,79 +6,128 @@
66
import desbordante
77

88
RED = '\033[31m'
9+
YELLOW = '\033[33m'
10+
BOLD_YELLOW = '\033[1;33m'
911
GREEN = '\033[32m'
1012
BLUE = '\033[34m'
1113
CYAN = '\033[36m'
12-
GRAY = '\033[1;30m'
14+
BOLD = '\033[1;37m'
1315
ENDC = '\033[0m'
1416

1517
ENGINE_TEMPS_BAD = 'examples/datasets/verifying_pac/engine_temps_bad.csv'
1618
ENGINE_TEMPS_GOOD = 'examples/datasets/verifying_pac/engine_temps_good.csv'
1719

1820

19-
def column_to_str(filename: str) -> str:
21+
def read_column(filename: str, col_num: int) -> (str, list[str]):
2022
with open(filename, newline='') as table:
2123
rows = list(reader(table, delimiter=','))
22-
headers = rows[0]
23-
rows = rows[1:]
24-
values = ', '.join(list(map(lambda row: str(row[0]), rows)))
25-
return f'{headers[0]}: [{values}]'
24+
header = rows[0][col_num]
25+
values = [row[col_num] for row in rows[1:]]
26+
return header, values
27+
28+
29+
def column_to_str(filename: str, col_num: int) -> str:
30+
header, values = read_column(filename, col_num)
31+
values_str = ', '.join(values)
32+
return f'{BOLD}{header}: [{values_str}]{ENDC}'
33+
34+
35+
def display_columns_diff(filename_old: str, col_num_old: int,
36+
filename_new: str, col_num_new: int) -> str:
37+
_, values_old = read_column(filename_old, col_num_old)
38+
header, values_new = read_column(filename_new, col_num_new)
39+
values = []
40+
for i in range(len(values_new)):
41+
value = values_new[i]
42+
if values_old[i] != value:
43+
value = f'{BOLD_YELLOW}' + value + f'{BOLD}'
44+
values.append(value)
45+
values_str = ', '.join(values)
46+
return f'{BOLD}{header}: [{values_str}]{ENDC}'
2647

2748

2849
print(
29-
f'''{CYAN}This example illustrates the usage of Domain Probabilistic Approximate Constraints (PACs).
30-
Domain PAC on column set X and domain D, with given epsilon and delta means that Pr(x ∈ D±epsilon) ≥ delta.
50+
f'''This example illustrates the usage of Domain Probabilistic Approximate Constraints (PACs).
51+
A Domain PAC on column set X and domain D, with given ε and δ means that Pr(x ∈ D±ε) ≥ δ.
3152
For more information consult "Checks and Balances: Monitoring Data Quality Problems in Network
32-
Traffic Databases" by Flip Korn et al.
53+
Traffic Databases" by Flip Korn et al (Proceedings of the 29th VLDB Conference, Berlin, 2003).
3354
34-
This is the first example in "Basic Domain PAC verification" series. Others can be found in
35-
examples/basic/verifying_pac/{ENDC} directory.
55+
This is the first example in the "Basic Domain PAC verification" series. Others can be found in
56+
{CYAN}examples/basic/verifying_pac/{ENDC} directory.
3657
''')
3758

3859
print(
39-
f'''Consider we are working on a new model of engine. It\'s working temperatures lie in span {BLUE}[85, 95]{ENDC}°C.
40-
Engine is made of high-strenght metal, so slight short-term temperature deviations won\'t kill it.
41-
In other words, engine works properly when Pr(t ∈ [85, 95]±epsilon) ≥ delta.
42-
Our engieneers have figured out limits: epsilon = {BLUE}5{ENDC}, delta = {BLUE}0.9{ENDC}.
43-
So, in terms of Domain PACs, {BLUE}Domain PAC Pr(x ∈ [85, 95]±5) ≥ 0.9{ENDC} should hold.
60+
f'''Suppose we are working on a new model of engine. Its operating temperature range is {BLUE}[85, 95]{ENDC}°C.
61+
The engine is made of high-strength metal, so short-term temperature deviations are acceptable and
62+
will not cause immediate damage. In other words, engine operates properly when Pr(t ∈ [85, 95]±ε) ≥ δ.
63+
Based on enginnering analysis, the acceptable limits are: ε = {BLUE}5{ENDC}, δ = {BLUE}0.9{ENDC}.
64+
In terms of Domain PACs, the following constraint should hold: {BLUE}Pr(x ∈ [85, 95]±5) ≥ 0.9{ENDC}.
4465
''')
4566

46-
print('The following table contains readings of engine temperature sensor:')
67+
print(
68+
'The following table contains readings from the engine temperature sensor:'
69+
)
4770
# Values are printed in one line for brevity, original table is single-column
48-
print(f'{GRAY}{column_to_str(ENGINE_TEMPS_BAD)}{ENDC}')
71+
print(f'{column_to_str(ENGINE_TEMPS_BAD, 0)}')
4972
print()
5073

51-
print('Let\'s use Domain PAC verifier to check if engine will be damaged\n')
74+
print(
75+
'We now use the Domain PAC verifier to determine whether the engine is operating safely.'
76+
)
5277

5378
print(
54-
'Firstly we need to create domain. Segment is a special case of parallelepiped, so let\'s use it.'
79+
'First, we need to define the domain. A segment is a special case of a parallelepiped, so we use it here.'
5580
)
5681
# Parallelepiped has a special constructor for segment.
57-
# Notice the usage of quotes: these strings will be converted to values once table is loaded.
82+
# Notice the usage of quotes: these strings will be converted to values once the table is loaded.
5883
segment = desbordante.pac.domains.Parallelepiped('85', '95')
5984

60-
# TODO(senichenkov): diagonal_threshold example
6185
print(
62-
f'''Now let\'s run algorithm with the following options: domain={BLUE}{segment}{ENDC}, max_epsilon={BLUE}10{ENDC}, min_delta={BLUE}0.85{ENDC}.
63-
Max_epsilon should be greater than desired epsilon, min_delta -- a little less than the expected one.
64-
We will use default values of other options: min_epsilon={BLUE}0{ENDC}, epsilon_steps={BLUE}100{ENDC}, diagonal_threshold={BLUE}1e-5{ENDC}.
65-
Min_epsilon, max_epsilon and epsilon_steps contol which epsilon values and how many of them will be checked by the algorithm.
66-
Diagonal threshold is advanced parameter, that is explained in {CYAN}examples/advanced/verifying_pac/##EXAMPLE_NAME##{ENDC}.
86+
f'''We run algorithm with the following options: domain={BLUE}{segment}{ENDC}.
87+
All other parameters use default values: min_epsilon={BLUE}0{ENDC}, max_epsilon={BLUE}{ENDC}, min_delta={BLUE}0.9{ENDC}, delta_steps={BLUE}100{ENDC}.
6788
''')
89+
6890
algo = desbordante.pac_verification.algorithms.DomainPACVerifier()
6991
# Note that domain should be set in `load_data`, not `execute`
7092
algo.load_data(table=(ENGINE_TEMPS_BAD, ',', True),
7193
column_indices=[0],
7294
domain=segment)
73-
algo.execute(max_epsilon=10, min_delta=0.85)
95+
algo.execute()
96+
97+
print(f'Algorithm result: {YELLOW}{algo.get_pac()}{ENDC}.')
98+
print(
99+
f'''This PAC is not very informative. Let\'s run algorithm with min_epsilon={BLUE}5{ENDC} and max_epsilon={BLUE}5{ENDC}.
100+
This will give us the exact δ, for which PAC with ε={BLUE}5{ENDC} holds.
101+
''')
102+
103+
# Note that, when min_epsilon or max_epsilon is specified, default min_delta becomes 0
104+
algo.execute(min_epsilon=5, max_epsilon=5)
74105

75106
print(f'Algorithm result: {RED}{algo.get_pac()}{ENDC}.')
76-
print('Uh-oh! The desired PAC doesn\'t hold, so engine can blow up!\n')
107+
print(
108+
f'''Also, let\'s run algorithm with max_epsilon={BLUE}0{ENDC} and min_delta={BLUE}0.9{ENDC} to check which ε
109+
is needed to satisfy δ={BLUE}0.9{ENDC}. With these parameters algorithm enter special mode and returns
110+
pair (ε, min_delta), so that we can validate PAC with the given δ.
111+
''')
112+
113+
# Actually, algorithm enter this mode whenever max_epsilon is less than epsislon needed to satisfy
114+
# min_delta.
115+
algo.execute(max_epsilon=0, min_delta=0.9)
116+
117+
pac = algo.get_pac()
118+
print(f'Algorithm result: {RED}{pac}{ENDC}.')
119+
print(
120+
f'''Here algorithm gives δ={BLUE}{pac.delta}{ENDC}, which is greater than {BLUE}0.9{ENDC}, because achieving δ={BLUE}0.9{ENDC} requires
121+
ε={BLUE}{pac.epsilon}{ENDC} and PAC ({BLUE}{pac.epsilon}{ENDC}, {BLUE}{pac.delta}{ENDC}) holds. So, this means that δ={BLUE}0.9{ENDC} would also require ε={BLUE}{pac.epsilon}{ENDC}.
122+
''')
123+
124+
print(
125+
'We can see that desired PAC doesn\'t hold, so the engine can blow up!\n')
77126

78127
print(
79128
f'''Let\'s look at values violating PAC. Domain PAC verifier can detect values between eps_1
80-
and eps_2, i. e. values that lie in D±eps_2 \\ D±eps_1. Such values are called highlights.
81-
Let\'s check highlights for different eps_1, eps_2 values:''')
129+
and eps_2, i. e. values that lie in D±eps_2 \\ D±eps_1. Such values are called highlights or outliers.
130+
Let\'s find outliers for different eps_1, eps_2 values:''')
82131

83132
value_ranges = [(0, 1), (1, 2), (2, 3), (3, 5), (5, 7), (7, 10)]
84133
highlights_table = [(f'{BLUE}{v_range[0]}{ENDC}', f'{BLUE}{v_range[1]}{ENDC}',
@@ -88,36 +137,25 @@ def column_to_str(filename: str) -> str:
88137
print()
89138

90139
print('''We can see two problems:
91-
1. Time engine worked on low temperature was too long, but these temperatures were just a little lower than 80°C.
92-
2. Peak temperature was too high, but it has been reached only once.\n'''
93-
)
94-
95-
print('''Second version of engine has:
96-
1. pre-heating system to prevent engine from working on low temperatures;
97-
2. emergency cooling system to lower peak temperatures.
98-
Let\'s look at second version\'s sensor readings:''')
99-
print(f'{GRAY}{column_to_str(ENGINE_TEMPS_GOOD)}{ENDC}')
140+
1. The engine operated at low temperatures for an extended period, slightly below 80°C.
141+
2. The peak temperature was too high, but this occured only once.\n''')
142+
143+
print('''The second version of engine has:
144+
1. A pre-heating system to prevent operation at low temperatures.
145+
2. An emergency cooling system to limit peak temperatures.
146+
The updated sensor readings (modified values highlighted) are:''')
147+
print(f'{display_columns_diff(ENGINE_TEMPS_BAD, 0, ENGINE_TEMPS_GOOD, 0)}')
100148
print()
101149

102-
print(
103-
f'''Let\'s run Domain PAC verifier with same parameters (domain={BLUE}{segment}{ENDC}, max_epsilon={BLUE}10{ENDC}).'''
104-
)
150+
print(f'''We run the Domain PAC verifier again.''')
105151
algo = desbordante.pac_verification.algorithms.DomainPACVerifier()
106152
algo.load_data(table=(ENGINE_TEMPS_GOOD, ',', True),
107153
column_indices=[0],
108154
domain=segment)
109-
algo.execute(max_epsilon=10, min_delta=0.85)
110-
111-
print(f'Algorithm result: {RED}{algo.get_pac()}{ENDC}.')
112-
print(
113-
f'''This PAC says that epsilon={BLUE}6{ENDC} is enough to cover all possible values. It\'s true, but this information is useless.
114-
Let\'s select max_epsilon={BLUE}5.5{ENDC} and min_delta={BLUE}0.87{ENDC} to give algorithm a hint which values do we want.'''
115-
)
116-
# Max_epsilon is an "execute option", so `load_data` is not needed
117-
algo.execute(max_epsilon=5.5, min_delta=0.87)
155+
algo.execute()
118156

119157
print(f'''Algorithm result: {GREEN}{algo.get_pac()}{ENDC}.
120-
Our desired PAC holds, which means that engine works well.
158+
The desired PAC now holds, which means the improved engine operates within acceptable limits.
121159
122-
It's recommended to continue with reading second example ({CYAN}examples/basic/verifying_pac/verifying_domain_pac2.py{ENDC}),
160+
It is recommended to continue with the second example ({CYAN}examples/basic/verifying_pac/verifying_domain_pac2.py{ENDC}),
123161
which demonstrates more advanced usage of the Parallelepiped domain.''')

examples/basic/verifying_pac/verifying_domain_pac2.py

Lines changed: 60 additions & 35 deletions
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,7 @@
99
GREEN = '\033[32m'
1010
BLUE = '\033[34m'
1111
CYAN = '\033[36m'
12-
GRAY = '\033[1;30m'
12+
BOLD = '\033[1;37m'
1313
ENDC = '\033[0m'
1414

1515
ENGINE_TEMPS = 'examples/datasets/verifying_pac/engine_temps_bad.csv'
@@ -24,71 +24,96 @@ def csv_to_str(filename: str) -> str:
2424

2525

2626
print(
27-
f'''{CYAN}This example illustrates the usage of Domain Probabilistic Approximate Constraints (PACs).
28-
This example continues first Domain PAC verification example (examples/basic/verifying_pac/verifying_domain_pac1.py).
29-
If you haven't read first part, start with reading it.{ENDC}
27+
f'''This example illustrates the usage of Domain Probabilistic Approximate Constraints (Domain PACs)
28+
on multiple columns. It continues the first Domain PAC verification example
29+
{CYAN}(examples/basic/verifying_pac/verifying_domain_pac1.py){ENDC}. If you have not read the first part yet,
30+
it is recommended to start there.
3031
''')
3132

3233
print(
33-
f'''In first example we\'ve verified {BLUE}Domain PAC Pr(x ∈ [85, 95]±5) ≥ 0.9{ENDC} on temperature sensor readings.
34-
Now we also have tachometer readings:''')
35-
print(f'{GRAY}{csv_to_str(ENGINE_TEMPS)}{ENDC}')
34+
f'''In first example we verified {BLUE}Domain PAC Pr(x ∈ [85, 95]±5) ≥ 0.9{ENDC} on engine temperature sensor readings.
35+
Now, in addition to temperature readings, we also have tachometer data:''')
36+
print(f'{BOLD}{csv_to_str(ENGINE_TEMPS)}{ENDC}')
3637
print()
3738

3839
print(
39-
f'''Normal RPM for our engine lie in {BLUE}[1500, 3500]{ENDC}. Values in {BLUE}[0, 5000]{ENDC} are not harmful themselves, but
40-
cold engine will stall on low RPM and can be damaged on high RPM, and overheated engine can blow up, especially on RPM
41-
outside of {BLUE}[1500, 3500]{ENDC} (because cooling system operation depends on RPM).
42-
Let\'s use Domain PAC verifier to check if engine works properly, just like we did in first example.
40+
f'''The normal operating RPM for this engine is {BLUE}[1500, 3500]{ENDC}. Values outside this range are
41+
not harmful by themselves (as long as they are within {BLUE}[0, 5000]{ENDC}), but:
42+
* A cold engine may stall at low RPM and can be damaged at high RPM.
43+
* An overheated engine is especially vulnerable at RPM values outside [1500, 3500], because
44+
cooling efficiency depends on RPM.
45+
As in the first example, we use the Domain PAC verifier to check whether the engine operates properly.
4346
''')
4447

4548
print(
4649
f'''Firstly we need to create domain. We have a cartesian product of two segments: {BLUE}[85, 95] x [1500, 3500]{ENDC},
4750
so it would be natural to use parallelepiped.''')
48-
# Arguments of generic version are A = (a0, a1, ..., an) and B = (b0, b1, ..., bn).
49-
# Domain is [A, B] = [a0, b0] x [a1, b1] x ... x [an, bn].
51+
print(
52+
f'''We now work with two columns: temperature and RPM. The acceptable operating region is a Cartesian product
53+
of two segments:
54+
* temperature: [85, 95];
55+
* RPM: [1500, 3500].
56+
This forms a parallelepiped domain: {BLUE}[85, 95] x [1500, 3500]{ENDC}.
57+
''')
58+
# Arguments of generic version are A = (a1, a2, ..., an) and B = (b1, b2, ..., bn).
59+
# Domain is [A, B] = [a1, b1] x [a2, b2] x ... x [an, bn].
5060
parallelepiped = desbordante.pac.domains.Parallelepiped(['85', '1500'],
5161
['95', '3500'])
5262

5363
print(
54-
f'Let\'s run Domain PAC verifier with domain={BLUE}{parallelepiped}{ENDC}, max_epsilon={BLUE}15{ENDC}, min_delta={BLUE}0.85{ENDC}.'
55-
)
64+
f'''We run the Domain PAC verifier with the following parameters: domain={BLUE}{parallelepiped}{ENDC},
65+
max_epsilon={BLUE}15{ENDC}.
66+
''')
5667
algo = desbordante.pac_verification.algorithms.DomainPACVerifier()
5768
algo.load_data(table=(ENGINE_TEMPS, ',', True),
5869
domain=parallelepiped,
5970
column_indices=[0, 1])
60-
algo.execute(max_epsilon=15, min_delta=0.85)
71+
algo.execute(max_epsilon=10)
6172
pac = algo.get_pac()
6273
print(f'Algorithm result: {RED}{pac}{ENDC}.')
6374
print(
64-
'Delta is less than min_delta? What does it mean? Let\'s look at highlights:'
75+
f'A result with δ = {BLUE}{pac.delta}{ENDC} is unexpected. To understand what is happening, we examine the highlights.'
6576
)
6677

67-
# Default values are eps_1 = min_epsilon and eps_2 = pac.epsilon
6878
print(
69-
f'Highlights between {BLUE}0{ENDC} and {BLUE}{pac.epsilon}{ENDC}: {algo.get_highlights()}\n'
70-
)
71-
print(f'''There are no highlights! Maybe we\'ve selected wrong parameters?
72-
Which epsilon should we use: {BLUE}10{ENDC} for temperatures or {BLUE}1500{ENDC} for RPM? We\'ll need some math to answer this.
73-
74-
Parallelepiped uses Chebyshev metric to calculate distance between value tuples: {GRAY}d(x, y) = max{{|x[0] - y[0]|, ..., |x[n] - y[n]|}}{ENDC}.
75-
But in our case difference between RPM values will always be greater than difference between temperatures!
76-
For such situations all coodinate-wise-metric-based domains (currently it's Parallelepiped and Ball, but you can
77-
define your own in C++) have additional parameter: list of levelling coefficients. When it\'s passed,
78-
distance becomes {GRAY}d(x, y) = max{{|x[0] - y[0]| * lc[0], ..., |x[n] - y[n]| * lc[n]}}{ENDC}.
79-
80-
Let\'s use levelling_coefficients={BLUE}[1, 0.01]{ENDC} to normalize temperatures and RPM.'''
81-
)
82-
# Also parallelepiped uses product compare: x < y iff x[0] < y[0] & ... & x[n] < y[n]
79+
f'''Highlights between {BLUE}0{ENDC} and {BLUE}{pac.epsilon}{ENDC} are: {BOLD}{algo.get_highlights(0, pac.epsilon)}{ENDC}.
80+
''')
81+
print(
82+
f'''There are very few hihglights, which suggests that the parameters may not be chosen correctly.
83+
84+
The question is: what does ε = {BLUE}{pac.epsilon}{ENDC} mean in two-dimensional domain? Should ε correspond to:
85+
* 10 degrees of temperature difference, or
86+
* 1500 RPM difference?
87+
To answer this, we need to understand how distnace is computed.
88+
89+
The parallelepiped uses the Chebyshev metric to calculate distance between value tuples:
90+
{BOLD}d(x, y) = max{{|x[1] - y[1]|, ..., |x[n] - y[n]|}}{ENDC}
91+
In our case:
92+
* temperature differences are on the order of tens;
93+
* RPM differences are on the order of thousands.
94+
As a result, RPM differences dominate the distance computation, making temperature differences
95+
almost irrelevant. This issue affects all coordinate-wise metric-based domains (currently
96+
Parallelepiped and Ball, though custom domains can be implemented in C++).
97+
98+
To address this, such domains support {CYAN}levelling coefficients{ENDC}, which rescale individual
99+
dimensions. With levelling coefficients, the distance becomes:
100+
{BOLD}d(x, y) = max{{|x[1] - y[1]| * lc[1], ..., |x[n] - y[n]| * lc[n]}}{ENDC}
101+
To normalize temperatures and RPM scales, we use levelling_coefficients={BLUE}[1, 0.01]{ENDC} parameter.
102+
This treats a 100 RPM difference as comparable to a 1°C difference.
103+
104+
With levelling coefficients applied, we rerun the algorithm.
105+
''')
106+
83107
parallelepiped = desbordante.pac.domains.Parallelepiped(['85', '1500'],
84108
['95', '3500'],
85109
[1, 0.01])
86110
algo = desbordante.pac_verification.algorithms.DomainPACVerifier()
87111
algo.load_data(table=(ENGINE_TEMPS, ',', True),
88112
domain=parallelepiped,
89113
column_indices=[0, 1])
90-
algo.execute(max_epsilon=15, min_delta=0.85)
114+
algo.execute(max_epsilon=10, min_delta=0.9)
91115
print(f'''Algorithm result: {GREEN}{algo.get_pac()}{ENDC}.
116+
This result is now meaningful and consistent with the findings from the first example.
92117
93-
Now it\'s recommended to continue with reading third example ({CYAN}examples/basic/verifying_pac/verifying_domain_pac3.py{ENDC}),
94-
which introduces another basic domain: Ball.''')
118+
It is recommended to continue with the third example ({CYAN}examples/basic/verifying_pac/verifying_domain_pac3.py{ENDC}),
119+
which introduces another basic domain type: Ball.''')

examples/datasets/verifying_pac/engine_temps_good.csv

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -11,7 +11,7 @@ t
1111
89
1212
94
1313
96
14-
101
14+
100
1515
93
1616
90
1717
88

0 commit comments

Comments
 (0)