Implement Domain PAC validation #611
Open
p-senichenkov wants to merge 45 commits into Desbordante:main from p-senichenkov:PAC-validation
45 commits
8d21e87
Make model::Type::Destructor take const pointer
p-senichenkov cb09ffd
Implement base model classes
p-senichenkov e1c1758
Add PAC-related names and descriptions
p-senichenkov 7e46bbf
Implement base class for PAC verifiers
p-senichenkov 2e31485
Implement Domain PAC verifier
p-senichenkov 6dfe4ac
Implement Domain PAC verifier tests
p-senichenkov cbc57d9
Implement Domain PAC bindings
p-senichenkov 5636a72
Add Domain PAC verifier Python examples
p-senichenkov b12245a
Edit PAC verifier to make it suitable for FD and UCC PACs
p-senichenkov 129907c
Add separate "MakeTuples" function
p-senichenkov 8aee84e
Add Tuples typedef
p-senichenkov b91a844
Fix formatting in bind_pac.cpp
p-senichenkov f2c1090
Remove PACHighlight base class
p-senichenkov aee3cc6
Add relational schema to PAC
p-senichenkov 6f264e9
Save rel schema in Domain PAC verifier
p-senichenkov 350ea76
Bind __eq__ for Domain PAC
p-senichenkov 4728350
Fix PAC creation in Domain PAC verifier
p-senichenkov dca5516
Don't try to set eps and delta on PAC null pointer
p-senichenkov 3d36ca5
Fix comparer in Parallelepiped domain
p-senichenkov 77fa6f9
Normalize includes in bindings, edit Python descriptions
p-senichenkov 71f73bb
Edit PAC-related names and descriptions
p-senichenkov bda7cb7
Normalize includes in Domain PAC verifier, use another algo
p-senichenkov ebd1ebc
Normalize includes in PAC verififer, use another algo
p-senichenkov 308642a
Edit tests
p-senichenkov cb4e8cb
Normalize includes in base model classes, remove comparer
p-senichenkov bbb84d5
Fix bindigns tests
p-senichenkov 8f42dc2
Some minor changes to PAC verifier (related to min_delta)
p-senichenkov dac4e49
Do not try to take prev(end) if end == begin in Domain PAC verifier
p-senichenkov 3fc4341
Actualize examples basic-1 and basic-2
p-senichenkov b0d00ad
Apply review suggestions (examples)
p-senichenkov 4777f39
Apply review suggestions (base model classes)
p-senichenkov 00466ad
Apply review suggestions (PAC verifier)
p-senichenkov 41f12d8
Apply review suggestions (Domain PAC verifier)
p-senichenkov f43b2c2
Remove unneded names, edit descriptions
p-senichenkov 63f0963
Edit Domain PAC tests
p-senichenkov 859b647
Edit Domain PAC bindings
p-senichenkov 53293e6
Actualize examples basic-3, basic-4, advanced
p-senichenkov f83c8ed
Fix formatting in PAC verifier
p-senichenkov 5e41dfd
Fix formatting in bindings
p-senichenkov bf35fc4
Update snapshots for examples tests
p-senichenkov fa17fc4
Fix formatting in PAC verifier
p-senichenkov cca8191
Fix formatting in tests
p-senichenkov 279d71f
Do not use epsilon end delta in Domain PAC __hash__
p-senichenkov a0fd9d6
Fix examples
p-senichenkov ab7f699
Migrate to target-based CMake
p-senichenkov
129 changes: 129 additions & 0 deletions
examples/advanced/verifying_pac/verifying_domain_pac_custom_domain.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,129 @@ | ||
| '''Example 1 (advanced): Custom domain''' | ||
|
|
||
| from tabulate import tabulate | ||
| from csv import reader | ||
| from math import sqrt | ||
|
|
||
| import desbordante | ||
|
|
||
| RED = '\033[31m' | ||
| GREEN = '\033[32m' | ||
| BLUE = '\033[34m' | ||
| CYAN = '\033[36m' | ||
| BOLD = '\033[1;37m' | ||
| ENDC = '\033[0m' | ||
|
|
||
| USER_PREFERENCES = 'examples/datasets/verifying_pac/user_preferences.csv' | ||
|
|
||
|
|
||
| def csv_to_str(filename: str) -> str: | ||
| with open(filename, newline='') as table: | ||
| rows = list(reader(table, delimiter=',')) | ||
| headers = rows[0] | ||
| rows = rows[1:] | ||
| return tabulate(rows, headers=headers) | ||
|
|
||
|
|
||
| print( | ||
| f'''This example illustrates the usage of Domain Probabilistic Approximate Constraints (Domain PACs). | ||
| A Domain PAC on a column set X and domain D, with given ε and δ, means that {BOLD}Pr(x ∈ D±ε) ≥ δ{ENDC}. | ||
| For more information, see "Checks and Balances: Monitoring Data Quality Problems in Network | ||
| Traffic Databases" by Flip Korn et al. (Proceedings of the 29th VLDB Conference, Berlin, 2003). | ||
| If you have not read the basic Domain PAC examples yet (see the {CYAN}examples/basic/verifying_pac/{ENDC} | ||
| directory), it is recommended to start there. | ||
|
|
||
| Assume we have a dataset of user preferences, where each user\'s interest in several topics is | ||
| encoded as values in [0, 1], where 0 is "not interested at all" and 1 is "very interested": | ||
| {BOLD}{csv_to_str(USER_PREFERENCES)}{ENDC} | ||
|
|
||
| We need to estimate whether this group of users will be interested in the original Domain PAC paper | ||
| ("Checks and Balances: ..."). | ||
| To do this, we represent each user profile as a vector in a multi-dimensional topic space: | ||
| ^ Topic 2 | ||
| | | ||
| | user | ||
| | x | ||
| | / | ||
| |/ Topic 1 | ||
| -+-------> | ||
| | | ||
|
|
||
| A "perfect" target reader might have the profile: {BLUE}(0.9, 0.4, 0.05){ENDC}. | ||
| This corresponds to: | ||
| * high interest in Databases; | ||
| * moderate interest in Networks; | ||
| * low interest in Machine Learning. | ||
| Our goal is to measure how close real users are to this ideal profile. | ||
|
|
||
| We use cosine distance, which measures the angle between two vectors rather than their absolute | ||
| length. This is useful because we care about interest proportions, not total magnitude. | ||
| {BOLD}dist(x, y) = 1 - cos(angle between x and y) = 1 - (x, y)/(|x| * |y|){ENDC}, | ||
| where (x, y) is the dot product of x and y. | ||
| ''') | ||
|
|
||
|
|
||
| def cosine_dist(x: list[float], y: list[float]) -> float: | ||
| dot_product = 0 | ||
| x_length = 0 | ||
| y_length = 0 | ||
| for i in range(len(x)): | ||
| dot_product += x[i] * y[i] | ||
| x_length += x[i] * x[i] | ||
| y_length += y[i] * y[i] | ||
| x_length = sqrt(x_length) | ||
| y_length = sqrt(y_length) | ||
| return 1 - dot_product / (x_length * y_length) | ||
|
|
||
|
|
||
| print(f'''A custom domain is defined by two parameters: | ||
| 1. Distance function -- takes a value tuple and returns the distance to the domain. | ||
| 2. Domain name (optional) -- used for readable output. | ||
| In this example: | ||
| * distance function: {BLUE}dist(x, (0.9, 0.4, 0.05)){ENDC}; | ||
| * domain name: {BLUE}"(0.9, 0.4, 0.05)"{ENDC}. | ||
| This effectively defines the domain as "users close to the ideal profile". | ||
| ''') | ||
|
|
||
| PERFECT_USER = [0.9, 0.4, 0.05] | ||
|
|
||
|
|
||
| # Argument is always a list of strings | ||
| def dist_from_domain(x: list[str]) -> float: | ||
| x_f = [float(x_i) for x_i in x] | ||
| return cosine_dist(x_f, PERFECT_USER) | ||
|
|
||
|
|
||
| domain = desbordante.pac.domains.CustomDomain(dist_from_domain, | ||
| "(0.9, 0.4, 0.05)") | ||
|
|
||
| print(f'We run the Domain PAC verifier with domain={BLUE}{domain}{ENDC}.') | ||
| algo = desbordante.pac_verification.algorithms.DomainPACVerifier() | ||
| algo.load_data(table=(USER_PREFERENCES, ',', True), | ||
| domain=domain, | ||
| column_indices=[0, 1, 2]) | ||
| algo.execute() | ||
| pac_1 = algo.get_pac() | ||
| print(f'''Algorithm result: | ||
| {GREEN}{pac_1}{ENDC} | ||
| Now we lower the required probability threshold: min_delta={BLUE}0.6{ENDC}.''') | ||
|
|
||
| algo.execute(min_delta=0.6) | ||
| pac_2 = algo.get_pac() | ||
|
|
||
| print(f'''Algorithm result: | ||
| {GREEN}{pac_2}{ENDC} | ||
| Interpretation: | ||
| * With a larger ε ({BLUE}{pac_1.epsilon:.3f}{ENDC}), nearly all users show some level of interest. | ||
| * With a very small ε ({BLUE}{pac_2.epsilon:.3f}{ENDC}), only {BLUE}{pac_2.delta * 100.0:.0f}%{ENDC} of users closely match the ideal reader. | ||
|
|
||
| You can check outliers to identify which users are closer to or farther from the ideal profile. | ||
| For an introduction to outliers, see {CYAN}examples/basic/verifying_pac/verifying_domain_pac1.py{ENDC}. | ||
|
|
||
| You now know how to use Domain PACs with built-in domains as well as custom ones. | ||
| Try applying them to your own data and see what insights you can uncover.''' | ||
| ) | ||
|
|
||
| # C++ note: The custom domain is called "Untyped domain" in the C++ code because it erases type | ||
| # information, converting all values to strings. If you use the C++ library, it is recommended to | ||
| # implement the IDomain interface or derive from MetricBasedDomain (if your domain is based on | ||
| # coordinate-wise metrics). See the Parallelepiped and Ball implementations for examples. |
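Reviewer-style note, not part of the PR: the semantics the custom-domain example relies on can be sketched without the library. The snippet below is a minimal illustration that assumes "x ∈ D±ε" means "the distance function returns at most ε" (the closed boundary is an assumption) and uses hypothetical rows rather than the contents of `user_preferences.csv`. The helper `empirical_delta` is an illustrative name, not a Desbordante API.

```python
from math import sqrt


def cosine_dist(x: list[float], y: list[float]) -> float:
    """Cosine distance: 1 - (x, y) / (|x| * |y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    return 1 - dot / (norm_x * norm_y)


PERFECT_USER = [0.9, 0.4, 0.05]


def dist_from_domain(row: list[str]) -> float:
    # Mirrors the example: values arrive as strings and are converted here.
    return cosine_dist([float(v) for v in row], PERFECT_USER)


def empirical_delta(rows, dist, epsilon: float) -> float:
    """Fraction of rows whose distance to the domain is at most epsilon.

    Hypothetical helper: assumes D±ε is the set {x : dist(x) <= ε}.
    """
    inside = sum(1 for row in rows if dist(row) <= epsilon)
    return inside / len(rows)


# Hypothetical user profiles (Databases, Networks, Machine Learning).
ROWS = [
    ['0.9', '0.4', '0.05'],    # the ideal profile itself
    ['0.45', '0.2', '0.025'],  # same proportions, half the magnitude
    ['0.0', '0.0', '1.0'],     # interested in Machine Learning only
    ['0.1', '0.9', '0.1'],     # mostly interested in Networks
]

for eps in (0.01, 0.5, 1.0):
    print(f'eps={eps}: delta={empirical_delta(ROWS, dist_from_domain, eps)}')
```

The second row shows why cosine distance suits this use case: scaling a profile does not change its distance to the ideal one.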
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,169 @@ | ||
| '''Example 1: 1D segment, highlights''' | ||
|
|
||
| from tabulate import tabulate | ||
| from csv import reader | ||
|
|
||
| import desbordante | ||
|
|
||
| RED = '\033[31m' | ||
| YELLOW = '\033[33m' | ||
| BOLD_YELLOW = '\033[1;33m' | ||
| GREEN = '\033[32m' | ||
| BLUE = '\033[34m' | ||
| CYAN = '\033[36m' | ||
| BOLD = '\033[1;37m' | ||
| ENDC = '\033[0m' | ||
|
|
||
| ENGINE_TEMPS_BAD = 'examples/datasets/verifying_pac/engine_temps_bad.csv' | ||
| ENGINE_TEMPS_GOOD = 'examples/datasets/verifying_pac/engine_temps_good.csv' | ||
|
|
||
|
|
||
| def read_column(filename: str, col_num: int) -> tuple[str, list[str]]: | ||
| with open(filename, newline='') as table: | ||
| rows = list(reader(table, delimiter=',')) | ||
| header = rows[0][col_num] | ||
| values = [row[col_num] for row in rows[1:]] | ||
| return header, values | ||
|
|
||
|
|
||
| def column_to_str(filename: str, col_num: int) -> str: | ||
| header, values = read_column(filename, col_num) | ||
| values_str = ', '.join(values) | ||
| return f'{BOLD}{header}: [{values_str}]{ENDC}' | ||
|
|
||
|
|
||
| def display_columns_diff(filename_old: str, col_num_old: int, | ||
| filename_new: str, col_num_new: int) -> str: | ||
| _, values_old = read_column(filename_old, col_num_old) | ||
| header, values_new = read_column(filename_new, col_num_new) | ||
| values = [] | ||
| for i in range(len(values_new)): | ||
| value = values_new[i] | ||
| if values_old[i] != value: | ||
| value = f'{BOLD_YELLOW}' + value + f'{BOLD}' | ||
| values.append(value) | ||
| values_str = ', '.join(values) | ||
| return f'{BOLD}{header}: [{values_str}]{ENDC}' | ||
|
|
||
|
|
||
| print( | ||
| f'''This example illustrates the usage of Domain Probabilistic Approximate Constraints (Domain PACs). | ||
| A Domain PAC on a column set X and domain D, with given ε and δ, means that Pr(x ∈ D±ε) ≥ δ. | ||
| For more information, see "Checks and Balances: Monitoring Data Quality Problems in Network | ||
| Traffic Databases" by Flip Korn et al. (Proceedings of the 29th VLDB Conference, Berlin, 2003). | ||
|
|
||
| This is the first example in the "Basic Domain PAC verification" series. The others can be found in the | ||
| {CYAN}examples/basic/verifying_pac/{ENDC} directory. | ||
| ''') | ||
|
|
||
| print( | ||
| f'''Suppose we are developing a new engine model. Its operating temperature range is {BLUE}[85, 95]{ENDC}°C. | ||
| The engine is made of high-strength metal, so short-term temperature deviations are acceptable and | ||
| will not cause immediate damage. In other words, the engine operates properly when Pr(t ∈ [85, 95]±ε) ≥ δ. | ||
| Based on engineering analysis, the acceptable limits are: ε = {BLUE}5{ENDC}, δ = {BLUE}0.9{ENDC}. | ||
|
|
||
| In terms of Domain PACs, the following constraint should hold: {BLUE}Pr(x ∈ [85, 95]±5) ≥ 0.9{ENDC}. | ||
| ''') | ||
|
|
||
| print( | ||
| 'The following table contains readings from the engine temperature sensor:' | ||
| ) | ||
| # Values are printed in one line for brevity, original table is single-column | ||
| print(f'{column_to_str(ENGINE_TEMPS_BAD, 0)}') | ||
| print() | ||
|
|
||
| print( | ||
| 'We now use the Domain PAC verifier to determine whether the engine is operating safely.' | ||
| ) | ||
|
|
||
| print( | ||
| f'''First, we need to define the domain. Available options are: | ||
| * {BLUE}Parallelepiped{ENDC} -- a closed n-dimensional parallelepiped | ||
| * {BLUE}Ball{ENDC} -- a closed n-dimensional ball | ||
| * {BLUE}CustomDomain{ENDC} -- a domain with user-defined metric | ||
| A segment is simply a one-dimensional parallelepiped, so we use the {BLUE}Parallelepiped{ENDC} domain here.''') | ||
| # Parallelepiped has a special constructor for segments. | ||
| # Notice the usage of quotes: these strings will be converted to values once the table is loaded. | ||
| segment = desbordante.pac.domains.Parallelepiped('85', '95') | ||
|
|
||
| print( | ||
| f'''We run the algorithm with the following options: domain={BLUE}{segment}{ENDC}. All other parameters use default | ||
| values: min_epsilon={BLUE}0{ENDC}, max_epsilon={BLUE}∞{ENDC}, min_delta={BLUE}0.9{ENDC}, delta_steps={BLUE}100{ENDC}. | ||
| ''') | ||
|
|
||
| algo = desbordante.pac_verification.algorithms.DomainPACVerifier() | ||
| # Note that domain should be set in `load_data`, not `execute` | ||
| algo.load_data(table=(ENGINE_TEMPS_BAD, ',', True), | ||
| column_indices=[0], | ||
| domain=segment) | ||
| algo.execute() | ||
|
|
||
| print(f'Algorithm result: {YELLOW}{algo.get_pac()}{ENDC}.\n') | ||
| print( | ||
| f'''This result is not directly informative for our goal. Since both ε and δ exceed the required values, | ||
| we cannot determine whether the constraint holds for ε={BLUE}5{ENDC} and δ={BLUE}0.9{ENDC}. | ||
|
|
||
| Let\'s run the algorithm with min_epsilon={BLUE}5{ENDC} and max_epsilon={BLUE}5{ENDC}. This will give us the exact δ | ||
| for which the PAC with ε={BLUE}5{ENDC} holds. | ||
| ''') | ||
|
|
||
| # Note that when min_epsilon or max_epsilon is specified, the default min_delta becomes 0 | ||
| algo.execute(min_epsilon=5, max_epsilon=5) | ||
|
|
||
| print(f'Algorithm result: {RED}{algo.get_pac()}{ENDC}.\n') | ||
| print( | ||
| f'''Also, let\'s run the algorithm with max_epsilon={BLUE}0{ENDC} and min_delta={BLUE}0.9{ENDC} to check which ε | ||
| is needed to satisfy δ={BLUE}0.9{ENDC}. With these parameters, the algorithm enters a special mode and returns | ||
| the pair (ε, min_delta), so that we can validate the PAC with the given δ. | ||
| ''') | ||
|
|
||
| # Actually, the algorithm enters this mode whenever max_epsilon is less than the epsilon needed to satisfy | ||
| # min_delta. | ||
| algo.execute(max_epsilon=0, min_delta=0.9) | ||
|
|
||
| pac = algo.get_pac() | ||
| print(f'Algorithm result: {RED}{pac}{ENDC}.\n') | ||
| print( | ||
| f'''Here the algorithm reports δ={BLUE}{pac.delta}{ENDC}, which is greater than {BLUE}0.9{ENDC}: reaching δ={BLUE}0.9{ENDC} requires | ||
| ε={BLUE}{pac.epsilon}{ENDC}, and at that ε the PAC ({BLUE}{pac.epsilon}{ENDC}, {BLUE}{pac.delta}{ENDC}) already holds. In other words, δ={BLUE}0.9{ENDC} cannot be satisfied with any ε smaller than {BLUE}{pac.epsilon}{ENDC}. | ||
| ''') | ||
|
|
||
| print( | ||
| 'We can see that the desired PAC doesn\'t hold, so the engine can blow up!\n') | ||
|
|
||
| print( | ||
| f'''Let\'s look at the values that violate the PAC. The Domain PAC verifier can detect values between eps_1 | ||
| and eps_2, i.e. values that lie in D±eps_2 \\ D±eps_1. Such values are called outliers (or highlights). | ||
| Let\'s find the outliers for different values of eps_1 and eps_2:''') | ||
|
|
||
| value_ranges = [(0, 1), (1, 2), (2, 3), (3, 5), (5, 7), (7, 10)] | ||
| highlights_table = [(f'{BLUE}{v_range[0]}{ENDC}', f'{BLUE}{v_range[1]}{ENDC}', | ||
| str(algo.get_highlights(*v_range))) | ||
| for v_range in value_ranges] | ||
| print(tabulate(highlights_table, headers=('eps_1', 'eps_2', 'outliers'))) | ||
| print() | ||
|
|
||
| print('''We can see two problems: | ||
| 1. The engine operated at low temperatures for an extended period, slightly below 80°C. | ||
| 2. The peak temperature was too high, but this occurred only once.\n''') | ||
|
|
||
| print('''The second version of the engine has: | ||
| 1. A pre-heating system to prevent operation at low temperatures. | ||
| 2. An emergency cooling system to limit peak temperatures. | ||
| The updated sensor readings (modified values highlighted) are:''') | ||
| print(f'{display_columns_diff(ENGINE_TEMPS_BAD, 0, ENGINE_TEMPS_GOOD, 0)}') | ||
| print() | ||
|
|
||
| print(f'''We run the Domain PAC verifier again.''') | ||
| algo = desbordante.pac_verification.algorithms.DomainPACVerifier() | ||
| algo.load_data(table=(ENGINE_TEMPS_GOOD, ',', True), | ||
| column_indices=[0], | ||
| domain=segment) | ||
| algo.execute() | ||
|
|
||
| print(f'''Algorithm result: {GREEN}{algo.get_pac()}{ENDC}. | ||
|
|
||
| The desired PAC now holds, which means the improved engine operates within acceptable limits. | ||
|
|
||
| It is recommended to continue with the second example ({CYAN}examples/basic/verifying_pac/verifying_domain_pac2.py{ENDC}), | ||
| which demonstrates more advanced usage of the Parallelepiped domain.''') |
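Reviewer-style note, not part of the PR: for readers who want to check the arithmetic behind the segment example by hand, here is a minimal library-free sketch. It assumes that the distance from a value to the segment [low, high] is zero inside the segment and grows linearly outside it, and that the outliers "between eps_1 and eps_2" are the values with eps_1 < distance ≤ eps_2. Both boundary conventions are assumptions. The readings are hypothetical, not the contents of `engine_temps_bad.csv`, and the helper names are illustrative rather than Desbordante APIs.

```python
def segment_dist(value: float, low: float, high: float) -> float:
    """Distance from a value to the closed segment [low, high]."""
    return max(low - value, 0.0, value - high)


def segment_delta(values, low: float, high: float, epsilon: float) -> float:
    """Fraction of values lying in [low - epsilon, high + epsilon]."""
    inside = sum(1 for v in values if segment_dist(v, low, high) <= epsilon)
    return inside / len(values)


def outliers_between(values, low: float, high: float,
                     eps_1: float, eps_2: float) -> list[float]:
    """Values in D±eps_2 but not in D±eps_1 (assumed: eps_1 < dist <= eps_2)."""
    return [v for v in values
            if eps_1 < segment_dist(v, low, high) <= eps_2]


# Hypothetical sensor readings, in °C.
READINGS = [79.0, 84.0, 86.0, 88.0, 90.0, 91.0, 93.0, 94.0, 97.0, 104.0]

delta = segment_delta(READINGS, 85.0, 95.0, epsilon=5.0)
print(f'delta at eps=5: {delta}; constraint holds: {delta >= 0.9}')

for eps_1, eps_2 in [(0, 1), (1, 2), (5, 7), (7, 10)]:
    print(eps_1, eps_2, outliers_between(READINGS, 85.0, 95.0, eps_1, eps_2))
```

With these made-up readings, the constraint Pr(x ∈ [85, 95]±5) ≥ 0.9 fails because two of the ten values lie more than 5 degrees outside the segment.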