
Commit 9a0126f

Add PyDough metadata for masked table columns (#401)
Adds a new metadata type `masked table column`, a subclass of `table column` representing data that is stored in the underlying table in a masked (encrypted) form, where the metadata additionally records how to decrypt it. These new metadata column types have the following additional properties:

- `unprotect protocol`: a Python format string containing SQL text that, when a value is injected, indicates how to decrypt the data.
- `protect protocol`: a Python format string containing SQL text that, when a value is injected, indicates how to encrypt the value in the same manner as the column was encrypted, so that the unprotect protocol reverses it.
- `server masked`: optional boolean (default False) which, if True, indicates that information about the encryption/decryption scheme is available on a server that can be queried to rewrite/optimize predicates to avoid unmasking.

For example, suppose a string column `name` is "encrypted" by uppercasing it and "decrypted" by lowercasing it, and does not have a server. This would be the following metadata:

```json
{
  "name": "name",
  "type": "masked table column",
  "column name": "c_name",
  "data type": "string",
  "unprotect protocol": "LOWER({})",
  "protect protocol": "UPPER({})",
  "server masked": false
}
```

So for instance, when reading the data from the table, instead of placing `c_name` in the SELECT clause we would place `LOWER(c_name)` in the SELECT clause so it is "decrypted" (since the data in the table was uppercase). Conversely, if we wish to filter on `name == "john smith"`, instead of doing `LOWER(c_name) == "john smith"` we could use the protect protocol to rewrite the filter as `c_name == UPPER("john smith")`.
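The rewrite described above can be sketched with plain `str.format`. This is only an illustration under assumptions: the helper names `unmask_select` and `masked_equality` are hypothetical and not part of PyDough, the literal quoting is deliberately naive, and the protocol strings follow the prose of the example (data stored uppercase, unmasked by lowercasing).

```python
# Protocol strings following the uppercase/lowercase example: the stored data
# is uppercase, so unmasking lowercases it and masking uppercases a value.
UNPROTECT_PROTOCOL = "LOWER({})"
PROTECT_PROTOCOL = "UPPER({})"


def unmask_select(column_name: str) -> str:
    """Hypothetical helper: SQL expression used in SELECT to read unmasked data."""
    return UNPROTECT_PROTOCOL.format(column_name)


def masked_equality(column_name: str, literal: str) -> str:
    """Hypothetical helper: mask the literal instead of unmasking the column."""
    quoted = "'" + literal.replace("'", "''") + "'"  # naive SQL string quoting
    return f"{column_name} = {PROTECT_PROTOCOL.format(quoted)}"


print(unmask_select("c_name"))                  # LOWER(c_name)
print(masked_equality("c_name", "john smith"))  # c_name = UPPER('john smith')
```

Masking the literal rather than unmasking the column leaves the column reference bare in the predicate, which is the point of the protect protocol in the filter case.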
1 parent ab44eb9 commit 9a0126f

17 files changed: +627 -142 lines changed

documentation/metadata.md

Lines changed: 51 additions & 2 deletions
@@ -9,6 +9,7 @@ This page document the exact format that the JSON files containing PyDough metad
 * [Collection Type: Simple Table](#collection-type-simple-table)
 - [Properties](#properties)
 * [Property Type: Table Column](#property-type-table-column)
+* [Property Type: Masked Table Column](#property-type-masked-table-column)
 - [Relationships](#relationships)
 * [Relationship Type: Simple Join](#relationship-type-simple-join)
 * [Relationship Type: General Join](#relationship-type-general-join)
@@ -136,15 +137,63 @@ Example of the structure of the metadata for a table column property:
 {
   "name": "account balance",
   "type": "table column",
-  "column_name": "ba_bal",
-  "data_type": "numeric",
+  "column name": "ba_bal",
+  "data type": "numeric",
   "description": "The amount of money currently in the account",
   "sample values": [0.0, 123.45, 999864.00],
   "synonyms": ["amount", "value", "balance"],
   "extra semantic info": {...}
 }
 ```

+<!-- TOC --><a name="property-type-masked-table-column"></a>
+### Property Type: Masked Table Column
+
+A property with this type is the same as a regular table column, except the data in the underlying table has been masked using an encryption protocol. The metadata includes the information required to either encrypt values in SQL using the same masking protocol, or unmask previously masked data in SQL queries.
+
+Properties of this type use the type string "masked table column" and include all the properties from [table column](#property-type-table-column), plus the following additional key-value pairs in their metadata JSON object:
+
+- `unprotect protocol` (required): a Python format string representing the SQL text that is used to unmask the data after reading it from the underlying table. The format string should expect a single placeholder value (e.g. `"SUBSTRING({0}, -1) || SUBSTRING({0}, 1, LENGTH({0}) - 1)".format("c_name")` will generate the SQL text `SUBSTRING(c_name, -1) || SUBSTRING(c_name, 1, LENGTH(c_name) - 1)`).
+- `protect protocol` (required): a Python format string, in the same format as `unprotect protocol`, used to describe how the data was originally masked. This can be used to generate masked values consistent with the encryption scheme, allowing operations such as comparisons between masked data.
+- `protected data type` (optional): same as `data type`, except referring to the type of the data when it is protected, whereas `data type` refers to the raw unprotected column. If omitted, it is assumed that the data type is the same between the unprotected vs protected data.
+- `server masked` (optional): a boolean flag indicating whether the column was masked on a server that is attached to PyDough. If `true`, PyDough can use it to optimize queries by rewriting predicates and expressions to avoid unmasking the data.
+
+Example of the structure of the metadata for a masked table column property where the string data is masked by moving the first character to the end, and unmasked by moving it back to the beginning:
+
+```json
+{
+  "name": "name",
+  "type": "masked table column",
+  "column name": "c_name",
+  "data type": "string",
+  "unprotect protocol": "SUBSTRING({0}, -1) || SUBSTRING({0}, 1, LENGTH({0}) - 1)",
+  "protect protocol": "SUBSTRING({0}, 2) || SUBSTRING({0}, 1, 1)",
+  "description": "The name of the customer",
+  "sample values": ["John Smith", "Adrien Lee", "Anna Rodriguez"],
+  "synonyms": ["full name"],
+  "extra semantic info": {...}
+}
+```
+
+Another example of the structure of the metadata for a masked table column property where the numeric is masked by converting it to a string, switching the `0` digits with asterisks, and left-padding to length 10 with asterisks, then unmasked by reversing the process:
+
+```json
+{
+  "name": "account_id",
+  "type": "masked table column",
+  "column name": "a_id",
+  "data type": "numeric",
+  "protected data type": "string",
+  "unprotect protocol": "INTEGER(REPLACE({0}, '*', '0'))",
+  "protect protocol": "LPAD(REPLACE(STRING({0}), '0', '*'), 10, '*')",
+  "server masked": false,
+  "description": "The id of the bank account",
+  "sample values": [12030061, 4000013, 560003],
+  "synonyms": ["account key", "bank account number"],
+  "extra semantic info": {...}
+}
+```
+
 <!-- TOC --><a name="relationships"></a>
 ## Relationships
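A rough sketch of how a consumer might read the keys documented in this file and render SQL from the two protocols. The class `MaskedColumnSketch` and its methods are hypothetical and are not PyDough's `MaskedTableColumnMetadata` (whose source is not shown here). The defaults follow the text: `protected data type` falls back to `data type`, and `server masked` is assumed to default to false, as the commit message states.

```python
from dataclasses import dataclass


@dataclass
class MaskedColumnSketch:
    """Illustrative stand-in, NOT PyDough's MaskedTableColumnMetadata."""

    name: str
    column_name: str
    data_type: str
    unprotect_protocol: str
    protect_protocol: str
    protected_data_type: str
    server_masked: bool

    @classmethod
    def from_json(cls, prop: dict) -> "MaskedColumnSketch":
        if prop.get("type") != "masked table column":
            raise ValueError(f"unexpected type: {prop.get('type')!r}")
        # Keys this sketch treats as mandatory.
        for key in ("name", "column name", "data type",
                    "unprotect protocol", "protect protocol"):
            if key not in prop:
                raise ValueError(f"missing required key {key!r}")
        return cls(
            name=prop["name"],
            column_name=prop["column name"],
            data_type=prop["data type"],
            unprotect_protocol=prop["unprotect protocol"],
            protect_protocol=prop["protect protocol"],
            # Falls back to the unprotected type when omitted.
            protected_data_type=prop.get("protected data type", prop["data type"]),
            # Assumed default, per the commit message.
            server_masked=prop.get("server masked", False),
        )

    def unprotect_sql(self) -> str:
        """SQL text that unmasks the stored column."""
        return self.unprotect_protocol.format(self.column_name)

    def protect_sql(self, expr: str) -> str:
        """SQL text that masks an arbitrary SQL expression."""
        return self.protect_protocol.format(expr)


account_id = MaskedColumnSketch.from_json({
    "name": "account_id",
    "type": "masked table column",
    "column name": "a_id",
    "data type": "numeric",
    "protected data type": "string",
    "unprotect protocol": "INTEGER(REPLACE({0}, '*', '0'))",
    "protect protocol": "LPAD(REPLACE(STRING({0}), '0', '*'), 10, '*')",
})
print(account_id.unprotect_sql())          # INTEGER(REPLACE(a_id, '*', '0'))
print(account_id.protect_sql("12030061"))  # LPAD(REPLACE(STRING(12030061), '0', '*'), 10, '*')
```

The sketch only produces SQL text. Whether functions such as `SUBSTRING` with a negative start or `LPAD` behave as described depends on the SQL dialect in use.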

pydough/metadata/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -7,6 +7,7 @@
     "CollectionMetadata",
     "GeneralJoinMetadata",
     "GraphMetadata",
+    "MaskedTableColumnMetadata",
     "PropertyMetadata",
     "PyDoughMetadataException",
     "SimpleJoinMetadata",
@@ -23,6 +24,7 @@
 from .properties import (
     CartesianProductMetadata,
     GeneralJoinMetadata,
+    MaskedTableColumnMetadata,
     PropertyMetadata,
     SimpleJoinMetadata,
     SubcollectionRelationshipMetadata,

pydough/metadata/abstract_metadata.py

Lines changed: 22 additions & 0 deletions
@@ -6,6 +6,8 @@

 from abc import ABC, abstractmethod

+from .errors import extract_array, extract_object, extract_string
+

 class AbstractMetadata(ABC):
     """
@@ -28,6 +30,26 @@ def __init__(
         self._synonyms: list[str] | None = synonyms
         self._extra_semantic_info: dict | None = extra_semantic_info

+    def parse_optional_properties(self, meta_json: dict) -> None:
+        """
+        Parse the optional metadata fields from a JSON object describing
+        the metadata to fill the description / synonyms / extra semantic info
+        fields of the metadata object.
+
+        Args:
+            `meta_json`: the JSON object describing the metadata.
+        """
+        if "description" in meta_json:
+            self._description = extract_string(
+                meta_json, "description", self.error_name
+            )
+        if "synonyms" in meta_json:
+            self._synonyms = extract_array(meta_json, "synonyms", self.error_name)
+        if "extra semantic info" in meta_json:
+            self._extra_semantic_info = extract_object(
+                meta_json, "extra semantic info", self.error_name
+            )
+
     @property
     @abstractmethod
     def error_name(self) -> str:
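The pattern this file introduces (construct the object with `None` defaults, then fill the optional fields from JSON in one shared method) can be sketched in isolation. Everything named here is a stand-in: `extract_typed` is a simplified, hypothetical replacement for PyDough's `extract_string` / `extract_array` / `extract_object` helpers, whose real behavior is not shown in this diff, and `MetadataSketch` is a toy class rather than `AbstractMetadata`.

```python
from typing import Any, Optional


def extract_typed(meta_json: dict, key: str, expected: type, error_name: str) -> Any:
    """Simplified, hypothetical stand-in for PyDough's extract_* helpers."""
    value = meta_json[key]
    if not isinstance(value, expected):
        raise ValueError(f"{error_name}: {key!r} must be of type {expected.__name__}")
    return value


class MetadataSketch:
    """Toy class mirroring the shape of the refactor, not PyDough's real class."""

    def __init__(
        self,
        name: str,
        description: Optional[str] = None,
        synonyms: Optional[list] = None,
        extra_semantic_info: Optional[dict] = None,
    ):
        self.name = name
        self._description = description
        self._synonyms = synonyms
        self._extra_semantic_info = extra_semantic_info

    @property
    def error_name(self) -> str:
        return f"sketch metadata {self.name!r}"

    def parse_optional_properties(self, meta_json: dict) -> None:
        # Only overwrite a field when its key is present in the JSON.
        if "description" in meta_json:
            self._description = extract_typed(
                meta_json, "description", str, self.error_name
            )
        if "synonyms" in meta_json:
            self._synonyms = extract_typed(meta_json, "synonyms", list, self.error_name)
        if "extra semantic info" in meta_json:
            self._extra_semantic_info = extract_typed(
                meta_json, "extra semantic info", dict, self.error_name
            )


column = MetadataSketch("account balance")
column.parse_optional_properties(
    {"description": "The amount of money currently in the account", "synonyms": ["amount"]}
)
```

Centralizing the parsing this way is what lets the per-class `parse_from_json` methods in the rest of the commit drop their duplicated optional-field blocks.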

pydough/metadata/collections/collection_metadata.py

Lines changed: 5 additions & 31 deletions
@@ -213,36 +213,6 @@ def get_property(self, property_name: str) -> AbstractMetadata:
     def __getitem__(self, key: str):
         return self.get_property(key)

-    @staticmethod
-    def get_class_for_collection_type(
-        name: str, error_name: str
-    ) -> type["CollectionMetadata"]:
-        """
-        Fetches the PropertyType implementation class for a string
-        representation of the collection type.
-
-        Args:
-            `name`: the string representation of a collection type.
-            `error_name`: the string used in error messages to describe
-            the object that `name` came from.
-
-        Returns:
-            The class of the property type corresponding to `name`.
-
-        Raises:
-            `PyDoughMetadataException` if the string does not correspond
-            to a known class type.
-        """
-        from .simple_table_metadata import SimpleTableMetadata
-
-        match name:
-            case "simple_table":
-                return SimpleTableMetadata
-            case property_type:
-                raise PyDoughMetadataException(
-                    f"Unrecognized collection type for {error_name}: {repr(property_type)}"
-                )
-
     def add_properties_from_json(self, properties_json: list) -> None:
         """
         Insert the scalar properties from the JSON for collection into the
@@ -253,7 +223,7 @@ def add_properties_from_json(self, properties_json: list) -> None:
             scalar property that should be parsed and inserted into the
             collection.
         """
-        from pydough.metadata.properties import TableColumnMetadata
+        from pydough.metadata import MaskedTableColumnMetadata, TableColumnMetadata

         for property_json in properties_json:
             # Extract the name/type, and create the string used to identify
@@ -269,6 +239,10 @@ def add_properties_from_json(self, properties_json: list) -> None:
                     TableColumnMetadata.parse_from_json(
                         self, property_name, property_json
                     )
+                case "masked table column":
+                    MaskedTableColumnMetadata.parse_from_json(
+                        self, property_name, property_json
+                    )
                 case _:
                     raise PyDoughMetadataException(
                         f"Unrecognized property type {property_type!r} for {error_name}"

pydough/metadata/collections/simple_table_metadata.py

Lines changed: 5 additions & 19 deletions
@@ -8,7 +8,6 @@
     NoExtraKeys,
     PyDoughMetadataException,
     extract_array,
-    extract_object,
     extract_string,
     is_string,
     unique_properties_predicate,
@@ -41,9 +40,9 @@ def __init__(
         graph,
         table_path: str,
         unique_properties: list[str | list[str]],
-        description: str | None,
-        synonyms: list[str] | None,
-        extra_semantic_info: dict | None,
+        description: str | None = None,
+        synonyms: list[str] | None = None,
+        extra_semantic_info: dict | None = None,
     ):
         super().__init__(name, graph, description, synonyms, extra_semantic_info)
         is_string.verify(table_path, f"Property 'table_path' of {self.error_name}")
@@ -151,18 +150,6 @@ def parse_from_json(
             collection_json, error_name
         )
         unique_properties: list[str | list[str]] = collection_json["unique properties"]
-        # Extract the optional fields from the JSON.
-        description: str | None = None
-        synonyms: list[str] | None = None
-        extra_semantic_info: dict | None = None
-        if "description" in collection_json:
-            description = extract_string(collection_json, "description", error_name)
-        if "synonyms" in collection_json:
-            synonyms = extract_array(collection_json, "synonyms", error_name)
-        if "extra semantic info" in collection_json:
-            extra_semantic_info = extract_object(
-                collection_json, "extra semantic info", error_name
-            )
         NoExtraKeys(SimpleTableMetadata.allowed_fields).verify(
             collection_json, error_name
         )
@@ -171,10 +158,9 @@ def parse_from_json(
             graph,
             table_path,
             unique_properties,
-            description,
-            synonyms,
-            extra_semantic_info,
         )
+        # Parse the optional common semantic properties like the description.
+        new_collection.parse_optional_properties(collection_json)
         properties: list = extract_array(
             collection_json, "properties", new_collection.error_name
         )

pydough/metadata/graphs/graph_metadata.py

Lines changed: 3 additions & 1 deletion
@@ -48,7 +48,9 @@ def __init__(
         self._name: str = name
         self._collections: dict[str, AbstractMetadata] = {}
         self._functions: dict[str, ExpressionFunctionOperator] = {}
-        super().__init__(description, synonyms, extra_semantic_info)
+        self._description = description
+        self._synonyms = synonyms
+        self._extra_semantic_info = extra_semantic_info

     @property
     def name(self) -> str:

pydough/metadata/properties/README.md

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@ The properties classes in PyDough follow a hierarchy that includes both abstract
 - [`PropertyMetadata`](property_metadata.py) (abstract): Base class for all property metadata.
 - [`ScalarAttributeMetadata`](scalar_attribute_metadata.py) (abstract): Base class for properties that are scalars within each record of a collection.
 - [`TableColumnMetadata`](table_column_metadata.py) (concrete): Represents a column of data from a relational table.
+- [`MaskedTableColumnMetadata`](masked_table_column_metadata.py) (concrete): Represents a variant of a TableColumnMetadata where the data in the table has been encrypted by a masking protocol but the metadata stores information about that protocol, including how to unmask it when reading the data from the table.
 - [`SubcollectionRelationshipMetadata`](subcollection_relationship_metadata.py) (abstract): Base class for properties that map to a subcollection of a collection.
 - [`ReversiblePropertyMetadata`](reversible_property_metadata.py) (abstract): Base class for properties that map to a subcollection and have a corresponding reverse relationship.
 - [`CartesianProductMetadata`](cartesian_product_metadata.py) (concrete): Represents a cartesian product between a collection and its subcollection.

pydough/metadata/properties/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -5,6 +5,7 @@
 __all__ = [
     "CartesianProductMetadata",
     "GeneralJoinMetadata",
+    "MaskedTableColumnMetadata",
     "PropertyMetadata",
     "ReversiblePropertyMetadata",
     "ScalarAttributeMetadata",
@@ -15,6 +16,7 @@

 from .cartesian_product_metadata import CartesianProductMetadata
 from .general_join_metadata import GeneralJoinMetadata
+from .masked_table_column_metadata import MaskedTableColumnMetadata
 from .property_metadata import PropertyMetadata
 from .reversible_property_metadata import ReversiblePropertyMetadata
 from .scalar_attribute_metadata import ScalarAttributeMetadata

pydough/metadata/properties/cartesian_product_metadata.py

Lines changed: 6 additions & 22 deletions
@@ -9,9 +9,7 @@
 from pydough.metadata.collections import CollectionMetadata
 from pydough.metadata.errors import (
     NoExtraKeys,
-    extract_array,
     extract_bool,
-    extract_object,
     extract_string,
 )
 from pydough.metadata.graphs import GraphMetadata
@@ -40,9 +38,9 @@ def __init__(
         parent_collection: CollectionMetadata,
         child_collection: CollectionMetadata,
         always_matches: bool,
-        description: str | None,
-        synonyms: list[str] | None,
-        extra_semantic_info: dict | None,
+        description: str | None = None,
+        synonyms: list[str] | None = None,
+        extra_semantic_info: dict | None = None,
     ):
         super().__init__(
             name,
@@ -56,7 +54,7 @@ def __init__(
         )

     @staticmethod
-    def create_error_name(name: str, collection_error_name: str):
+    def create_error_name(name: str, collection_error_name: str) -> str:
         return f"cartesian property {name!r} of {collection_error_name}"

     @property
@@ -108,19 +106,6 @@ def parse_from_json(
         if "always matches" in property_json:
             always_matches = extract_bool(property_json, "always matches", error_name)

-        # Extract the optional fields from the JSON object.
-        description: str | None = None
-        synonyms: list[str] | None = None
-        extra_semantic_info: dict | None = None
-        if "description" in property_json:
-            description = extract_string(property_json, "description", error_name)
-        if "synonyms" in property_json:
-            synonyms = extract_array(property_json, "synonyms", error_name)
-        if "extra semantic info" in property_json:
-            extra_semantic_info = extract_object(
-                property_json, "extra semantic info", error_name
-            )
-
         NoExtraKeys(CartesianProductMetadata.allowed_fields).verify(
             property_json, error_name
         )
@@ -131,10 +116,9 @@ def parse_from_json(
             parent_collection,
             child_collection,
             always_matches,
-            description,
-            synonyms,
-            extra_semantic_info,
         )
+        # Parse the optional common semantic properties like the description.
+        property.parse_optional_properties(property_json)
         parent_collection.add_property(property)

     def build_reverse_relationship(

pydough/metadata/properties/general_join_metadata.py

Lines changed: 6 additions & 21 deletions
@@ -9,9 +9,7 @@
 from pydough.metadata.collections import CollectionMetadata
 from pydough.metadata.errors import (
     NoExtraKeys,
-    extract_array,
     extract_bool,
-    extract_object,
     extract_string,
 )
 from pydough.metadata.graphs import GraphMetadata
@@ -48,9 +46,9 @@ def __init__(
         condition: str,
         self_name: str,
         other_name: str,
-        description: str | None,
-        synonyms: list[str] | None,
-        extra_semantic_info: dict | None,
+        description: str | None = None,
+        synonyms: list[str] | None = None,
+        extra_semantic_info: dict | None = None,
     ):
         super().__init__(
             name,
@@ -99,7 +97,7 @@ def components(self) -> list:
         return comp

     @staticmethod
-    def create_error_name(name: str, collection_error_name: str):
+    def create_error_name(name: str, collection_error_name: str) -> str:
         return f"general join property {name!r} of {collection_error_name}"

     @staticmethod
@@ -155,18 +153,6 @@ def parse_from_json(
             always_matches = extract_bool(property_json, "always matches", error_name)
         condition = extract_string(property_json, "condition", error_name)

-        # Extract the optional fields from the JSON object.
-        description: str | None = None
-        synonyms: list[str] | None = None
-        extra_semantic_info: dict | None = None
-        if "description" in property_json:
-            description = extract_string(property_json, "description", error_name)
-        if "synonyms" in property_json:
-            synonyms = extract_array(property_json, "synonyms", error_name)
-        if "extra semantic info" in property_json:
-            extra_semantic_info = extract_object(
-                property_json, "extra semantic info", error_name
-            )
         NoExtraKeys(GeneralJoinMetadata.allowed_fields).verify(
             property_json, error_name
         )
@@ -182,10 +168,9 @@ def parse_from_json(
             condition,
             "self",
             "other",
-            description,
-            synonyms,
-            extra_semantic_info,
         )
+        # Parse the optional common semantic properties like the description.
+        property.parse_optional_properties(property_json)
         parent_collection.add_property(property)

     def build_reverse_relationship(
