Add PyDough metadata for masked table columns (#401)
Adds a new metadata type `masked table column`, a subclass of `table column` for data that is stored in the underlying table in a masked (encrypted) form, where the metadata records additional information such as how to decrypt it. Columns of this type have the following additional properties:
- `unprotect protocol`: a Python format string containing SQL text that, when the value is injected, indicates how to decrypt the data.
- `protect protocol`: a Python format string containing SQL text that, when a value is injected, indicates how to encrypt the value in the same manner as the column was encrypted, so the unprotect protocol will reverse it.
- `server masked`: optional boolean (default False) which, if True, indicates that information about the encryption/decryption scheme is available on a server that can be queried to rewrite/optimize predicates so that unmasking is avoided.
For example, suppose a string column `name` is "encrypted" by uppercasing it, and "decrypted" by lowercasing it, and does not have a server. This would be the following metadata:
```json
{
  "name": "name",
  "type": "masked table column",
  "column name": "c_name",
  "data type": "string",
  "unprotect protocol": "LOWER({})",
  "protect protocol": "UPPER({})",
  "server masked": false
}
```
So, for instance, when reading the data from the table, instead of placing `c_name` in the SELECT clause we would place `LOWER(c_name)` there so that it is "decrypted" (since the data in the table is uppercase). Conversely, to filter on `name == "john smith"`, instead of evaluating `LOWER(c_name) == "john smith"` we could use the protect protocol to rewrite the predicate as `c_name == UPPER("john smith")`.
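The substitution described above can be sketched in a few lines of Python. This is an illustration only, not PyDough's implementation: the helper names `unmasked_select` and `masked_literal` and the `customers` table are hypothetical, and an in-memory SQLite database stands in for the real backend. Following the prose, unmasking lowercases and masking uppercases.

```python
import sqlite3

# Protocols from the prose example: data is "encrypted" by uppercasing
# and "decrypted" by lowercasing.
UNPROTECT = "LOWER({})"
PROTECT = "UPPER({})"


def unmasked_select(column: str) -> str:
    """Return the SQL expression that reads a masked column in unmasked form."""
    return UNPROTECT.format(column)


def masked_literal(literal_sql: str) -> str:
    """Return the SQL expression that masks a literal for comparison with the raw column."""
    return PROTECT.format(literal_sql)


# Hypothetical table whose c_name values are stored in masked (uppercase) form.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (c_name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?)", [("JOHN SMITH",), ("JANE DOE",)]
)

# Reading: unmask the column in the SELECT clause.
select_sql = f"SELECT {unmasked_select('c_name')} FROM customers ORDER BY 1"
names = [row[0] for row in conn.execute(select_sql)]

# Filtering: mask the literal instead of unmasking every row.
literal = masked_literal("'john smith'")
filter_sql = f"SELECT COUNT(*) FROM customers WHERE c_name = {literal}"
match_count = conn.execute(filter_sql).fetchone()[0]
```

Masking the literal rather than the column keeps the comparison on the raw stored values, which is the rewrite the commit message motivates.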
A property with this type is the same as a regular table column, except the data in the underlying table has been masked using an encryption protocol. The metadata includes the information required to either encrypt values in SQL using the same masking protocol, or unmask previously masked data in SQL queries.
Properties of this type use the type string "masked table column" and include all the properties from [table column](#property-type-table-column), plus the following additional key-value pairs in their metadata JSON object:
- `unprotect protocol` (required): a Python format string representing the SQL text that is used to unmask the data after reading it from the underlying table. The format string should expect a single placeholder value (e.g. `"SUBSTRING({0}, -1) || SUBSTRING({0}, 1, LENGTH({0}) - 1)".format("c_name")` will generate the SQL text `SUBSTRING(c_name, -1) || SUBSTRING(c_name, 1, LENGTH(c_name) - 1)`).
- `protect protocol` (required): a Python format string, in the same format as `unprotect protocol`, used to describe how the data was originally masked. This can be used to generate masked values consistent with the encryption scheme, allowing operations such as comparisons between masked data.
- `protected data type` (optional): same as `data type`, except referring to the type of the data when it is protected, whereas `data type` refers to the raw unprotected column. If omitted, the protected data is assumed to have the same data type as the unprotected data.
- `server masked` (optional): a boolean flag indicating whether the column was masked on a server that is attached to PyDough. If `true`, PyDough can use it to optimize queries by rewriting predicates and expressions to avoid unmasking the data.
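The required and optional keys listed above could be checked along the following lines. This is a minimal sketch, not PyDough's actual parsing code: `MaskedColumnSpec` and `parse_masked_column` are hypothetical names, and treating `column name` and `data type` as required string keys is an assumption based on the example in the commit message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaskedColumnSpec:
    """Hypothetical container for the masked-column keys (not a PyDough class)."""

    column_name: str
    data_type: str
    unprotect_protocol: str
    protect_protocol: str
    protected_data_type: str
    server_masked: bool


def parse_masked_column(raw: dict) -> MaskedColumnSpec:
    """Validate a metadata JSON object and fill in the documented defaults."""
    if raw.get("type") != "masked table column":
        raise ValueError("expected type 'masked table column'")
    required = ("column name", "data type", "unprotect protocol", "protect protocol")
    for key in required:
        if not isinstance(raw.get(key), str):
            raise ValueError(f"missing or non-string required key: {key!r}")
    server_masked = raw.get("server masked", False)  # optional, defaults to False
    if not isinstance(server_masked, bool):
        raise ValueError("'server masked' must be a boolean")
    return MaskedColumnSpec(
        column_name=raw["column name"],
        data_type=raw["data type"],
        unprotect_protocol=raw["unprotect protocol"],
        protect_protocol=raw["protect protocol"],
        # If omitted, the protected type is assumed to match the unprotected one.
        protected_data_type=raw.get("protected data type", raw["data type"]),
        server_masked=server_masked,
    )


spec = parse_masked_column(
    {
        "name": "name",
        "type": "masked table column",
        "column name": "c_name",
        "data type": "string",
        "unprotect protocol": "LOWER({})",
        "protect protocol": "UPPER({})",
    }
)
unmask_sql = spec.unprotect_protocol.format(spec.column_name)
```

Here both optional keys are omitted, so `spec.server_masked` falls back to `False` and `spec.protected_data_type` falls back to the value of `data type`.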
Example of the structure of the metadata for a masked table column property where the string data is masked by moving the first character to the end, and unmasked by moving it back to the beginning:
Another example of the structure of the metadata for a masked table column property where the numeric data is masked by converting it to a string, replacing the `0` digits with asterisks, and left-padding to length 10 with asterisks, then unmasked by reversing the process:
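The two transformations described in these examples can be emulated in plain Python to check that each unmasking step reverses its masking step. This is only a sketch: the function names are hypothetical, the code is not the SQL text stored in the metadata, and the numeric version assumes non-negative integers of at most 10 digits.

```python
def rotate_mask(value: str) -> str:
    """Mask a string by moving its first character to the end."""
    return value[1:] + value[:1]


def rotate_unmask(masked: str) -> str:
    """Unmask by moving the last character back to the beginning."""
    return masked[-1:] + masked[:-1]


def numeric_mask(value: int) -> str:
    """Mask a number: stringify, swap '0' for '*', left-pad to 10 with '*'."""
    text = str(value)
    if value < 0 or len(text) > 10:
        raise ValueError("sketch assumes a non-negative integer of at most 10 digits")
    return text.replace("0", "*").rjust(10, "*")


def numeric_unmask(masked: str) -> int:
    """Reverse numeric_mask: drop the padding, then turn '*' back into '0'."""
    # Leading asterisks are always padding, because a decimal integer never
    # starts with '0' unless it is exactly 0 (which masks to all asterisks).
    digits = masked.lstrip("*")
    return int(digits.replace("*", "0")) if digits else 0


masked_name = rotate_mask("john")
masked_number = numeric_mask(1050)
```

Both pairs round-trip: `rotate_unmask(rotate_mask(s))` returns `s`, and `numeric_unmask(numeric_mask(n))` returns `n` for any value within the stated assumptions.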
pydough/metadata/properties/README.md
@@ -20,6 +20,7 @@ The properties classes in PyDough follow a hierarchy that includes both abstract
- [`PropertyMetadata`](property_metadata.py) (abstract): Base class for all property metadata.
- [`ScalarAttributeMetadata`](scalar_attribute_metadata.py) (abstract): Base class for properties that are scalars within each record of a collection.
- [`TableColumnMetadata`](table_column_metadata.py) (concrete): Represents a column of data from a relational table.
- [`MaskedTableColumnMetadata`](masked_table_column_metadata.py) (concrete): Represents a variant of a TableColumnMetadata where the data in the table has been encrypted by a masking protocol but the metadata stores information about that protocol, including how to unmask it when reading the data from the table.
- [`SubcollectionRelationshipMetadata`](subcollection_relationship_metadata.py) (abstract): Base class for properties that map to a subcollection of a collection.
- [`ReversiblePropertyMetadata`](reversible_property_metadata.py) (abstract): Base class for properties that map to a subcollection and have a corresponding reverse relationship.
- [`CartesianProductMetadata`](cartesian_product_metadata.py) (concrete): Represents a cartesian product between a collection and its subcollection.