-
Notifications
You must be signed in to change notification settings - Fork 3
Add masked table column relational unprotect rewrite #417
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… implemented [RUN CI]
SELECT * | ||
-- 42 - column1, -- ARITHMETIC SHIFT: 42 | ||
-- UPPER(column2), -- UPPERCASE | ||
-- UPPER(column3), -- UPPERCASE | ||
-- REPLACE(REPLACE(REPLACE(column4, '0', '*'), '9', '0'), '*', '9'), -- DIGIT SWITCH: 0 <-> 9 | ||
-- SUBSTRING(column5, 2) || SUBSTRING(column5, 1, 1), -- FIRST CHAR TRANSPOSE | ||
-- SUBSTRING(column6, 2) || SUBSTRING(column6, 1, 1), -- FIRST CHAR TRANSPOSE | ||
-- DATE(column7, '-472 days') -- DAY SHIFT: 472 | ||
SELECT | ||
42 - column1, -- ARITHMETIC SHIFT: 42 | ||
UPPER(column2), -- UPPERCASE | ||
UPPER(column3), -- UPPERCASE | ||
REPLACE(REPLACE(REPLACE(column4, '0', '*'), '9', '0'), '*', '9'), -- DIGIT SWITCH: 0 <-> 9 | ||
SUBSTRING(column5, 2) || SUBSTRING(column5, 1, 1), -- FIRST CHAR TRANSPOSE | ||
SUBSTRING(column6, 2) || SUBSTRING(column6, 1, 1), -- FIRST CHAR TRANSPOSE | ||
DATE(column7, '-472 days') -- DAY SHIFT: 472 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot to restore encryption in the original PR, but its fine bc the E2E tests were being skipped until now
if isinstance( | ||
operator, | ||
( | ||
pydop.MaskedExpressionFunctionOperator, | ||
pydop.SqlMacroExpressionFunctionOperator, | ||
), | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just extending the logic we already use for UDFs with macro text, but now with the new MaskedExpressionFunctionOperator
operator, which contains a format string. This format string is either the unprotect or protect format string, depending on whether is_unprotect
is True/False in the operator.
def to_string(self, arg_strings: list[str]) -> str: | ||
name: str = "UNMASK" if self.is_unprotect else "MASK" | ||
arg_strings = [f"[{s}]" for s in arg_strings] | ||
return f"{name}::({self.format_string.format(*arg_strings)})" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mostly for relational plans, so we can clearly delineate the mask/unmask call, the logic within, and where the arguments get injected. For example, if I am unmasking the expression foo - 7
, and the logic to unmask x
is (2 * x) - 1
, then this gets stringified as UNMASK::((2 * {foo - 7}) - 1)
""" | ||
The format string to use for this operator to either mask or unmask the | ||
operand. | ||
""" | ||
return ( | ||
self.masking_metadata.unprotect_protocol | ||
if self.is_unprotect | ||
else self.masking_metadata.protect_protocol | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This switch logic makes the SQL conversion step seemless
def __init__( | ||
self, | ||
masking_metadata: MaskedTableColumnMetadata, | ||
is_unprotect: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We store the object with the metadata & this boolean so all we need to do for #418 to create a MASK call is create a copy of the operator with is_unprotect
toggled to False.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why named is_unprotect
, for consistency is_unmasked
makes more sense since class is called Masked...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, but will rename to is_unmask
since it is describing whether the function masks or unmasks,.
# If any of the columns are masked, insert a projection on top to unmask | ||
# them. | ||
if any( | ||
isinstance(expr, HybridColumnExpr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would create a helper function for code readability:
def _is_a_masked_column(expr):
return isinstance(expr, HybridColumnExpr)
and isinstance(expr.column.column_property, MaskedTableColumnMetadata)
Then:
if any (
_is_a_masked_column(expr)
for expr in node.terms.values()
):
): | ||
unmask_columns: dict[str, RelationalExpression] = {} | ||
for name, hybrid_expr in node.terms.items(): | ||
if isinstance(hybrid_expr, HybridColumnExpr) and isinstance( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would use the helper function here: if _is_a_masked_column(hybrid_expr):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good to me.
I just have one comment on renaming some variables.
def __init__( | ||
self, | ||
masking_metadata: MaskedTableColumnMetadata, | ||
is_unprotect: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why named is_unprotect
, for consistency is_unmasked
makes more sense since class is called Masked...?
# Create a dummy verifier that requires exactly one argument, since all | ||
# masking/unmasking operations are unary. | ||
verifier: TypeVerifier = RequireNumArgs(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that a guarantee for all different use cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes since, from a PyDough perspective, the operator is always invoked the form UNMASK(arg)
or MASK(arg)
. Internally, when the macro gets expanded, it may contain multiple arguments, but it is only parameterized on one argument (the thing getting masked/unmasked).
self.masking_metadata.unprotect_protocol | ||
if self.is_unprotect | ||
else self.masking_metadata.protect_protocol |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same. For consistency, we should rename those to be unmask_protocol
and mask_protocol
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those are in the metadata, and are named to be consistent with the JSON fields protect protocol
and unprotect protocol
. If we want to change those, that's now an API spec change to something already merged. We can, but that's a separate discussion from this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with both of you. That is a separate discussion from this PR but I think we should consider changing protect/unprotect references to mask/unmask.
unmask_columns: dict[str, RelationalExpression] = {} | ||
for name, hybrid_expr in node.terms.items(): | ||
if self.is_masked_column(hybrid_expr): | ||
assert isinstance(hybrid_expr, HybridColumnExpr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these assert required? We are here because is_masked_column(hybrid_expr) == true
. Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, they are required for mypy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohh, you are right.
Do something like def is_masked_column(self, expr: HybridExpr) -> TypeGuard[HybridColumnExpr]:
for the return type will be OK with mypy? so we don't need extra assert code just for that?
I mean, it is OK since both asserts will be always true... I was just thinking there should be way to do that for mypy.
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
Adding crucial rewrite step for masked table columns: During relational conversion, if a column is a masked table column, place a
PROJECT
on top of theSCAN
node fetching data from the table. ThisPROJECT
node will invoke anUNMASK
operator (containing information from the metadata for the masked table column) which will transform the masked columns into their unmasked forms.Updates existing tests to account for this transformation, ensuring that the
UNMASK
calls are injected into the underlying SQL (and pulled up as late as possible by projection pullup). All the tests intest_masked_sqlite.py
will now run with the correct e2e answers, as the decryption of the underlying data is now undone by the generated query.