Skip to content

Conversation

knassre-bodo
Copy link
Contributor

@knassre-bodo knassre-bodo commented Aug 26, 2025

Adding crucial rewrite step for masked table columns: During relational conversion, if a column is a masked table column, place a PROJECT on top of the SCAN node fetching data from the table. This PROJECT node will invoke an UNMASK operator (containing information from the metadata for the masked table column) which will transform the masked columns into their unmasked forms.

Updates existing tests to account for this transformation, ensuring that the UNMASK calls are injected into the underlying SQL (and pulled up as late as possible by projection pullup). All the tests in test_masked_sqlite.py will now run with the correct e2e answers, as the decryption of the underlying data is now undone by the generated query.

Base automatically changed from kian/sqlite_masked_tests to main September 8, 2025 16:48
Comment on lines -40 to +47
SELECT *
-- 42 - column1, -- ARITHMETIC SHIFT: 42
-- UPPER(column2), -- UPPERCASE
-- UPPER(column3), -- UPPERCASE
-- REPLACE(REPLACE(REPLACE(column4, '0', '*'), '9', '0'), '*', '9'), -- DIGIT SWITCH: 0 <-> 9
-- SUBSTRING(column5, 2) || SUBSTRING(column5, 1, 1), -- FIRST CHAR TRANSPOSE
-- SUBSTRING(column6, 2) || SUBSTRING(column6, 1, 1), -- FIRST CHAR TRANSPOSE
-- DATE(column7, '-472 days') -- DAY SHIFT: 472
SELECT
42 - column1, -- ARITHMETIC SHIFT: 42
UPPER(column2), -- UPPERCASE
UPPER(column3), -- UPPERCASE
REPLACE(REPLACE(REPLACE(column4, '0', '*'), '9', '0'), '*', '9'), -- DIGIT SWITCH: 0 <-> 9
SUBSTRING(column5, 2) || SUBSTRING(column5, 1, 1), -- FIRST CHAR TRANSPOSE
SUBSTRING(column6, 2) || SUBSTRING(column6, 1, 1), -- FIRST CHAR TRANSPOSE
DATE(column7, '-472 days') -- DAY SHIFT: 472
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to restore encryption in the original PR, but its fine bc the E2E tests were being skipped until now

Comment on lines +175 to +181
if isinstance(
operator,
(
pydop.MaskedExpressionFunctionOperator,
pydop.SqlMacroExpressionFunctionOperator,
),
):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just extending the logic we already use for UDFs with macro text, but now with the new MaskedExpressionFunctionOperator operator, which contains a format string. This format string is either the unprotect or protect format string, depending on whether is_unprotect is True/False in the operator.

Comment on lines 67 to 70
def to_string(self, arg_strings: list[str]) -> str:
name: str = "UNMASK" if self.is_unprotect else "MASK"
arg_strings = [f"[{s}]" for s in arg_strings]
return f"{name}::({self.format_string.format(*arg_strings)})"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mostly for relational plans, so we can clearly delineate the mask/unmask call, the logic within, and where the arguments get injected. For example, if I am unmasking the expression foo - 7, and the logic to unmask x is (2 * x) - 1, then this gets stringified as UNMASK::((2 * {foo - 7}) - 1)

Comment on lines 57 to 65
"""
The format string to use for this operator to either mask or unmask the
operand.
"""
return (
self.masking_metadata.unprotect_protocol
if self.is_unprotect
else self.masking_metadata.protect_protocol
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This switch logic makes the SQL conversion step seemless

def __init__(
self,
masking_metadata: MaskedTableColumnMetadata,
is_unprotect: bool,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We store the object with the metadata & this boolean so all we need to do for #418 to create a MASK call is create a copy of the operator with is_unprotect toggled to False.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why named is_unprotect, for consistency is_unmasked makes more sense since class is called Masked...?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, but will rename to is_unmask since it is describing whether the function masks or unmasks,.

@knassre-bodo knassre-bodo requested a review from a team September 9, 2025 17:55
# If any of the columns are masked, insert a projection on top to unmask
# them.
if any(
isinstance(expr, HybridColumnExpr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would create a helper function for code readability:

def _is_a_masked_column(expr):
    return isinstance(expr, HybridColumnExpr) 
        and isinstance(expr.column.column_property, MaskedTableColumnMetadata)

Then:

if any (
    _is_a_masked_column(expr) 
    for expr in node.terms.values()
):

):
unmask_columns: dict[str, RelationalExpression] = {}
for name, hybrid_expr in node.terms.items():
if isinstance(hybrid_expr, HybridColumnExpr) and isinstance(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use the helper function here: if _is_a_masked_column(hybrid_expr):

Copy link
Contributor

@hadia206 hadia206 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me.
I just have one comment on renaming some variables.

def __init__(
self,
masking_metadata: MaskedTableColumnMetadata,
is_unprotect: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why named is_unprotect, for consistency is_unmasked makes more sense since class is called Masked...?

Comment on lines +34 to +36
# Create a dummy verifier that requires exactly one argument, since all
# masking/unmasking operations are unary.
verifier: TypeVerifier = RequireNumArgs(1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that a guarantee for all different use cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes since, from a PyDough perspective, the operator is always invoked the form UNMASK(arg) or MASK(arg). Internally, when the macro gets expanded, it may contain multiple arguments, but it is only parameterized on one argument (the thing getting masked/unmasked).

Comment on lines 75 to 77
self.masking_metadata.unprotect_protocol
if self.is_unprotect
else self.masking_metadata.protect_protocol
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same. For consistency, we should rename those to be unmask_protocol and mask_protocol

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are in the metadata, and are named to be consistent with the JSON fields protect protocol and unprotect protocol. If we want to change those, that's now an API spec change to something already merged. We can, but that's a separate discussion from this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with both of you. That is a separate discussion from this PR but I think we should consider changing protect/unprotect references to mask/unmask.

unmask_columns: dict[str, RelationalExpression] = {}
for name, hybrid_expr in node.terms.items():
if self.is_masked_column(hybrid_expr):
assert isinstance(hybrid_expr, HybridColumnExpr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these assert required? We are here because is_masked_column(hybrid_expr) == true. Am I missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they are required for mypy

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh, you are right.

Do something like def is_masked_column(self, expr: HybridExpr) -> TypeGuard[HybridColumnExpr]: for the return type will be OK with mypy? so we don't need extra assert code just for that?

I mean, it is OK since both asserts will be always true... I was just thinking there should be way to do that for mypy.

@knassre-bodo knassre-bodo changed the base branch from main to Hadia/sf_masked_tests September 25, 2025 17:15
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

Base automatically changed from Hadia/sf_masked_tests to main September 29, 2025 20:59
@knassre-bodo knassre-bodo merged commit 1be9e87 into main Sep 30, 2025
12 checks passed
@knassre-bodo knassre-bodo deleted the kian/masked_relational_rewrite branch September 30, 2025 01:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants