-
Notifications
You must be signed in to change notification settings - Fork 4.5k
Add GroupByEncryptedKey transform #36213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
10 commits
Select commit
Hold shift + click to select a range
8e69e93
Add GroupByEncryptedKey transform
damccorm 4c17662
Missing requirement
damccorm b2956ad
Merge branch 'master' of https://github.com/apache/beam into users/da…
damccorm 0331905
lint
damccorm cefc7c4
Import order
damccorm fab145a
keep type checking
damccorm ee14851
feedback
damccorm d43b5a8
comment disclaimer
damccorm 50443d1
doc note
damccorm d5080f4
Avoid secret naming conflicts
damccorm File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we assert that the key coder is deterministic here/ call self.key_coder.as_deterministic_coder here? For the encrypted key to be deterministic, we assume that the actual key coder is deterministic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if its possible to create a new coder that wraps the original coder along with doing the encryption?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think defining this as a coder could be messy because then you would need to access the secret at construction time and include that as part of the serialized graph definition. This would then not provide sufficient security guarantees since the graph itself would have all information needed to decrypt the value.
Potentially you could include the work of downloading the secret in the coder definition, but I don't think we gain much from this today (and errors might be messy to debug). It also seems nice that we have the actual transform definition which helps make it obvious what is happening from the graph.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is a good call. Interestingly, we do check this for the direct runner GBK implementation, but not more broadly as far as I can tell. But we should definitely be verifying here.
Updated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. I think this will still circumvent update compat checks because we are pulling the original coders out of the graph?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it will, though at least the pipeline should cleanly fail in most cases when it tries to perform the encoding. For now, I will add a note to the doc string that this should be used with caution to avoid this issue.
In the future, there are maybe ways we could address this:
Neither of these is trivial, so for now I will leave this with the doc note
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like the coder.encode could process a tuple (secret_version_name, unencrypted_key) and, use a output a (secret_version_name, encrypted_key)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about this as well - it is pretty expensive though since we're serializing the secret name every time.
This is what I was trying to get around with
We could create a custom type and associated coder per GBEK instance which handles all encoding pieces.since the coder definition could contain the secret name.