OCPBUGS-33013: Add atomicdir.Sync function #2027
@tchap: This pull request references Jira Issue OCPBUGS-33013, which is valid. 3 validation(s) were run on this bug.
Requesting review from QA contact. The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/assign @p0lyn0mial
I am wondering how we are doing with contextual/structured logging. Should I perhaps implement it for the logging in this PR?
Force-pushed from 55468f5 to 4bebc96.
filesToSync: map[string][]byte{
	"1.txt": []byte("1"),
},
expectDirectorySynchronized: true,
It would be nice to see the expected files in the dir after the sync (expectedTargetDirectoryFiles). I think that in this case the 2.txt file will be removed.
Now, I think that is different from the code that is currently in use. The question is whether this is an issue. So technically, once we switch to this code, some files might be removed from the target directory. On the other hand, after removing an entry from a CM/Secret we would want to remove the file, right?
Please double check whether the above is correct in https://github.com/openshift/library-go/blob/master/pkg/operator/staticpod/certsyncpod/certsync_controller.go#L71
Correct. We already discussed in #2009 that this will now be different: we will be deleting files now.
Rewrote the test cases to contain expectedFiles.
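For illustration only, one of the rewritten table entries could look roughly like this. filesToSync, expectDirectorySynchronized, and expectedFiles come from the snippets and comments above; name and existingFiles are made-up stand-ins for whatever the actual test table uses:

```go
{
	name: "orphaned file is pruned",
	// 2.txt is present in the target directory before the sync but is not
	// part of the desired state, so it should be gone afterwards.
	existingFiles: map[string][]byte{
		"1.txt": []byte("old"),
		"2.txt": []byte("2"),
	},
	filesToSync: map[string][]byte{
		"1.txt": []byte("1"),
	},
	expectDirectorySynchronized: true,
	expectedFiles: map[string][]byte{
		"1.txt": []byte("1"),
	},
},
```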
I’m not sure if removing orphaned files is the best strategy. Just thinking out loud.
The previous code has been in use for years, and as we know, behaviour eventually becomes the API. Changing this behaviour might break clusters. Removing a file is an irreversible operation. Previously, deletion was only possible after removing the entire CM/Secret, not individual keys.
Maybe we should consider preserving the previous behaviour? One option would be to introduce a targetDirPreservePolicy that callers can set. If the policy is Keep, then we could simply copy orphaned files from the target directory to the temp directory. If there’s a real need to delete missing keys, that could be introduced per CM/Secret and passed into this method.
WDYT?
/cc @benluddy
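For the sake of the discussion above, a minimal sketch of what such a knob could look like; the identifiers (TargetDirPreservePolicy, Keep, Prune) are hypothetical and nothing here is part of the actual PR:

```go
// TargetDirPreservePolicy controls what happens to "orphaned" files, i.e.
// files that exist in the target directory but are not part of the desired
// state passed to Sync.
type TargetDirPreservePolicy string

const (
	// Keep copies orphaned files from the target directory into the staging
	// directory before the swap, preserving the historical behaviour.
	TargetDirPreservePolicyKeep TargetDirPreservePolicy = "Keep"
	// Prune leaves orphaned files out of the staging directory, so the swap
	// effectively deletes them.
	TargetDirPreservePolicyPrune TargetDirPreservePolicy = "Prune"
)
```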
What would the failure mode look like if we had a dependency on preserving orphaned files? It's hard for me to imagine a specific scenario where an orphaned file is necessary on an upgraded cluster but not on a freshly-installed cluster.
If we don't have a dependency on orphaned files, is preserving them an ongoing risk? If we were building this from scratch today, we would definitely not preserve them (I think?).
If we can't prove that pruning orphaned files is safe, but we think that it probably is, is there some process that would make us comfortable with pruning? For example, we could preserve orphaned files with telemetry to detect when it happens in practice, or back up any orphaned files and keep them around for disaster recovery purposes.
I lean toward pruning them, but it also seems unlikely to me that we'd miss an issue like this during upgrade testing.
I can't say I've been with OpenShift for too long, but it also seems very unlikely to me that such an issue would go unnoticed. The orphaned files are simply not present on any new cluster you decide to install, right? So from my perspective, I am fine with pruning, but yeah, I can't prove anything.
Yeah, I also think that not removing files created based on the content of a CM/Secret would be considered a bug. OK, if you both lean toward pruning, then I’m fine with pruning as well.
This PR looks great, thanks!
/lgtm
/hold for #2027 (comment)
klog.Infof("Creating temporary directory to swap for %q ...", targetDir)
tmpDir, err := fs.MkdirTemp(filepath.Dir(targetDir), filepath.Base(targetDir)+"-*")
Could we handle a scenario where sync repeatedly fails between populating the staging directory and cleaning up?
What if the caller provided its own (static) directory path to use for staging instead of generating a random path internally? We'd have an upper limit of 1 abandoned staging directory per unique caller.
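A minimal sketch of the idea, assuming a caller-supplied staging path; the signature, the use of the plain os package, and the final remove-and-rename are all assumptions made for illustration, not the code in this PR, which swaps directories atomically via atomicdir.swap:

```go
package atomicdir

import (
	"os"
	"path/filepath"
)

// Sync (hypothetical shape) writes the desired files into a caller-owned
// staging directory and then swaps it with targetDir. Because stagingDir is
// fixed per caller, a crash between populating and cleaning up leaves behind
// at most one stale directory, which the next call reclaims.
func Sync(targetDir, stagingDir string, files map[string][]byte) error {
	// Reclaim any leftovers from a previous run that failed mid-way.
	if err := os.RemoveAll(stagingDir); err != nil {
		return err
	}
	if err := os.MkdirAll(stagingDir, 0o755); err != nil {
		return err
	}
	for name, content := range files {
		if err := os.WriteFile(filepath.Join(stagingDir, name), content, 0o600); err != nil {
			return err
		}
	}
	// Stand-in for the atomic swap; the real implementation does not simply
	// remove and rename like this.
	if err := os.RemoveAll(targetDir); err != nil {
		return err
	}
	return os.Rename(stagingDir, targetDir)
}
```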
Yeah, we could do that. My idea was actually to delete any leftovers in the sync loop where this function will be used. But it's true that actually passing the staging directory makes this easier as the caller then knows where it is and can do garbage collection easily.
Added the explicit staging dir in a separate commit.
I’m fine with providing a static dir, though in practice the immediate callers will just create temporary dirs.
Note that this package is internal, and we fully control the callers. For our own convenience, it’s simply more practical to have this function create the tmp dir.
If the callers are responsible for creating the static dir, then I also think they should be responsible for removing it.
The staging dir is created and deleted by this package, but specified by the caller.
We actually wanted to incorporate the caller, I guess, so it can be staging/certsync/secrets/tls-cert, perhaps.
Remember, there’s more than one process using this function. We need to make sure the staging dirs are unique, otherwise these two processes will step on each other’s toes, right?
Yep, that's why I mentioned the caller in the next comment. You can actually just delete staging/certsync every time the sync function is called, because you have full control.
OK, so does that mean we’ll add a unique caller "id" to the staging dir for each process?
I would certainly do that.
The function can be used to atomically sync a directory with the desired state. This uses atomicdir.swap implemented earlier.
@benluddy is there anything else we should take into consideration? If not, please tag the PR.
/hold cancel
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: benluddy, p0lyn0mial, tchap
The full list of commands accepted by this bot can be found here. The pull request process is described here.
@tchap: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
@tchap: Jira Issue OCPBUGS-33013: Some pull requests linked via external trackers have merged, but the following pull request, linked via external tracker, has not merged.
All associated pull requests must be merged or unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh. Jira Issue OCPBUGS-33013 has not been moved to the MODIFIED state.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
The function can be used to atomically sync a directory with the desired state. This uses atomicdir.swap implemented earlier. The function is to be used in an improved implementation of the cert syncer.
Split from #2009
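As a rough picture of the intended call site (purely illustrative: the helper name, destinationDir, the staging layout, and the Sync signature are assumptions, the imports for filepath, corev1, and the internal atomicdir package are omitted, and the actual cert syncer changes live in the follow-up PR), syncing a Secret would reduce to building the desired file map and handing it to the helper:

```go
// syncSecretDir is a hypothetical caller-side helper showing how the improved
// cert syncer could drive atomicdir.Sync.
func syncSecretDir(destinationDir string, secret *corev1.Secret) error {
	files := make(map[string][]byte, len(secret.Data))
	for key, value := range secret.Data {
		files[key] = value
	}
	targetDir := filepath.Join(destinationDir, "secrets", secret.Name)
	// Caller-unique staging path, as discussed in the review thread above, so
	// different processes never share a staging directory.
	stagingDir := filepath.Join(destinationDir, "staging", "cert-syncer", "secrets", secret.Name)
	return atomicdir.Sync(targetDir, stagingDir, files)
}
```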