
Conversation

tchap
Contributor

@tchap tchap commented Oct 1, 2025

The function can be used to atomically sync a directory with the desired state.
This uses atomicdir.swap implemented earlier.

The function is to be used in an improved implementation of the cert syncer.

Split from #2009

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Oct 1, 2025
@openshift-ci-robot

@tchap: This pull request references Jira Issue OCPBUGS-33013, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @wangke19

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

The function can be used to atomically sync a directory with the desired state.
This uses atomicdir.swap implemented earlier.

Split from #2009

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from wangke19 October 1, 2025 14:04
@tchap
Contributor Author

tchap commented Oct 1, 2025

/assign @p0lyn0mial

@tchap
Contributor Author

tchap commented Oct 1, 2025

I'm wondering how we're doing with contextual/structured logging. Should I perhaps implement it for the logging in this PR?

@tchap tchap force-pushed the sync-directory branch 6 times, most recently from 55468f5 to 4bebc96 on October 2, 2025 at 13:25
filesToSync: map[string][]byte{
"1.txt": []byte("1"),
},
expectDirectorySynchronized: true,
Contributor

It would be nice to see the expected files in the dir after the sync (expectedTargetDirectoryFiles).

I think that in this case the 2.txt file will be removed.

Note that this differs from the code that is currently in use; the question is whether that's an issue. Technically, once we switch to this code, some files might be removed from the target directory.

On the other hand, after removing an entry from a CM/Secret we would want the corresponding file removed, right?

Please double check if the above is correct in https://github.com/openshift/library-go/blob/master/pkg/operator/staticpod/certsyncpod/certsync_controller.go#L71

Contributor Author

Correct. We already discussed in #2009 that this behaviour will now be different: we will be deleting files now.

Contributor Author

Rewrote the test cases to contain expectedFiles.

Contributor

I’m not sure if removing orphaned files is the best strategy. Just thinking out loud.

The previous code has been in use for years, and as we know, behaviour eventually becomes the API. Changing this behaviour might break clusters. Removing a file is an irreversible operation. Previously, deletion was only possible after removing the entire CM/Secret, not individual keys.

Maybe we should consider preserving the previous behaviour? One option would be to introduce a targetDirPreservePolicy that callers can set. If the policy is Keep, then we could simply copy orphaned files from the target directory to the temp directory. If there’s a real need to delete missing keys, that could be introduced per CM/Secret and passed into this method.

WDYT?

/cc @benluddy

Contributor

What would the failure mode look like if we had a dependency on preserving orphaned files? It's hard for me to imagine a specific scenario where an orphaned file is necessary on an upgraded cluster but not on a freshly-installed cluster.

If we don't have a dependency on orphaned files, is preserving them an ongoing risk? If we were building this from scratch today, we would definitely not preserve them (I think?).

If we can't prove that pruning orphaned files is safe, but we think that it probably is, is there some process that would make us comfortable with pruning? For example, we could preserve orphaned files with telemetry to detect when it happens in practice, or back up any orphaned files and keep them around for disaster recovery purposes.

I lean toward pruning them, but it also seems unlikely to me that we'd miss an issue like this during upgrade testing.

Contributor Author

I can't say I've been with OpenShift for very long, but it also seems very unlikely to me that such an issue would go unnoticed. The orphaned files are simply not present on any new cluster you install, right? So from my perspective I am fine with pruning, but yeah, I can't prove anything.

Contributor

Yeah, I also think that not removing files created based on the content of a CM/Secret would be considered a bug. OK, if you both lean toward pruning, then I’m fine with pruning as well.

@openshift-ci openshift-ci bot requested a review from benluddy October 3, 2025 07:53
@p0lyn0mial
Contributor

this PR looks great, thanks!

/lgtm

/hold

for #2027 (comment)
(if we decide to preserve the prev behaviour then we can open a new pr)

@openshift-ci openshift-ci bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Oct 3, 2025
}

klog.Infof("Creating temporary directory to swap for %q ...", targetDir)
tmpDir, err := fs.MkdirTemp(filepath.Dir(targetDir), filepath.Base(targetDir)+"-*")
Contributor

Could we handle a scenario where sync repeatedly fails between populating the staging directory and cleaning up?

What if the caller provided its own (static) directory path to use for staging instead of generating a random path internally? We'd have an upper limit of 1 abandoned staging directory per unique caller.

Contributor Author

Yeah, we could do that. My idea was actually to delete any leftovers in the sync loop where this function will be used. But it's true that passing the staging directory explicitly makes this easier, since the caller then knows where it is and can garbage-collect it easily.

Contributor Author

Added the explicit staging dir in a separate commit.

Contributor

I’m fine with providing a static dir, though in practice the immediate callers will just create temporary dirs.

Note that this package is internal, and we fully control the callers. For our own convenience, it’s simply more practical to have this function create the tmp dir.

If the callers are responsible for creating the static dir, then I also think they should be responsible for removing it.

Contributor Author

The staging dir is created and deleted by this package, but specified by the caller.

Contributor Author

We actually wanted to incorporate the caller into the path, I guess, so it could be staging/certsync/secrets/tls-cert, perhaps.

Contributor

Remember, there’s more than one process using this function. We need to make sure the staging dirs are unique, otherwise those processes will step on each other’s toes, right?

Contributor Author

Yep, that's why I mentioned incorporating the caller in my other comment. You can actually just delete staging/certsync every time the sync function is called, because you have full control.

Contributor

ok, so does that mean we’ll add a unique caller "id" to the staging dir for each process?

Contributor Author

I would certainly do that.

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 8, 2025
The function can be used to atomically sync a directory with the desired
state. This uses atomicdir.swap implemented earlier.
@p0lyn0mial
Contributor

@benluddy is there anything else we should take into consideration? If not, please tag the PR.

@p0lyn0mial
Contributor

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 9, 2025
@benluddy
Contributor

benluddy commented Oct 9, 2025

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 9, 2025
Contributor

openshift-ci bot commented Oct 9, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: benluddy, p0lyn0mial, tchap

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

openshift-ci bot commented Oct 9, 2025

@tchap: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 6c2d3d0 into openshift:master Oct 9, 2025
4 checks passed
@openshift-ci-robot

@tchap: Jira Issue OCPBUGS-33013: Some pull requests linked via external trackers have merged:

The following pull request, linked via external tracker, has not merged:

All associated pull requests must be merged or unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-33013 has not been moved to the MODIFIED state.

In response to this:

The function can be used to atomically sync a directory with the desired state.
This uses atomicdir.swap implemented earlier.

The function is to be used in an improved implementation of the cert syncer.

Split from #2009

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@tchap tchap deleted the sync-directory branch October 9, 2025 13:15