OCPBUGS-33013: Add atomicdir.Sync function #2027
@tchap: This pull request references Jira Issue OCPBUGS-33013, which is valid. 3 validation(s) were run on this bug.
Requesting review from QA contact. The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/assign @p0lyn0mial
I am wondering how we are doing with contextual/structured logging. Should I perhaps implement it for the logging in this PR?
Force-pushed from 55468f5 to 4bebc96.
filesToSync: map[string][]byte{
	"1.txt": []byte("1"),
},
expectDirectorySynchronized: true,
It would be nice to see the expected files in the dir after the sync (expectedTargetDirectoryFiles). I think that in this case the 2.txt file will be removed.
Now, I think that is different from the code that is currently in use. The question is whether this is an issue. So technically, once we switch to this code, some files might be removed from the target directory. On the other hand, after removing an entry from a CM/Secret we would want to remove the file, right?
Please double check whether the above is correct in https://github.com/openshift/library-go/blob/master/pkg/operator/staticpod/certsyncpod/certsync_controller.go#L71
Correct. We already discussed in #2009 that this will now be different: we will be deleting files now.
Rewrote the test cases to contain expectedFiles.
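For illustration only, one of the rewritten table entries could look roughly like this. filesToSync, expectDirectorySynchronized, and expectedFiles come from the snippets and comments above; name and existingFiles are made-up stand-ins for whatever the actual test table uses:

```go
{
	name: "orphaned file is pruned",
	// 2.txt is present in the target directory before the sync but is not
	// part of the desired state, so it should be gone afterwards.
	existingFiles: map[string][]byte{
		"1.txt": []byte("old"),
		"2.txt": []byte("2"),
	},
	filesToSync: map[string][]byte{
		"1.txt": []byte("1"),
	},
	expectDirectorySynchronized: true,
	expectedFiles: map[string][]byte{
		"1.txt": []byte("1"),
	},
},
```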
I’m not sure if removing orphaned files is the best strategy. Just thinking out loud.
The previous code has been in use for years, and as we know, behaviour eventually becomes the API. Changing this behaviour might break clusters. Removing a file is an irreversible operation. Previously, deletion was only possible after removing the entire CM/Secret, not individual keys.
Maybe we should consider preserving the previous behaviour? One option would be to introduce a targetDirPreservePolicy that callers can set. If the policy is Keep, then we could simply copy orphaned files from the target directory to the temp directory. If there’s a real need to delete missing keys, that could be introduced per CM/Secret and passed into this method.
WDYT?
/cc @benluddy
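For the sake of the discussion above, a minimal sketch of what such a knob could look like; the identifiers (TargetDirPreservePolicy, Keep, Prune) are hypothetical and nothing here is part of the actual PR:

```go
// TargetDirPreservePolicy controls what happens to "orphaned" files, i.e.
// files that exist in the target directory but are not part of the desired
// state passed to Sync.
type TargetDirPreservePolicy string

const (
	// Keep copies orphaned files from the target directory into the staging
	// directory before the swap, preserving the historical behaviour.
	TargetDirPreservePolicyKeep TargetDirPreservePolicy = "Keep"
	// Prune leaves orphaned files out of the staging directory, so the swap
	// effectively deletes them.
	TargetDirPreservePolicyPrune TargetDirPreservePolicy = "Prune"
)
```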
What would the failure mode look like if we had a dependency on preserving orphaned files? It's hard for me to imagine a specific scenario where an orphaned file is necessary on an upgraded cluster but not on a freshly-installed cluster.
If we don't have a dependency on orphaned files, is preserving them an ongoing risk? If we were building this from scratch today, we would definitely not preserve them (I think?).
If we can't prove that pruning orphaned files is safe, but we think that it probably is, is there some process that would make us comfortable with pruning? For example, we could preserve orphaned files with telemetry to detect when it happens in practice, or back up any orphaned files and keep them around for disaster recovery purposes.
I lean toward pruning them, but it also seems unlikely to me that we'd miss an issue like this during upgrade testing.
I can't say I've been with OpenShift for too long, but it also seems very unlikely to me that such an issue would go unnoticed. The orphaned files are simply not present on any new cluster you decide to install, right? So from my perspective, I am fine with pruning, but yeah, I can't prove anything.
Yeah, I also think that not removing files created based on the content of a CM/Secret would be considered a bug. OK, if you both lean toward pruning, then I’m fine with pruning as well.
This PR looks great, thanks!
/lgtm
/hold for #2027 (comment)
klog.Infof("Creating temporary directory to swap for %q ...", targetDir)
tmpDir, err := fs.MkdirTemp(filepath.Dir(targetDir), filepath.Base(targetDir)+"-*")
Could we handle a scenario where sync repeatedly fails between populating the staging directory and cleaning up?
What if the caller provided its own (static) directory path to use for staging instead of generating a random path internally? We'd have an upper limit of 1 abandoned staging directory per unique caller.
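A minimal sketch of the idea, assuming a caller-supplied staging path; the signature, the use of the plain os package, and the final remove-and-rename are all assumptions made for illustration, not the code in this PR, which swaps directories atomically via atomicdir.swap:

```go
package atomicdir

import (
	"os"
	"path/filepath"
)

// Sync (hypothetical shape) writes the desired files into a caller-owned
// staging directory and then swaps it with targetDir. Because stagingDir is
// fixed per caller, a crash between populating and cleaning up leaves behind
// at most one stale directory, which the next call reclaims.
func Sync(targetDir, stagingDir string, files map[string][]byte) error {
	// Reclaim any leftovers from a previous run that failed mid-way.
	if err := os.RemoveAll(stagingDir); err != nil {
		return err
	}
	if err := os.MkdirAll(stagingDir, 0o755); err != nil {
		return err
	}
	for name, content := range files {
		if err := os.WriteFile(filepath.Join(stagingDir, name), content, 0o600); err != nil {
			return err
		}
	}
	// Stand-in for the atomic swap; the real implementation does not simply
	// remove and rename like this.
	if err := os.RemoveAll(targetDir); err != nil {
		return err
	}
	return os.Rename(stagingDir, targetDir)
}
```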
Yeah, we could do that. My idea was actually to delete any leftovers in the sync loop where this function will be used. But it's true that actually passing the staging directory makes this easier as the caller then knows where it is and can do garbage collection easily.
Added the explicit staging dir in a separate commit.
I’m fine with providing a static dir, though in practice the immediate callers will just create temporary dirs.
Note that this package is internal, and we fully control the callers. For our own convenience, it’s simply more practical to have this function create the tmp dir.
If the callers are responsible for creating the static dir, then I also think they should be responsible for removing it.
The staging dir is created and deleted by this package, but specified by the caller.
We actually wanted to incorporate the caller, I guess, so it can be staging/certsync/secrets/tls-cert, perhaps.
Remember, there’s more than one process using this function. We need to make sure the staging dirs are unique, otherwise these two processes will step on each other’s toes, right?
Yep, that's why I mentioned the caller in the next comment. You can actually just delete staging/certsync every time the sync function is called, because you have full control.
OK, so does that mean we’ll add a unique caller "id" to the staging dir for each process?
I would certainly do that.
The function can be used to atomically sync a directory with the desired state. This uses atomicdir.swap implemented earlier.
@benluddy is there anything else we should take into consideration? If not, please tag the PR.
/hold cancel
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: benluddy, p0lyn0mial, tchap
The full list of commands accepted by this bot can be found here. The pull request process is described here.
@tchap: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
@tchap: Jira Issue OCPBUGS-33013: Some pull requests linked via external trackers have merged, but the following pull request, linked via external tracker, has not merged.
All associated pull requests must be merged or unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh. Jira Issue OCPBUGS-33013 has not been moved to the MODIFIED state.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
The function can be used to atomically sync a directory with the desired state. This uses atomicdir.swap implemented earlier. The function is to be used in an improved implementation of the cert syncer.
Split from #2009
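As a rough picture of the intended call site (purely illustrative: the helper name, destinationDir, the staging layout, and the Sync signature are assumptions, the imports for filepath, corev1, and the internal atomicdir package are omitted, and the actual cert syncer changes live in the follow-up PR), syncing a Secret would reduce to building the desired file map and handing it to the helper:

```go
// syncSecretDir is a hypothetical caller-side helper showing how the improved
// cert syncer could drive atomicdir.Sync.
func syncSecretDir(destinationDir string, secret *corev1.Secret) error {
	files := make(map[string][]byte, len(secret.Data))
	for key, value := range secret.Data {
		files[key] = value
	}
	targetDir := filepath.Join(destinationDir, "secrets", secret.Name)
	// Caller-unique staging path, as discussed in the review thread above, so
	// different processes never share a staging directory.
	stagingDir := filepath.Join(destinationDir, "staging", "cert-syncer", "secrets", secret.Name)
	return atomicdir.Sync(targetDir, stagingDir, files)
}
```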