[release/v25.1.x] Add migration job to handle mismatched field managers (#1249) #1250

Closed
github-actions[bot] wants to merge 1 commit into release/v25.1.x from backport/release/v25.1.x/pr-1249

Conversation

@github-actions

Backport

This will backport the following commits from main to release/v25.1.x:

Questions?

Please refer to the Backport tool documentation

In previous versions of the operator, when some of our synchronization code was ported to the `kube` library, a bug was introduced that prevented the field manager from being set (see redpanda-data/common-go#126 for the relevant fix in the `kube` package as it exists today). Additionally, we have been inconsistent in how we set the field manager across our `kube.Ctl` usage.

This resulted in some really odd behavior, with the Kubernetes API server mangling resources because of conflicting field manager entries. For example, service ports are merged by the identity of their (protocol, port) tuple. Suppose an old field manager claims ownership of the service port (tcp, 9092) named "kafka", and we then apply, with the new manager, a version of our CRD in which the port is overridden to 19092. Because the field manager names differ, the API server keeps both ports, (tcp, 9092) and (tcp, 19092), each named "kafka", and the result fails validation.
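The port example can be reproduced with a small stand-alone model. This is only a sketch of the keyed-merge behavior described above, not the API server's actual implementation: the `Port` type, the `mergeByManager` and `duplicateNames` helpers, and the manager names `old-manager` and `new-manager` are all hypothetical and exist purely for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Port is a simplified, hypothetical stand-in for a Service port entry.
type Port struct {
	Name     string
	Protocol string
	Number   int
}

// portKey is the merge identity described above: (protocol, port).
type portKey struct {
	Protocol string
	Number   int
}

// mergeByManager loosely models a keyed list merge: an entry survives as long
// as any field manager still claims it, and entries are unioned by portKey.
func mergeByManager(owned map[string][]Port) []Port {
	merged := map[portKey]Port{}
	for _, ports := range owned {
		for _, p := range ports {
			merged[portKey{Protocol: p.Protocol, Number: p.Number}] = p
		}
	}
	out := make([]Port, 0, len(merged))
	for _, p := range merged {
		out = append(out, p)
	}
	// Sort for deterministic output; map iteration order is random.
	sort.Slice(out, func(i, j int) bool { return out[i].Number < out[j].Number })
	return out
}

// duplicateNames returns the port names that occur more than once, sorted.
func duplicateNames(ports []Port) []string {
	counts := map[string]int{}
	for _, p := range ports {
		counts[p.Name]++
	}
	var dups []string
	for name, n := range counts {
		if n > 1 {
			dups = append(dups, name)
		}
	}
	sort.Strings(dups)
	return dups
}

func main() {
	// Two differently named managers each claim a port called "kafka".
	owned := map[string][]Port{
		"old-manager": {{Name: "kafka", Protocol: "TCP", Number: 9092}},
		"new-manager": {{Name: "kafka", Protocol: "TCP", Number: 19092}},
	}
	merged := mergeByManager(owned)
	fmt.Println(len(merged), duplicateNames(merged))
}
```

Because the two entries have different (protocol, port) keys, neither replaces the other, and the merged list carries the name "kafka" twice. Had both applies gone through the same manager, the later apply would have dropped the 9092 entry instead.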

This has a knock-on effect that is even harder to resolve when the merged resources don't fail validation immediately. For example, StatefulSets will gladly accept duplicated port names in their pod template container definitions, but they then fail when they actually try to provision the Pods.

What this means is that we have to:

1. Clear all of the field managers that are mis-named
2. Assume ownership over all fields as they currently exist in the resources that we have created via server-side apply, so that
3. When re-reconciliation kicks in, resources that would otherwise fail validation succeed, and resources that were mangled by things like merged pod template container ports get cleaned up, because our proper field owner owns all of the relevant spec fields.
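Step 1 above can be sketched as a filter over an object's managed-fields entries. This is a minimal illustration, not the code in this PR: `ManagedFieldsEntry` here is a local, trimmed-down stand-in for the entry type found in an object's `metadata.managedFields`, and `dropUnwantedManagers` along with the manager names `stale-manager` and `expected-manager` are hypothetical. Step 2 (a forced server-side apply under the proper field manager) needs a live API server and is not modeled.

```go
package main

import "fmt"

// ManagedFieldsEntry is a local stand-in that models only the two fields
// this sketch needs; the real Kubernetes type carries more.
type ManagedFieldsEntry struct {
	Manager   string
	Operation string // "Apply" or "Update"
}

// dropUnwantedManagers removes every entry whose manager name is in the
// unwanted set and reports which names were removed. Entries belonging to
// other managers are left untouched.
func dropUnwantedManagers(entries []ManagedFieldsEntry, unwanted map[string]bool) (kept []ManagedFieldsEntry, removed []string) {
	for _, e := range entries {
		if unwanted[e.Manager] {
			removed = append(removed, e.Manager)
			continue
		}
		kept = append(kept, e)
	}
	return kept, removed
}

func main() {
	entries := []ManagedFieldsEntry{
		{Manager: "stale-manager", Operation: "Apply"},
		{Manager: "expected-manager", Operation: "Apply"},
		{Manager: "kube-controller-manager", Operation: "Update"},
	}
	kept, removed := dropUnwantedManagers(entries, map[string]bool{"stale-manager": true})
	fmt.Println(len(kept), removed)
	// A real migration would then write the kept entries back and follow up
	// with a forced apply under the expected manager (step 2).
}
```

The sketch matches against an explicit set of unwanted names rather than dropping everything except the expected manager, so that entries written by unrelated controllers survive.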

We resolve this with a post-upgrade migration job that removes any unwanted field managers from the relevant resources related to the Redpanda and Console CRDs and forcibly assumes ownership of their fields with the proper field manager. Our reconcilers then pick up and fix any malformed resources.

(cherry picked from commit f1112cb)

# Conflicts:
#	acceptance/go.mod
#	acceptance/go.sum
#	acceptance/steps/register.go
#	charts/connectors/go.mod
#	charts/connectors/go.sum
#	charts/console/go.mod
#	charts/console/go.sum
#	charts/redpanda/go.mod
#	charts/redpanda/go.sum
#	charts/redpanda/render_state_nogotohelm.go
#	flake.nix
#	gen/go.mod
#	gen/go.sum
#	go.work.sum
#	gotohelm/go.mod
#	gotohelm/go.sum
#	gotohelm/testdata/src/example/go.mod
#	gotohelm/testdata/src/example/go.sum
#	harpoon/go.mod
#	harpoon/go.sum
#	licenses/third_party.md
#	operator/chart/rbac.go
#	operator/chart/templates/_chart.go.tpl
#	operator/chart/templates/_rbac.go.tpl
#	operator/chart/testdata/template-cases.golden.txtar
#	operator/cmd/main.go
#	operator/cmd/run/run.go
#	operator/go.mod
#	operator/go.sum
#	operator/multicluster/render_state_nogotohelm.go
#	pkg/go.mod
#	pkg/go.sum
@andrewstucki
Contributor

backporting to 25.1.3 isn't necessary, looks like the regression was added during a refactor in the stretch that was the 25.2.x release
