Add migration job to handle mismatched field managers #1249
Merged
andrewstucki merged 9 commits into main on Jan 30, 2026
paulohtb6
reviewed
Jan 29, 2026
RafalKorepta
requested changes
Jan 29, 2026
Contributor
You need to add MigrationJobServiceAccount and PostUpgradeMigrationJob in
redpanda-operator/operator/chart/chart.go
Lines 51 to 64 in 0723c18
Contributor
Author
@RafalKorepta can you take a look again? I'm going to work on wiring up an acceptance/regression test for this now and fixing anything that breaks. In addition, I'm likely going to add a single-pass SSA on any resources that need to be updated, with the field specs as-is, just so that our field manager will pick up any orphaned fields as part of the migration.
RafalKorepta
approved these changes
Jan 29, 2026
github-actions bot
pushed a commit
that referenced
this pull request
Jan 30, 2026
In previous versions of the operator, when some of our synchronization code was ported to the `kube` library, a bug disallowing setting the field manager was introduced (see redpanda-data/common-go#126 for the relevant fix in the `kube` package as it exists today). Additionally, we have been inconsistent with the way we have set the field manager across our `kube.Ctl` usage.
This was resulting in some really odd behavior with the Kubernetes API server mangling resources due to conflicting field management versions. For example, service ports get merged via an identity of their (protocol, port) tuple. Having an old field manager saying it owned the service port (tcp, 9092) which was named "kafka" and then applying, with the new manager, a version of our CRD where the port was overwritten to be 19092 was resulting in the API server seeing both, due to the conflicting field manager names, ports (tcp, 9092) and (tcp, 19092) named "kafka", which failed validation.
This has an even more difficult to resolve knock-on effect when the resources being merged don't fail validation immediately. For example, StatefulSets will gladly take duplicated port names in their pod template container definitions. However, when they go to actually provision the Pods, then they will fail to.
What this means is that we have to:
1. Clear all of the field managers that are mis-named
2. Assume ownership over all fields as they currently exist in the resources that we have created via server-side apply, so that
3. When re-reconciliation kicks in, not only will resources that would otherwise fail validation succeed, but resources that are mangled due to things like pod template container ports being merged, will get cleared up due to our proper field owner owning all of the relevant spec fields.
The way this is resolved is through a post-upgrade migration job that was added to remove any unwanted field managers of any relevant resources related to Redpanda and Console CRDs, and forcibly assume ownership over their fields with the proper field manager. Subsequently our reconcilers will pick up and fix any malformed resources.
(cherry picked from commit f1112cb)
# Conflicts:
# acceptance/go.mod
# acceptance/go.sum
# acceptance/steps/register.go
# charts/connectors/go.mod
# charts/connectors/go.sum
# charts/console/go.mod
# charts/console/go.sum
# charts/redpanda/go.mod
# charts/redpanda/go.sum
# charts/redpanda/render_state_nogotohelm.go
# flake.nix
# gen/go.mod
# gen/go.sum
# go.work.sum
# gotohelm/go.mod
# gotohelm/go.sum
# gotohelm/testdata/src/example/go.mod
# gotohelm/testdata/src/example/go.sum
# harpoon/go.mod
# harpoon/go.sum
# licenses/third_party.md
# operator/chart/rbac.go
# operator/chart/templates/_chart.go.tpl
# operator/chart/templates/_rbac.go.tpl
# operator/chart/testdata/template-cases.golden.txtar
# operator/cmd/main.go
# operator/cmd/run/run.go
# operator/go.mod
# operator/go.sum
# operator/multicluster/render_state_nogotohelm.go
# pkg/go.mod
# pkg/go.sum
github-actions bot
pushed a commit
that referenced
this pull request
Jan 30, 2026
(cherry picked from commit f1112cb)
# Conflicts:
# acceptance/go.mod
# acceptance/go.sum
# charts/connectors/go.mod
# charts/connectors/go.sum
# charts/console/go.mod
# charts/console/go.sum
# charts/redpanda/go.mod
# charts/redpanda/go.sum
# gen/go.mod
# gen/go.sum
# go.work.sum
# gotohelm/go.mod
# gotohelm/go.sum
# gotohelm/testdata/src/example/go.mod
# gotohelm/testdata/src/example/go.sum
# harpoon/go.mod
# harpoon/go.sum
# licenses/third_party.md
# operator/chart/templates/_rbac.go.tpl
# operator/chart/testdata/template-cases.golden.txtar
# operator/cmd/main.go
# operator/go.mod
# operator/go.sum
# operator/multicluster/render_state_nogotohelm.go
# pkg/go.mod
# pkg/go.sum
github-actions bot
pushed a commit
that referenced
this pull request
Jan 30, 2026
(cherry picked from commit f1112cb)
# Conflicts:
# acceptance/go.mod
# acceptance/go.sum
# charts/connectors/go.mod
# charts/connectors/go.sum
# charts/console/go.mod
# charts/console/go.sum
# charts/redpanda/go.mod
# charts/redpanda/go.sum
# gen/go.mod
# gen/go.sum
# go.work.sum
# gotohelm/go.mod
# gotohelm/go.sum
# gotohelm/testdata/src/example/go.mod
# gotohelm/testdata/src/example/go.sum
# harpoon/go.mod
# harpoon/go.sum
# licenses/third_party.md
# operator/chart/templates/_rbac.go.tpl
# operator/chart/testdata/template-cases.golden.txtar
# operator/cmd/main.go
# operator/go.mod
# operator/go.sum
# operator/multicluster/render_state_nogotohelm.go
# pkg/go.mod
# pkg/go.sum
💚 All backports created successfully
Note: Successful backport PRs will be merged automatically after passing CI. Questions? Please refer to the Backport tool documentation and see the GitHub Actions logs for details.
andrewstucki
added a commit
that referenced
this pull request
Jan 30, 2026
…rs (#1249) (#1252)
Co-authored-by: Andrew Stucki <andrew.stucki@redpanda.com>
Cover Letter
In previous versions of the operator, when some of our synchronization code was ported to the `kube` library, a bug that prevented setting the field manager was introduced (see redpanda-data/common-go#126 for the relevant fix in the `kube` package as it exists today). Additionally, we have been inconsistent in how we set the field manager across our `kube.Ctl` usage.
This resulted in some really odd behavior, with the Kubernetes API server mangling resources due to conflicting field management versions. For example, service ports are merged using their (protocol, port) tuple as the identity. Suppose an old field manager claims ownership of the service port (tcp, 9092), named "kafka", and we then apply, with the new manager, a version of our CRD where the port is overwritten to be 19092. Because the field manager names conflict, the API server ends up seeing both ports (tcp, 9092) and (tcp, 19092), each named "kafka", which fails validation.
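The port merge described above can be illustrated with a small, self-contained model. This is only a sketch, not the API server's actual merge code: the types are simplified stand-ins, and the manager names `old-manager` and `new-manager` are hypothetical. It shows how keying list entries by (protocol, port) keeps both entries when two managers each own one, and how that leaves a duplicated port name behind.

```go
package main

import (
	"fmt"
	"sort"
)

// PortKey models the (protocol, port) tuple used as the merge identity.
type PortKey struct {
	Protocol string
	Port     int
}

// ServicePort is a simplified stand-in for a Service port entry.
type ServicePort struct {
	Name     string
	Protocol string
	Port     int
}

// MergeOwned unions the ports owned by every manager, keyed by (protocol, port).
// Entries with different keys never replace each other, even if they share a name.
func MergeOwned(owned map[string][]ServicePort) []ServicePort {
	merged := map[PortKey]ServicePort{}
	for _, ports := range owned {
		for _, p := range ports {
			merged[PortKey{Protocol: p.Protocol, Port: p.Port}] = p
		}
	}
	out := make([]ServicePort, 0, len(merged))
	for _, p := range merged {
		out = append(out, p)
	}
	// Sort for a stable result, since map iteration order is random.
	sort.Slice(out, func(i, j int) bool {
		if out[i].Port != out[j].Port {
			return out[i].Port < out[j].Port
		}
		return out[i].Protocol < out[j].Protocol
	})
	return out
}

// DuplicateNames returns every port name that appears more than once.
func DuplicateNames(ports []ServicePort) []string {
	counts := map[string]int{}
	for _, p := range ports {
		counts[p.Name]++
	}
	var dups []string
	for name, n := range counts {
		if n > 1 {
			dups = append(dups, name)
		}
	}
	sort.Strings(dups)
	return dups
}

func main() {
	// Hypothetical manager names; each one owns a different "kafka" port.
	owned := map[string][]ServicePort{
		"old-manager": {{Name: "kafka", Protocol: "TCP", Port: 9092}},
		"new-manager": {{Name: "kafka", Protocol: "TCP", Port: 19092}},
	}
	merged := MergeOwned(owned)
	fmt.Println(len(merged), DuplicateNames(merged)) // prints "2 [kafka]"
}
```

If only one manager owned the port list, the stale (tcp, 9092) entry would drop out on the next apply instead of lingering next to the new one.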
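This has a knock-on effect that is even harder to resolve when the merged resources don't fail validation immediately. For example, StatefulSets will gladly accept duplicated port names in their pod template container definitions, but then fail when they actually go to provision the Pods.
What this means is that we have to:
1. Clear all of the field managers that are misnamed
2. Assume ownership over all fields as they currently exist in the resources that we have created via server-side apply, so that
3. When re-reconciliation kicks in, not only will resources that would otherwise fail validation succeed, but resources that are mangled due to things like pod template container ports being merged will get cleaned up, because our proper field owner owns all of the relevant spec fields.
This is resolved through a post-upgrade migration job that removes any unwanted field managers from resources related to the Redpanda and Console CRDs and forcibly assumes ownership over their fields with the proper field manager. Subsequently, our reconcilers will pick up and fix any malformed resources.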
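As a rough illustration of the "clear the unwanted field managers" part of the migration, here is a minimal sketch that is not taken from the actual job. `ManagedFieldsEntry` is a trimmed stand-in for `metav1.ManagedFieldsEntry`, `PruneManagers` is a hypothetical helper, and the manager names are made up. Assuming ownership of the remaining fields would then be a forced server-side apply under the proper manager, which is only noted in a comment here.

```go
package main

import "fmt"

// ManagedFieldsEntry is a trimmed stand-in for metav1.ManagedFieldsEntry,
// keeping only the fields this sketch needs.
type ManagedFieldsEntry struct {
	Manager   string
	Operation string // "Apply" or "Update"
}

// PruneManagers (hypothetical helper) drops every entry whose manager is in
// unwanted and reports whether anything was removed, so callers can skip
// writing back resources that are already clean.
func PruneManagers(entries []ManagedFieldsEntry, unwanted map[string]bool) ([]ManagedFieldsEntry, bool) {
	kept := make([]ManagedFieldsEntry, 0, len(entries))
	for _, e := range entries {
		if unwanted[e.Manager] {
			continue
		}
		kept = append(kept, e)
	}
	return kept, len(kept) != len(entries)
}

func main() {
	// All manager names below are assumptions made for illustration.
	entries := []ManagedFieldsEntry{
		{Manager: "legacy-manager", Operation: "Apply"},
		{Manager: "proper-manager", Operation: "Apply"},
		{Manager: "kubectl-edit", Operation: "Update"},
	}
	kept, changed := PruneManagers(entries, map[string]bool{"legacy-manager": true})
	fmt.Println(changed, len(kept))
	for _, e := range kept {
		fmt.Println(e.Manager, e.Operation)
	}
	// In a real migration, the next step would be a forced server-side apply
	// of the resource as it currently exists, under the proper field manager.
}
```

Returning the `changed` flag keeps the pass idempotent: running it again on an already migrated resource removes nothing and reports no change.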
Attached are two quick scripted recreations of what we were experiencing with Services and StatefulSets:
service-demo.sh
statefulset-demo.sh