Add migration job to handle mismatched field managers#1249

Merged
andrewstucki merged 9 commits into main from as/add-migration-job
Jan 30, 2026

Contributor

@andrewstucki andrewstucki commented Jan 29, 2026

Cover Letter

In previous versions of the operator, when some of our synchronization code was ported to the kube library, a bug was introduced that prevented the field manager from being set (see redpanda-data/common-go#126 for the relevant fix in the kube package as it exists today). Additionally, we have been inconsistent in how we set the field manager across our kube.Ctl usage.

This resulted in some really odd behavior, with the Kubernetes API server mangling resources due to conflicting field managers. For example, service ports are merged using their (protocol, port) tuple as an identity. With an old field manager claiming ownership of the service port (tcp, 9092) named "kafka", applying a version of our CRD with the new manager where the port was overridden to 19092 left the API server seeing both ports: because the field manager names differed, (tcp, 9092) and (tcp, 19092) both existed with the name "kafka", which failed validation.

This has a knock-on effect that is even more difficult to resolve when the merged resources don't fail validation immediately. For example, StatefulSets will gladly accept duplicated port names in their pod template container definitions, but then fail when they actually go to provision the Pods.
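
To make the merge behavior concrete, here is a minimal, standalone Go sketch of the failure mode. It does not use the Kubernetes API or the kube package; the types `servicePort` and `portKey`, the functions `mergeOwned` and `validate`, and the manager names "legacy-manager" and "operator" are all hypothetical stand-ins. It only models the idea that entries are keyed by (protocol, port), so two managers that each own a differently keyed "kafka" port both survive the merge.

```go
package main

import (
	"fmt"
	"sort"
)

// portKey models the (protocol, port) identity used to merge service ports.
type portKey struct {
	Protocol string
	Port     int
}

// servicePort is a pared-down, hypothetical stand-in for a Service port entry.
type servicePort struct {
	Name string
	portKey
}

// mergeOwned unions the ports owned by every field manager, keyed by
// (protocol, port). Entries with different keys never collapse into one,
// even when they share a name.
func mergeOwned(owned map[string][]servicePort) []servicePort {
	merged := map[portKey]servicePort{}
	for _, ports := range owned {
		for _, p := range ports {
			merged[p.portKey] = p
		}
	}
	out := make([]servicePort, 0, len(merged))
	for _, p := range merged {
		out = append(out, p)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Port < out[j].Port })
	return out
}

// validate rejects duplicate port names, modeling the validation failure
// described above.
func validate(ports []servicePort) error {
	seen := map[string]bool{}
	for _, p := range ports {
		if seen[p.Name] {
			return fmt.Errorf("duplicate port name %q", p.Name)
		}
		seen[p.Name] = true
	}
	return nil
}

func main() {
	// Assumed manager names, for illustration only.
	owned := map[string][]servicePort{
		"legacy-manager": {{Name: "kafka", portKey: portKey{"TCP", 9092}}},
		"operator":       {{Name: "kafka", portKey: portKey{"TCP", 19092}}},
	}
	merged := mergeOwned(owned)
	fmt.Println(len(merged), validate(merged))
}
```

With a single manager owning the port, the same merge yields one entry and validation passes, which is the state the migration aims to restore.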

What this means is that we have to:

  1. Clear all of the misnamed field managers.
  2. Assume ownership over all fields, as they currently exist, in the resources that we have created via server-side apply, so that
  3. When re-reconciliation kicks in, resources that would otherwise fail validation succeed, and resources that were mangled by things like merged pod template container ports get cleaned up, because our proper field owner now owns all of the relevant spec fields.

This is resolved through a post-upgrade migration job that removes any unwanted field managers from the relevant resources related to Redpanda and Console CRDs and forcibly assumes ownership over their fields with the proper field manager. Subsequently, our reconcilers will pick up and fix any malformed resources.
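
As a rough illustration of what the migration does to field ownership, here is a small in-memory Go model. It is a sketch under stated assumptions, not the job's actual implementation: `managedEntry` is a simplified, hypothetical stand-in for an entry in `metadata.managedFields`, `migrate` and `fieldsOf` are made-up helpers, and the manager names and field path strings are invented for the example. The real job works against the API server rather than on plain structs.

```go
package main

import (
	"fmt"
	"sort"
)

// managedEntry is a simplified, hypothetical stand-in for one
// metadata.managedFields entry: a manager name plus the field paths it owns.
type managedEntry struct {
	Manager string
	Fields  map[string]bool
}

// migrate removes every entry belonging to an unwanted manager and folds its
// fields into the entry for the proper manager. Entries for unrelated
// managers are left untouched.
func migrate(entries []managedEntry, proper string, unwanted map[string]bool) []managedEntry {
	owned := map[string]bool{}
	var kept []managedEntry
	for _, e := range entries {
		switch {
		case e.Manager == proper || unwanted[e.Manager]:
			for f := range e.Fields {
				owned[f] = true
			}
		default:
			kept = append(kept, e)
		}
	}
	return append(kept, managedEntry{Manager: proper, Fields: owned})
}

// fieldsOf returns the owned field paths in sorted order for stable printing.
func fieldsOf(e managedEntry) []string {
	out := make([]string, 0, len(e.Fields))
	for f := range e.Fields {
		out = append(out, f)
	}
	sort.Strings(out)
	return out
}

func main() {
	// Assumed manager names and field paths, for illustration only.
	entries := []managedEntry{
		{Manager: "legacy-manager", Fields: map[string]bool{"spec.ports[TCP/9092]": true}},
		{Manager: "operator", Fields: map[string]bool{"spec.ports[TCP/19092]": true}},
	}
	out := migrate(entries, "operator", map[string]bool{"legacy-manager": true})
	for _, e := range out {
		fmt.Println(e.Manager, fieldsOf(e))
	}
}
```

Once a single manager owns both port entries, a later apply from that manager that omits (tcp, 9092) is free to drop it, since no other manager still claims the field.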

Attached are two quick scripted recreations of what we were experiencing with Services and StatefulSets:

service-demo.sh
statefulset-demo.sh

Contributor

@RafalKorepta RafalKorepta left a comment

You need to add MigrationJobServiceAccount and PostUpgradeMigrationJob in

	manifests := []kube.Object{
		Issuer(dot),
		Certificate(dot),
		ConfigMap(dot),
		MetricsService(dot),
		WebhookService(dot),
		MutatingWebhookConfiguration(dot),
		ValidatingWebhookConfiguration(dot),
		ServiceAccount(dot),
		ServiceMonitor(dot),
		Deployment(dot),
		PreInstallCRDJob(dot),
		CRDJobServiceAccount(dot),
	}

@andrewstucki
Contributor Author

@RafalKorepta can you take a look again? I'm going to work on wiring up an acceptance/regression test for this now and fix anything that breaks. In addition, I'm likely going to add a single-pass SSA on any resources that need to be updated, with the field specs as-is, just so that our field manager will pick up any orphaned fields as part of the migration.

@andrewstucki andrewstucki enabled auto-merge (squash) January 29, 2026 22:42
Member

@gene-redpanda gene-redpanda left a comment

LGTM!

@andrewstucki andrewstucki disabled auto-merge January 29, 2026 22:55
@andrewstucki andrewstucki enabled auto-merge (squash) January 29, 2026 22:55
@andrewstucki andrewstucki merged commit f1112cb into main Jan 30, 2026
10 checks passed
github-actions bot pushed a commit that referenced this pull request Jan 30, 2026
(cherry picked from commit f1112cb)

# Conflicts:
#	acceptance/go.mod
#	acceptance/go.sum
#	acceptance/steps/register.go
#	charts/connectors/go.mod
#	charts/connectors/go.sum
#	charts/console/go.mod
#	charts/console/go.sum
#	charts/redpanda/go.mod
#	charts/redpanda/go.sum
#	charts/redpanda/render_state_nogotohelm.go
#	flake.nix
#	gen/go.mod
#	gen/go.sum
#	go.work.sum
#	gotohelm/go.mod
#	gotohelm/go.sum
#	gotohelm/testdata/src/example/go.mod
#	gotohelm/testdata/src/example/go.sum
#	harpoon/go.mod
#	harpoon/go.sum
#	licenses/third_party.md
#	operator/chart/rbac.go
#	operator/chart/templates/_chart.go.tpl
#	operator/chart/templates/_rbac.go.tpl
#	operator/chart/testdata/template-cases.golden.txtar
#	operator/cmd/main.go
#	operator/cmd/run/run.go
#	operator/go.mod
#	operator/go.sum
#	operator/multicluster/render_state_nogotohelm.go
#	pkg/go.mod
#	pkg/go.sum
github-actions bot pushed a commit that referenced this pull request Jan 30, 2026
(cherry picked from commit f1112cb)

# Conflicts:
#	acceptance/go.mod
#	acceptance/go.sum
#	charts/connectors/go.mod
#	charts/connectors/go.sum
#	charts/console/go.mod
#	charts/console/go.sum
#	charts/redpanda/go.mod
#	charts/redpanda/go.sum
#	gen/go.mod
#	gen/go.sum
#	go.work.sum
#	gotohelm/go.mod
#	gotohelm/go.sum
#	gotohelm/testdata/src/example/go.mod
#	gotohelm/testdata/src/example/go.sum
#	harpoon/go.mod
#	harpoon/go.sum
#	licenses/third_party.md
#	operator/chart/templates/_rbac.go.tpl
#	operator/chart/testdata/template-cases.golden.txtar
#	operator/cmd/main.go
#	operator/go.mod
#	operator/go.sum
#	operator/multicluster/render_state_nogotohelm.go
#	pkg/go.mod
#	pkg/go.sum
github-actions bot pushed a commit that referenced this pull request Jan 30, 2026
(cherry picked from commit f1112cb)

# Conflicts:
#	acceptance/go.mod
#	acceptance/go.sum
#	charts/connectors/go.mod
#	charts/connectors/go.sum
#	charts/console/go.mod
#	charts/console/go.sum
#	charts/redpanda/go.mod
#	charts/redpanda/go.sum
#	gen/go.mod
#	gen/go.sum
#	go.work.sum
#	gotohelm/go.mod
#	gotohelm/go.sum
#	gotohelm/testdata/src/example/go.mod
#	gotohelm/testdata/src/example/go.sum
#	harpoon/go.mod
#	harpoon/go.sum
#	licenses/third_party.md
#	operator/chart/templates/_rbac.go.tpl
#	operator/chart/testdata/template-cases.golden.txtar
#	operator/cmd/main.go
#	operator/go.mod
#	operator/go.sum
#	operator/multicluster/render_state_nogotohelm.go
#	pkg/go.mod
#	pkg/go.sum
@github-actions

💚 All backports created successfully

Status Branch Result
release/v25.1.x
release/v25.2.x
release/v25.3.x

Note: Successful backport PRs will be merged automatically after passing CI.

Questions?

Please refer to the Backport tool documentation and see the GitHub Action logs for details.

andrewstucki added a commit that referenced this pull request Jan 30, 2026
…rs (#1249) (#1252)

---------

Co-authored-by: Andrew Stucki <andrew.stucki@redpanda.com>