Skip to content

ACK iam-controller restarts/cancelled contexts can lead to EntityAlreadyExists errors. #2497

@alionski

Description

@alionski

Describe the bug
The issue is potentially similar to #2103 but #2103 doesn't name the root cause so might be a separate bug.

We have observed the following:

  • an ACK Policy was created
  • ACK was preempted and thus moved to a different node/restarted
  • after the restart, the policy got stuck with status EntityAlreadyExists

Looking at the AWS Console, we see that the policy is there and it was created at May 23, 2025, 21:00 (UTC+02:00).
The k8s resource's creation timestamp is "2025-05-23T19:00:18Z".
Looking at the logs, we see that the creation timestamp coincides with the following error:

2025-05-23T19:00:19.808672231Z
error: RequestCanceled: request context canceled

It appears that the resource creation went through on the AWS side, and because of the cancelled context ACK IAM controller failed to update the status on the k8s resource, and thus after the restart it started treating it as a duplicate.

The k8s resource:

apiVersion: iam.services.k8s.aws/v1alpha1
kind: Policy
metadata:
  creationTimestamp: "2025-05-23T19:00:18Z"
  finalizers:
  - finalizers.iam.services.k8s.aws/Policy
  generation: 1
  labels:
    <redacted>
  name: <redacted>-policy
  namespace: default
  ownerReferences:
    < ... >
  resourceVersion: "22454815"
  uid: da98fe89-6d62-4e6d-a8d7-db3a34a77f65
spec:
  description: <redacted>
  name: <redacted>-policy
  path: /
  policyDocument: <redacted>
  tags:
    < ... >
status:
  ackResourceMetadata:
    ownerAccountID: "<redacted>"
    region: ap-east-1
  conditions:
  - message: "EntityAlreadyExists: A policy called <redacted>-policy already
      exists. Duplicate names are not allowed.\n\tstatus code: 409, request id: <some id>"
    status: "True"
    type: ACK.Recoverable
  - lastTransitionTime: "2025-05-26T13:16:49Z"
    message: Unable to determine if desired resource state matches latest observed
      state
    reason: "EntityAlreadyExists: A policy called <redacted>-policy already
      exists. Duplicate names are not allowed.\n\tstatus code: 409, request id: <some id>"
    status: Unknown
    type: ACK.ResourceSynced

Logs:

2025-05-23T19:00:19.804592193Z
message: Stopping and waiting for non leader election runnables

2025-05-23T19:00:19.804615114Z
message: Stopping and waiting for leader election runnables

2025-05-23T19:00:19.804622027Z
message: Shutdown signal received, waiting for all workers to finish

2025-05-23T19:00:19.804626056Z
message: Shutdown signal received, waiting for all workers to finish

2025-05-23T19:00:19.804629869Z
message: Shutdown signal received, waiting for all workers to finish

2025-05-23T19:00:19.804633073Z
message: Shutdown signal received, waiting for all workers to finish

2025-05-23T19:00:19.804636600Z
message: Shutdown signal received, waiting for all workers to finish

2025-05-23T19:00:19.804647902Z
message: Shutdown signal received, waiting for all workers to finish

2025-05-23T19:00:19.804652195Z
message: Shutdown signal received, waiting for all workers to finish

2025-05-23T19:00:19.804655897Z
message: Shutdown signal received, waiting for all workers to finish

2025-05-23T19:00:19.804660016Z
message: Shutdown signal received, waiting for all workers to finish

2025-05-23T19:00:19.804662889Z
message: Shutdown signal received, waiting for all workers to finish

2025-05-23T19:00:19.804764922Z
message: Shutdown signal received, waiting for all workers to finish

2025-05-23T19:00:19.804771343Z
message: Shutdown signal received, waiting for all workers to finish

2025-05-23T19:00:19.804774617Z
message: Shutdown signal received, waiting for all workers to finish

2025-05-23T19:00:19.804777556Z
message: Shutdown signal received, waiting for all workers to finish

2025-05-23T19:00:19.804780706Z
message: All workers finished

2025-05-23T19:00:19.804784236Z
message: All workers finished

2025-05-23T19:00:19.804795366Z
message: All workers finished

2025-05-23T19:00:19.804798865Z
message: All workers finished

2025-05-23T19:00:19.804802152Z
message: All workers finished

2025-05-23T19:00:19.804806649Z
message: All workers finished

2025-05-23T19:00:19.804815436Z
message: All workers finished

2025-05-23T19:00:19.804820649Z
message: All workers finished

2025-05-23T19:00:19.804824164Z
message: All workers finished

2025-05-23T19:00:19.804826969Z
message: All workers finished

2025-05-23T19:00:19.804852450Z
message: All workers finished

2025-05-23T19:00:19.804856040Z
message: All workers finished

2025-05-23T19:00:19.804859441Z
message: All workers finished

2025-05-23T19:00:19.808672231Z
error: RequestCanceled: request context canceled
caused by: context canceled
message: Reconciler error
stacktrace: sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:263
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:224

2025-05-23T19:00:19.808687782Z
message: All workers finished

2025-05-23T19:00:19.808692474Z
message: Stopping and waiting for caches

2025-05-23T19:00:19.808697464Z
message: pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:243: watch of *v1alpha1.OpenIDConnectProvider ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding

2025-05-23T19:00:19.808700987Z
message: pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:243: watch of *v1alpha1.AdoptedResource ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding

2025-05-23T19:00:19.808730730Z
message: pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:243: watch of *v1alpha1.Role ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding

2025-05-23T19:00:19.808734290Z
message: Stopping and waiting for webhooks

2025-05-23T19:00:19.808743872Z
message: Stopping and waiting for HTTP servers

2025-05-23T19:00:19.808747360Z
message: pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:243: watch of *v1alpha1.Role ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding

2025-05-23T19:00:19.808751152Z
message: Shutting down metrics server with timeout of 1 minute

2025-05-23T19:00:19.808754639Z
message: shutting down server

2025-05-23T19:00:19.808757809Z
message: Wait completed, proceeding to shutdown the manager

2025-05-23T19:01:06.585749008Z
message: initializing service controller

2025-05-23T19:01:06.685908336Z
message: Waited for the caches to sync

2025-05-23T19:01:06.846057805Z
message: starting manager

2025-05-23T19:01:06.846070712Z
message: Starting metrics server

2025-05-23T19:01:06.846074574Z
message: Serving metrics server

2025-05-23T19:01:06.846080582Z
message: starting server

2025-05-23T19:01:06.846084893Z
message: Starting EventSource

2025-05-23T19:01:06.846088548Z
message: Starting Controller

2025-05-23T19:01:06.846091597Z
message: Starting EventSource

2025-05-23T19:01:06.846094416Z
message: Starting Controller

2025-05-23T19:01:06.846097980Z
message: Starting EventSource

2025-05-23T19:01:06.846108437Z
message: Starting Controller

2025-05-23T19:01:06.846111350Z
message: Starting EventSource

2025-05-23T19:01:06.846114339Z
message: Starting Controller

2025-05-23T19:01:06.846118235Z
message: Starting EventSource

2025-05-23T19:01:06.846121149Z
message: Starting Controller

2025-05-23T19:01:06.846123905Z
message: Starting EventSource

2025-05-23T19:01:06.846126881Z
message: Starting Controller

2025-05-23T19:01:06.846130687Z
message: Starting EventSource

2025-05-23T19:01:06.846140248Z
message: Starting Controller

2025-05-23T19:01:06.846143327Z
message: Starting EventSource

2025-05-23T19:01:06.846146197Z
message: Starting Controller

2025-05-23T19:01:06.846152109Z
message: Starting EventSource

2025-05-23T19:01:06.846155018Z
message: Starting Controller

2025-05-23T19:01:06.846158029Z
message: Starting EventSource

2025-05-23T19:01:06.846161087Z
message: Starting Controller

2025-05-23T19:01:06.846163992Z
message: Starting EventSource

2025-05-23T19:01:06.846166777Z
message: Starting Controller

2025-05-23T19:01:06.846169552Z
message: Starting EventSource

2025-05-23T19:01:06.846172291Z
message: Starting Controller

2025-05-23T19:01:06.846175026Z
message: Starting EventSource

2025-05-23T19:01:06.846177875Z
message: Starting Controller

2025-05-23T19:01:06.846180581Z
message: Starting EventSource

2025-05-23T19:01:06.846183370Z
message: Starting Controller

2025-05-23T19:01:07.099318069Z
message: Starting workers

2025-05-23T19:01:07.099337153Z
message: Starting workers

2025-05-23T19:01:07.105984047Z
message: Starting workers

2025-05-23T19:01:07.105994587Z
message: Starting workers

2025-05-23T19:01:07.105998261Z
message: Starting workers

2025-05-23T19:01:07.106001684Z
message: Starting workers

2025-05-23T19:01:07.106004811Z
message: Starting workers

2025-05-23T19:01:07.106008222Z
message: Starting workers

2025-05-23T19:01:07.106011207Z
message: Starting workers

2025-05-23T19:01:07.106014007Z
message: Starting workers

2025-05-23T19:01:07.106018778Z
message: Starting workers

2025-05-23T19:01:07.106021590Z
message: Starting workers

2025-05-23T19:01:07.150131516Z
message: Starting workers

2025-05-23T19:01:07.199319848Z
message: Starting workers

2025-05-23T19:01:13.247343524Z
error: EntityAlreadyExists: A policy called <redacted>-policy already exists. Duplicate names are not allowed.
	status code: 409, request id: <some id>
message: Reconciler error
stacktrace: sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:263
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:224

2025-05-23T19:01:19.492725227Z
error: EntityAlreadyExists: A policy called <redacted>-policy already exists. Duplicate names are not allowed.
	status code: 409, request id: <some id>
message: Reconciler error
stacktrace: sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:263
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:224

< .. the error above continues forever ...>

Steps to reproduce

This seems to be tricky but the attempt to create the resource must coincide with the pod restarts/cancelled context, so:

  • create a resource
  • restart the iam controller pod and somehow make sure that the reconciliation attempt gets aborted / context gets cancelled.

Expected outcome
IAM controller either doesn't create the resource in AWS at all and instead reconciles it after the pod restart, or manages the full resources reconcilication before shutting down, i.e. sets the correct status and ARN before shutting down.

Environment

  • ACK iam-controller version:
    1.3.13

  • Kubernetes version:
    v1.30.11-eks-bcf3d70

  • Using EKS (yes/no), if so version?
    v1.30.11-eks-bcf3d70

  • AWS service targeted (S3, RDS, etc.):
    IAM

Metadata

Metadata

Assignees

Labels

service/allIndicates issues or PRs related to all the service controllers.service/iamIndicates issues or PRs that are related to iam-controller.

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions