kustomization-controller shows significant lag when there are failing kustomizations #5622

@naegleria

Describe the bug

Hi,
This is another report of kustomize-controller slowness; I'm opening it separately because the existing issues seemed unrelated. We have a K8s cluster with about 128 Kustomizations in the flux-system namespace (deployed according to https://github.com/fluxcd/flux2-kustomize-helm-example), and 11 of them are failing:

service-1       30d    False     health check failed after 44.855714ms: failed early due to stalled resources: [Deployment/namespace/service-1 status: 'Failed']
service-2       30d    False     health check failed after 78.726975ms: failed early due to stalled resources: [Deployment/namespace/service-2 status: 'Failed']
service-3       7d1h   False     health check failed after 9m30.015478679s: timeout waiting for: [Deployment/namespace/service-3: 'NotFound']
service-4       4d8h   False     health check failed after 9m30.014903975s: timeout waiting for: [Deployment/namespace/service-4 status: 'NotFound']
service-5       6d9h   False     health check failed after 9m30.015001798s: timeout waiting for: [Deployment/namespace/service-5 status: 'NotFound']
service-6       52d    Unknown   Reconciliation in progress
service-7       7d1h   False     health check failed after 47.145335ms: failed early due to stalled resources: [Deployment/namespace/service-7 status: 'Failed']
service-8       7d1h   Unknown   Reconciliation in progress
service-9       4d8h   Unknown   Reconciliation in progress
service-10      52d    False     health check failed after 9m30.017330668s: timeout waiting for: [Deployment/namespace/service-10 status: 'NotFound']
service-11      73d    False     health check failed after 9m30.01314632s: timeout waiting for: [Deployment/namespace/service-11 status: 'NotFound']

All of them have health checks enabled, and all of them fail at the HelmRelease component. I'm not sure why the Kustomization reports 3 services as 'Reconciliation in progress', since the HelmRelease fault there is the same as for most of the others: the Helm chart is missing.

When I then created a new Kustomization to deploy cert-manager, I noticed that even though source-controller had picked up the Git repository change, the Kustomization resource for cert-manager was not created. Without my changing anything, it finally appeared some minutes later, but even then it was created without a status:

kubectl -n flux-system describe kustomizations.kustomize.toolkit.fluxcd.io cert-manager
...
...
Status:
  Observed Generation:  -1
Events:                 <none>

Some minutes after that, it finally ran the deployment and reported ReconciliationSucceeded. Here is the Kustomization YAML:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cert-manager
  namespace: flux-system
spec:
  interval: 10m
  targetNamespace: cert-manager
  sourceRef:
    kind: GitRepository
    name: infra-repo
  path: "./cert-manager"
  prune: true
  suspend: false
  patches:
    - patch: |-
        apiVersion: helm.toolkit.fluxcd.io/v2
        kind: HelmRelease
        metadata:
          name: cert-manager
        spec:
          releaseName: cert-manager
          chart:
            spec:
              version: "v1.19.1"

Note that I use spec.interval: 10m for all my Kustomizations and I don't use wait: anywhere. When I added more resources to the cert-manager Kustomization, it again didn't reconcile immediately after source-controller picked up the changes. Perhaps it isn't supposed to, because I made the changes in the infra-repo GitRepository, which is where it reads the HelmRelease from, and not in the flux-system cluster directory, where the Kustomization itself lives. Either way, it went well beyond the 10-minute interval:

Events:
  Type    Reason                   Age   From                  Message
  ----    ------                   ----  ----                  -------
  Normal  ReconciliationSucceeded  35m   kustomize-controller  Reconciliation finished in 132.178339ms, next run in 10m0s
  Normal  Progressing              2m5s  kustomize-controller  ClusterIssuer/letsencrypt-production created
ClusterIssuer/letsencrypt-staging created
  Normal  ReconciliationSucceeded  2m5s  kustomize-controller  Reconciliation finished in 117.488053ms, next run in 10m0s 

The event says the next run is in 10m, but the gap was actually about 33m. The lag was persistent, not a one-off:

Events:
  Type    Reason                   Age    From                  Message
  ----    ------                   ----   ----                  -------
  Normal  Progressing              56m    kustomize-controller  ClusterIssuer/letsencrypt-production-dns created
ClusterIssuer/letsencrypt-staging-dns created
  Normal  ReconciliationSucceeded  56m    kustomize-controller  Reconciliation finished in 117.488053ms, next run in 10m0s
  Normal  ReconciliationSucceeded  37m    kustomize-controller  Reconciliation finished in 114.742958ms, next run in 10m0s
  Normal  ReconciliationSucceeded  24m    kustomize-controller  Reconciliation finished in 105.526995ms, next run in 10m0s
  Normal  ReconciliationSucceeded  5m31s  kustomize-controller  Reconciliation finished in 157.43214ms, next run in 10m0s

I checked other, older Kustomizations and saw the same thing there: they were all lagging.
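To put numbers on the lag, here is a small sketch that turns the Age column of the ReconciliationSucceeded events above into gaps between consecutive runs. The helper names `age_to_seconds` and `gaps` are made up for this illustration, and kubectl rounds ages, so the results are approximate.

```python
import re


def age_to_seconds(age: str) -> int:
    """Convert a kubectl-style age such as '5m31s' or '7d1h' to seconds."""
    units = {"d": 86400, "h": 3600, "m": 60, "s": 1}
    parts = re.findall(r"(\d+)([dhms])", age)
    if not parts:
        raise ValueError(f"unrecognised age: {age!r}")
    return sum(int(number) * units[unit] for number, unit in parts)


def gaps(ages: list[str]) -> list[int]:
    """Seconds between consecutive events, given ages listed oldest first."""
    seconds = [age_to_seconds(age) for age in ages]
    return [older - newer for older, newer in zip(seconds, seconds[1:])]


# Ages of the ReconciliationSucceeded events from the second listing above.
observed = ["56m", "37m", "24m", "5m31s"]

for gap in gaps(observed):
    # Each gap should be close to the configured 10m interval.
    print(f"{gap // 60}m{gap % 60}s between runs (interval is 10m0s)")
```

Every gap comes out above the configured 10m interval, which matches what the events show by eye.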

The next day I ran a test. First I timed manual reconciliations of the cert-manager Kustomization:

time flux -n flux-system reconcile kustomization cert-manager
► annotating Kustomization cert-manager in flux-system namespace
✔ Kustomization annotated
◎ waiting for Kustomization reconciliation
✔ applied revision main@sha1:d10978a476b2326b655574247aa938d18b884ab2

real    0m42.324s
user    0m0.068s
sys     0m0.043s
time flux -n flux-system reconcile kustomization cert-manager
► annotating Kustomization cert-manager in flux-system namespace
✔ Kustomization annotated
◎ waiting for Kustomization reconciliation
✔ applied revision main@sha1:d10978a476b2326b655574247aa938d18b884ab2

real    0m10.311s
user    0m0.044s
sys     0m0.033s
time flux -n flux-system reconcile kustomization cert-manager
► annotating Kustomization cert-manager in flux-system namespace
✔ Kustomization annotated
◎ waiting for Kustomization reconciliation
✔ applied revision main@sha1:d10978a476b2326b655574247aa938d18b884ab2

real    4m56.326s
user    0m0.238s
sys     0m0.079s
time flux -n flux-system reconcile kustomization cert-manager
► annotating Kustomization cert-manager in flux-system namespace
✔ Kustomization annotated
◎ waiting for Kustomization reconciliation
✗ context deadline exceeded

real    5m0.029s
user    0m0.213s
sys     0m0.105s

The lag varied from run to run; the last one took more than 5 minutes, so the CLI timed out.
Next I removed all the failing Kustomizations (the 11 services) and timed manual reconciliation again:

time flux -n flux-system reconcile kustomization cert-manager
► annotating Kustomization cert-manager in flux-system namespace
✔ Kustomization annotated
◎ waiting for Kustomization reconciliation
✔ applied revision main@sha1:d10978a476b2326b655574247aa938d18b884ab2

real    0m2.351s
user    0m0.046s
sys     0m0.018s

real    0m2.359s
real    0m2.525s
real    0m2.360s
real    0m2.313s

It consistently took about 2 seconds, and the automatic reconciliation stuck to the 10-minute interval without lag:

Events:
  Type    Reason                   Age    From                  Message
  ----    ------                   ----   ----                  -------
  Normal  ReconciliationSucceeded  51m    kustomize-controller  Reconciliation finished in 107.841779ms, next run in 10m0s
  Normal  ReconciliationSucceeded  42m    kustomize-controller  Reconciliation finished in 119.475433ms, next run in 10m0s
  Normal  ReconciliationSucceeded  32m    kustomize-controller  Reconciliation finished in 107.592005ms, next run in 10m0s
  Normal  ReconciliationSucceeded  22m    kustomize-controller  Reconciliation finished in 112.545199ms, next run in 10m0s
  Normal  ReconciliationSucceeded  12m    kustomize-controller  Reconciliation finished in 164.414134ms, next run in 10m0s
  Normal  ReconciliationSucceeded  2m27s  kustomize-controller  Reconciliation finished in 101.173028ms, next run in 10m0s
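For a side-by-side view of the manual reconcile timings, this sketch parses the `real` lines pasted above and compares the two runs. `real_seconds` is a helper written just for this comparison. The last "before" value is a lower bound, because the CLI gave up at its 5-minute deadline.

```python
import re
from statistics import median


def real_seconds(line: str) -> float:
    """Parse a shell `time` line such as 'real    4m56.326s' into seconds."""
    match = re.fullmatch(r"real\s+(\d+)m(\d+(?:\.\d+)?)s", line.strip())
    if match is None:
        raise ValueError(f"not a 'real' timing line: {line!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Timings with the 11 failing Kustomizations still present.
before = [real_seconds(line) for line in (
    "real    0m42.324s",
    "real    0m10.311s",
    "real    4m56.326s",
    "real    5m0.029s",  # CLI timed out, so this is a lower bound
)]

# Timings after the failing Kustomizations were removed.
after = [real_seconds(line) for line in (
    "real    0m2.351s",
    "real    0m2.359s",
    "real    0m2.525s",
    "real    0m2.360s",
    "real    0m2.313s",
)]

print(f"before: median {median(before):.1f}s, max {max(before):.1f}s")
print(f"after:  median {median(after):.1f}s, max {max(after):.1f}s")
```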

Flux is bootstrapped in the cluster at version 2.5.1. Unfortunately I can't upgrade it any time soon, because the K8s version itself is 1.30.11 and upgrading that is completely out of my hands.
I also checked the resource usage of the pods during the issue and it was well within limits: CPU was basically idle and memory was nowhere near the limit.
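One possible explanation, which is only a guess and not something confirmed in this report, is that the failing health checks keep reconcile workers busy while they wait for their timeout. Idle CPU would be consistent with that. The back-of-the-envelope sketch below estimates how much worker time the five Kustomizations that time out after about 9m30s would consume per 10m interval. The worker count of 4 is an assumption (check the `--concurrent` argument on your kustomize-controller deployment), and `worker_utilisation` is a hypothetical helper, not part of Flux.

```python
def worker_utilisation(blocking_jobs: int, busy_seconds: float,
                       interval_seconds: float, workers: int) -> float:
    """Share of total worker time per interval used by jobs that block.

    A value above 1.0 means the blocking jobs alone need more worker time
    than exists, so other reconciliations would have to queue.
    """
    if workers <= 0 or interval_seconds <= 0:
        raise ValueError("workers and interval_seconds must be positive")
    demand = blocking_jobs * busy_seconds      # worker-seconds needed
    capacity = workers * interval_seconds      # worker-seconds available
    return demand / capacity


# From the listing above: five Kustomizations time out after about 9m30s.
timeouts = 5
busy = 9 * 60 + 30        # seconds each one holds a worker (assumed)
interval = 10 * 60        # spec.interval of 10m
workers = 4               # ASSUMPTION: not stated anywhere in this report

utilisation = worker_utilisation(timeouts, busy, interval, workers)
print(f"estimated utilisation from blocked health checks: {utilisation:.2f}")
```

If that model is roughly right, the result is above 1.0 under these assumptions, which would explain why healthy Kustomizations wait in the queue. It ignores the three Kustomizations stuck in 'Reconciliation in progress' and any retry behaviour, so treat it as a sanity check only.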

Steps to reproduce

Have around 10 Kustomizations with failing health checks.
Manually reconcile a working Kustomization and time how long it takes.

Expected behavior

Kustomization reconciliations shouldn't lag even if there are failing kustomizations.

Screenshots and recordings

No response

OS / Distro

N/A

Flux version

v2.5.1

Flux check

► checking prerequisites
✗ flux 2.5.1 <2.7.3 (new CLI version is available, please upgrade)
✔ Kubernetes 1.30.11 >=1.30.0-0
► checking version in cluster
✔ distribution: flux-v2.5.1
✔ bootstrapped: true
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v1.2.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v1.5.1
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v1.5.0
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v1.5.0
► checking crds
✔ alerts.notification.toolkit.fluxcd.io/v1beta3
✔ buckets.source.toolkit.fluxcd.io/v1
✔ gitrepositories.source.toolkit.fluxcd.io/v1
✔ helmcharts.source.toolkit.fluxcd.io/v1
✔ helmreleases.helm.toolkit.fluxcd.io/v2
✔ helmrepositories.source.toolkit.fluxcd.io/v1
✔ kustomizations.kustomize.toolkit.fluxcd.io/v1
✔ ocirepositories.source.toolkit.fluxcd.io/v1beta2
✔ providers.notification.toolkit.fluxcd.io/v1beta3
✔ receivers.notification.toolkit.fluxcd.io/v1
✔ all checks passed

Git provider

No response

Container Registry provider

No response

Additional context

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
