[VC-41078] Reduce memory usage by removing the Replicaset data gatherer from the default config#658

Merged
wallrj merged 1 commit into master from remove-replicasets
May 29, 2025

Conversation


@wallrj wallrj commented May 28, 2025

ReplicaSet resources are ignored by the TLSPK backend service, so there is no point in collecting them.

In a Kind cluster with 20k Replicasets, this reduced the peak resident set memory usage from 522 MB to 85 MB.

See the testing section below. Other resources in the cluster: 53 Nodes, 52 Deployments, 350 Pods, 4 Secrets.

On a busy cluster with frequent Helm upgrades and/or Deployment rollouts, there may be a large number of ReplicaSet resources for previous revisions of each Deployment. This depends on the Deployment revisionHistoryLimit, which is 10 by default.
It may also depend on the helm upgrade --history-max value, which is also 10 by default:

$ helm upgrade --help
...
      --history-max int                            limit the maximum number of revisions saved per release. Use 0 for no limit (default 10)
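To get a feel for the scale, here is some illustrative worst-case arithmetic using the default revisionHistoryLimit and the Deployment count from the test cluster below (the numbers are assumptions for the sketch, not a measurement):

```shell
# Worst-case stale ReplicaSets retained with the Kubernetes default
# revisionHistoryLimit, for the 52 Deployments in the test cluster below.
deployments=52
revision_history_limit=10   # Kubernetes default
echo "$(( deployments * revision_history_limit )) retained old ReplicaSets"
```

With a higher limit (or frequent rollouts before garbage collection), the count grows proportionally.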

xref: https://venafi.atlassian.net/browse/VC-41078

Testing

Before:

$ cat /proc/$(pidof preflight)/status | grep Vm
VmPeak:  1753432 kB
VmSize:  1753432 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:    522736 kB
VmRSS:    403084 kB
VmData:   532024 kB
VmStk:       132 kB
VmExe:     26984 kB
VmLib:         8 kB
VmPTE:      1156 kB
VmSwap:        0 kB

After:

$ cat /proc/$(pidof preflight)/status | grep Vm
VmPeak:  1287844 kB
VmSize:  1287844 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:     65176 kB
VmRSS:     65176 kB
VmData:    66436 kB
VmStk:       132 kB
VmExe:     26984 kB
VmLib:         8 kB
VmPTE:       248 kB
VmSwap:        0 kB

# After running for a few minutes

$ cat /proc/$(pidof preflight)/status | grep Vm
VmPeak:  1287844 kB
VmSize:  1287844 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:     77720 kB
VmRSS:     76424 kB
VmData:    82820 kB
VmStk:       132 kB
VmExe:     26984 kB
VmLib:         8 kB
VmPTE:       276 kB
VmSwap:        0 kB

# Some time later

$ cat /proc/$(pidof preflight)/status | grep Vm
VmPeak:  1288100 kB
VmSize:  1288100 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:     84828 kB
VmRSS:     84828 kB
VmData:    87172 kB
VmStk:       132 kB
VmExe:     26984 kB
VmLib:         8 kB
VmPTE:       284 kB
VmSwap:        0 kB
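The headline reduction can be sanity-checked from the VmHWM lines above (a quick arithmetic sketch, using the kB figures from the before and after dumps):

```shell
# Peak resident set size (VmHWM) before and after, in kB, from the dumps above.
before_kb=522736
after_kb=84828
echo "reduction: $(( (before_kb - after_kb) * 100 / before_kb ))%"   # prints "reduction: 83%"
```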

Create a Kind cluster:

$ kind version
kind v0.27.0 go1.23.6 linux/amd64

$ kind create cluster
...

$ kubectl version
Client Version: v1.32.3
Kustomize Version: v5.5.0
Server Version: v1.32.2

Deploy venafi-kubernetes-agent:

$ venctl installation cluster connect \
    --name "richardw-cluster-connect-test-20" \
    --api-key $VEN_API_KEY \
    --no-prompts \
    --owning-team RichardW

Remove memory limit:

kubectl patch deployment venafi-kubernetes-agent \
    -n venafi \
    --type=json \
    -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/resources/limits"}]'

Enable pprof:

kubectl patch deployment venafi-kubernetes-agent \
        -n venafi \
        --type=json \
        -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--enable-pprof"}]'

Use kwok to create some fake nodes and realistic pods:

$ kwok \
   --kubeconfig=~/.kube/config \
   --manage-all-nodes=false \
   --manage-nodes-with-annotation-selector=kwok.x-k8s.io/node=fake \
   --manage-nodes-with-label-selector= \
   --manage-single-node= \
   --cidr=10.0.0.1/24 \
   --node-ip=10.0.0.1 \
   --node-lease-duration-seconds=40

Create ~45 Nodes and ~45 Deployments by running the following commands repeatedly:

kubectl create -f - <<EOF
apiVersion: v1
kind: Node
metadata:
  annotations:
    node.alpha.kubernetes.io/ttl: "0"
    kwok.x-k8s.io/node: fake
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    kubernetes.io/arch: amd64
    kubernetes.io/os: linux
    kubernetes.io/role: agent
    node-role.kubernetes.io/agent: ""
    type: kwok
  generateName: kwok-node-
spec:
  taints: # Avoid scheduling actual running pods to fake Node
  - effect: NoSchedule
    key: kwok.x-k8s.io/node
    value: fake
status:
  allocatable:
    cpu: 32
    memory: 256Gi
    pods: 110
  capacity:
    cpu: 32
    memory: 256Gi
    pods: 110
  nodeInfo:
    architecture: amd64
    bootID: ""
    containerRuntimeVersion: ""
    kernelVersion: ""
    kubeProxyVersion: fake
    kubeletVersion: fake
    machineID: ""
    operatingSystem: linux
    osImage: ""
    systemUUID: ""
  phase: Running
EOF


kubectl create -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  generateName: fake-
  namespace: default
spec:
  replicas: 1
  revisionHistoryLimit: 1000
  selector:
    matchLabels:
      app: fake-pod
  template:
    metadata:
      labels:
        app: fake-pod
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: type
                operator: In
                values:
                - kwok
      # A taint is added to the automatically created Node.
      # Remove the taint from the Node, or add this toleration.
      tolerations:
      - key: "kwok.x-k8s.io/node"
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: fake-container
        image: fake-image
EOF

Run kubectl rollout restart deployment in a while loop, to start creating replicasets:

while kubectl rollout restart deploy; do sleep 1; done
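Each pass of the loop adds one retained ReplicaSet per Deployment (up to revisionHistoryLimit: 1000), so roughly this many passes are needed to reach ~20k ReplicaSets (illustrative arithmetic, not an exact count):

```shell
# Restart passes needed to accumulate ~20k ReplicaSets across 45 Deployments,
# assuming one new retained ReplicaSet per Deployment per pass.
deployments=45
target_replicasets=20000
echo "$(( (target_replicasets + deployments - 1) / deployments )) restart passes"
```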

This created ~20k Replicasets in the default namespace:

$ kubectl get --raw "/metrics" | grep apiserver_storage_objects | grep replicasets
apiserver_storage_objects{resource="replicasets.apps"} 20000
$ kubectl get rs -oname  | wc -l
19987

That is ~80 MiB of JSON data:

$ kubectl get rs -o json > replicasets.json
$ ls -lrth replicasets.json
-rw-r--r-- 1 richard richard 80M May 29 09:36 replicasets.json
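That works out to roughly 4 KiB of JSON per ReplicaSet (rough division of the ls figure, not an exact per-object measurement):

```shell
# Approximate per-object size: ~80 MiB of JSON across ~20k ReplicaSets.
total_bytes=$(( 80 * 1024 * 1024 ))
count=20000
echo "$(( total_bytes / count )) bytes per ReplicaSet (approx)"
```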

Measure the memory usage of the venafi-kubernetes-agent process (preflight) by reading /proc/<pid>/status:

$ cat /proc/$(pidof preflight)/status | grep Vm
VmPeak:  1753432 kB
VmSize:  1753432 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:    522736 kB
VmRSS:    505508 kB
VmData:   532024 kB
VmStk:       132 kB
VmExe:     26984 kB
VmLib:         8 kB
VmPTE:      1156 kB
VmSwap:        0 kB

Note the VmHWM value, which is the peak resident set size ("high water mark").
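To pull just that number out of a status dump, grep can be replaced with an awk match (a small convenience shown here on a saved sample line, not part of the original testing steps):

```shell
# Extract the VmHWM value (kB) from a saved copy of /proc/<pid>/status.
status_line='VmHWM:    522736 kB'
echo "$status_line" | awk '/^VmHWM/ {print $2}'   # prints 522736
```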

Remove the replicaset datagatherer from the configmap:

helm get values venafi-kubernetes-agent -n venafi -o yaml > values.yaml
helm template venafi-kubernetes-agent deploy/charts/venafi-kubernetes-agent \
    -n venafi \
    --show-only templates/configmap.yaml \
    --show-only templates/rbac.yaml \
    --values values.yaml \
  | kubectl apply -f -

Restart venafi-kubernetes-agent:

kubectl rollout restart deploy -n venafi venafi-kubernetes-agent
$ cat /proc/$(pidof preflight)/status | grep Vm
VmPeak:  1287844 kB
VmSize:  1287844 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:     77720 kB
VmRSS:     76424 kB
VmData:    82820 kB
VmStk:       132 kB
VmExe:     26984 kB
VmLib:         8 kB
VmPTE:       276 kB
VmSwap:        0 kB

Signed-off-by: Richard Wall <richard.wall@cyberark.com>
wallrj commented on the data-gatherer config for this resource in the diff:

resource-type:
  version: v1
  resource: replicasets
  group: apps
There is no dedicated Role or ClusterRole for this resource, because there is a ClusterRoleBinding to the built-in view ClusterRole:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: {{ include "venafi-kubernetes-agent.fullname" . }}-cluster-viewer
  labels:
    {{- include "venafi-kubernetes-agent.labels" . | nindent 4 }}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
subjects:
- kind: ServiceAccount
  name: {{ include "venafi-kubernetes-agent.serviceAccountName" . }}
  namespace: {{ .Release.Namespace }}

@wallrj wallrj changed the title WIP: [VC-41078] Remove ReplicaSet from the default config WIP: [VC-41078] Reduce memory usage by removing the Replicaset data gatherer from the default config May 29, 2025
@wallrj wallrj changed the title WIP: [VC-41078] Reduce memory usage by removing the Replicaset data gatherer from the default config [VC-41078] Reduce memory usage by removing the Replicaset data gatherer from the default config May 29, 2025

wallrj commented May 29, 2025

I've updated the PR description with some information about how I simulated a large cluster with many replicasets and measured the peak memory usage before and after removing the replicaset datagatherer.

Perhaps in future we can automate those steps.

I will attach some heap profiles and metrics to the Jira issue so that we can try to understand why so much memory is used by the agent.

I also ran the ./hack/e2e/test.sh script and observed the test pass.

@wallrj wallrj merged commit 8c2091c into master May 29, 2025
3 of 4 checks passed
@wallrj wallrj deleted the remove-replicasets branch May 29, 2025 10:01
