bug: agent arbitrarily hanging #266

@serhatcetinkaya

Description

Hello,

We have been using coroot-node-agent in our Kubernetes clusters as a DaemonSet for some time. We configured it to send traces to Tempo, which also runs locally in the same cluster, and we scrape metrics from :80/metrics with Prometheus.

In low-traffic environments everything is fine, but when we deploy the agent to a high-traffic cluster we occasionally observe that some pods temporarily become unresponsive. What we have observed so far:

  • The number of traces sent to Tempo drops
  • Prometheus requests to the metrics endpoint hang, and Prometheus fires an exporter-down alert
  • CPU usage drops
  • There are no error logs
  • The listener on :80 is still alive, but it hangs when I make a test request (see the sketch below)
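
To make that last point concrete, here is roughly the kind of check I run from outside the pod (a minimal standalone Go sketch; the pod address and the 5-second budget are placeholders rather than values taken from the agent). A plain TCP connect to :80 still succeeds, which is presumably why the tcpSocket probes keep passing, while an HTTP GET to /metrics does not finish within the deadline:

package main

import (
	"context"
	"io"
	"log"
	"net"
	"net/http"
	"time"
)

func main() {
	addr := "10.57.8.224:80" // placeholder: IP:port of the problematic pod

	// A plain TCP connect is all the tcpSocket probes check,
	// and it still succeeds on the hanging pod.
	conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
	if err != nil {
		log.Fatalf("tcp dial failed: %v", err)
	}
	conn.Close()
	log.Println("tcp dial: ok")

	// A full HTTP GET to /metrics with a deadline is what actually hangs.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://"+addr+"/metrics", nil)
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatalf("GET /metrics did not complete in time: %v", err)
	}
	defer resp.Body.Close()
	n, _ := io.Copy(io.Discard, resp.Body)
	log.Printf("GET /metrics: status=%d, %d bytes", resp.StatusCode, n)
}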

When I port-forward to my local machine and make an HTTP GET to the /metrics endpoint of a problematic pod, curl hangs for 2-3 minutes and then returns a response containing all the metrics. After the response arrives, the pod prints lots of error logs like:

E0105 13:23:53.538909  869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.2.252:33862: write: broken pipe
E0105 13:23:53.538909  869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.2.252:33862: write: broken pipe
E0105 13:23:53.538909  869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.2.252:33862: write: broken pipe
E0105 13:23:53.538909  869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.2.252:33862: write: broken pipe
E0105 13:23:54.066131  869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.11.104:54516: write: broken pipe
E0105 13:23:54.066131  869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.11.104:54516: write: broken pipe
E0105 13:23:54.066131  869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.11.104:54516: write: broken pipe
E0105 13:23:54.066131  869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.11.104:54516: write: broken pipe
E0105 13:23:55.143771  869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.2.252:53082: write: broken pipe
E0105 13:23:55.143771  869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.2.252:53082: write: broken pipe
E0105 13:23:55.143771  869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.2.252:53082: write: broken pipe
E0105 13:23:55.143771  869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.2.252:53082: write: broken pipe
E0105 13:23:55.571720  869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.11.104:38062: write: broken pipe
E0105 13:23:55.571720  869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.11.104:38062: write: broken pipe
E0105 13:23:55.571720  869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.11.104:38062: write: broken pipe
E0105 13:23:55.571720  869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.11.104:38062: write: broken pipe

I assume these happen because of Prometheus: it opens a connection to :80 and closes it after $scrape_timeout seconds, so the agent's subsequent writes to that connection fail.
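
As an illustration of that failure mode, here is a minimal self-contained Go sketch (not coroot's code; the 3-second sleep and the 1-second client timeout are arbitrary stand-ins for slow metric collection and for Prometheus' $scrape_timeout). The handler keeps writing metric lines after the client has given up and closed the connection, and once the buffered writes are flushed to the closed socket, every further write fails with a broken-pipe style error:

package main

import (
	"fmt"
	"log"
	"net"
	"net/http"
	"time"
)

func main() {
	// Stand-in for the agent's /metrics handler: collecting the metrics is
	// slow, then a large response body is written out line by line.
	mux := http.NewServeMux()
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(3 * time.Second) // simulated slow collection
		for i := 0; i < 5000; i++ {
			// Once the client has closed the connection, flushing these
			// writes fails and all the handler can do is log the error,
			// analogous to the "broken pipe" lines in the agent logs.
			if _, err := fmt.Fprintf(w, "dummy_metric_%05d 1\n", i); err != nil {
				log.Printf("write to closed scrape connection failed: %v", err)
				return
			}
		}
	})

	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	go http.Serve(ln, mux)

	// Stand-in for Prometheus: a scrape whose timeout (1s) is shorter than
	// the time the handler needs to respond, so the connection is closed
	// before the response body is written.
	client := &http.Client{Timeout: 1 * time.Second}
	_, err = client.Get("http://" + ln.Addr().String() + "/metrics")
	log.Printf("scrape result: %v", err)

	// Give the handler time to run into the write error and log it.
	time.Sleep(5 * time.Second)
}

That would explain the bursts of identical errors per scrape, but not why producing the metrics takes minutes in the first place.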

After this happened, I checked previous issues and saw that CPU and memory profiles were requested in similar cases. I don't have them right now; when the issue happens again I will collect them and attach them to this issue.

You can see our manifest here:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: coroot-node-agent-daemonset
spec:
  minReadySeconds: 0
  revisionHistoryLimit: 2
  template:
    metadata:
      annotations:
        prometheus.io/scrape_0_enabled: "true"
        prometheus.io/scrape_0_interval: 60s
        prometheus.io/scrape_0_path: /metrics
        prometheus.io/scrape_0_port: "80"
        prometheus.io/scrape_0_scheme: HTTP
        prometheus.io/scrape_0_timeout: 50s
    spec:
      containers:
        - args:
            - --cgroupfs-root
            - /host/sys/fs/cgroup
          env:
            - name: TRACES_ENDPOINT
              value: http://tempo-main-service:4318/v1/traces
            - name: TRACES_SAMPLING
              value: "0.05"
            - name: DISABLE_LOG_PARSING
              value: "true"
            - name: CGROUPFS_ROOT
              value: /host/sys/fs/cgroup
            - name: MAX_SPOOL_SIZE
              value: 64MB
            - name: SCRAPE_INTERVAL
              value: 60s
          image: ghcr.io/coroot/coroot-node-agent:1.26.1
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            initialDelaySeconds: 0
            periodSeconds: 10
            successThreshold: 1
            tcpSocket:
              port: 80
            timeoutSeconds: 1
          name: coroot-node-agent
          ports:
            - containerPort: 80
              name: metrics
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            initialDelaySeconds: 0
            periodSeconds: 10
            successThreshold: 1
            tcpSocket:
              port: 80
            timeoutSeconds: 1
          resources:
            limits:
              cpu: 500m
              memory: 1024Mi
            requests:
              cpu: 100m
              memory: 128Mi
          securityContext:
            allowPrivilegeEscalation: true
            privileged: true
            readOnlyRootFilesystem: true
            runAsNonRoot: false
          startupProbe:
            failureThreshold: 3
            initialDelaySeconds: 0
            periodSeconds: 10
            successThreshold: 1
            tcpSocket:
              port: 80
            timeoutSeconds: 1
          volumeMounts:
            - mountPath: /host/sys/fs/cgroup
              name: cgroup-fs
              readOnly: true
            - mountPath: /sys/kernel/debug
              name: debug-fs
              readOnly: false
      dnsPolicy: ClusterFirst
      hostNetwork: false
      hostPID: true
      restartPolicy: Always
      securityContext:
        fsGroupChangePolicy: Always
        runAsNonRoot: false
      setHostnameAsFQDN: false
      terminationGracePeriodSeconds: 30
      tolerations:
        - effect: NoExecute
          key: ""
          operator: Exists
        - effect: NoSchedule
          key: ""
          operator: Exists
      volumes:
        - hostPath:
            path: /sys/fs/cgroup
            type: ""
          name: cgroup-fs
        - hostPath:
            path: /sys/kernel/debug
            type: ""
          name: debug-fs
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 25%
    type: RollingUpdate

The attached screenshots show the Prometheus metrics from the time we had the problem. The issue starts around 16:12, and the problematic pod, coroot-node-agent-daemonset-6k8vn, is obvious in the screenshots. You can ignore what happens after 16:25; at that point I had to intervene and delete all the pods in the DaemonSet.
