Description
Hello,
We have been running coroot-node-agent in our Kubernetes clusters as a DaemonSet for some time. We configured it to send traces to Tempo, which also runs locally in the same cluster, and we scrape metrics from :80/metrics with Prometheus.
In low-traffic environments everything is fine, but when we deploy the agent to a cluster with high traffic, we occasionally observe that some pods temporarily become unresponsive. What we have observed so far:
- The amount of traces sent to Tempo goes down
- Prometheus requests to the metrics endpoint hang, and Prometheus fires an exporter-down alert
- CPU usage goes down
- There are no error logs
- The listener on :80 is still alive, but it hangs when I make a test request (see the sketch after this list)
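For reference, here is roughly how I check that last point. This is a minimal hypothetical probe, not agent code; the address and timeouts are placeholders. A plain TCP connect still succeeds, which is presumably why the tcpSocket liveness/readiness probes in the manifest below keep passing, while an HTTP GET to /metrics with a short deadline just hangs:

// probe.go: hypothetical check, not agent code. A raw TCP connect succeeds
// (roughly what the tcpSocket probes verify), while an HTTP GET to /metrics
// with a short deadline times out on the problematic pods.
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

func main() {
	addr := "127.0.0.1:80" // placeholder; e.g. after kubectl port-forward to the problematic pod

	// TCP-level check: this is all the tcpSocket probes look at.
	if conn, err := net.DialTimeout("tcp", addr, 2*time.Second); err != nil {
		fmt.Println("tcp connect failed:", err)
	} else {
		fmt.Println("tcp connect ok")
		conn.Close()
	}

	// HTTP-level check: what Prometheus actually does; this is what hangs.
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get("http://" + addr + "/metrics")
	if err != nil {
		fmt.Println("http GET failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("http GET ok:", resp.Status)
}

With a much longer client timeout the GET does eventually return, which matches the curl behaviour described below.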
When I port-forward to my local machine and make an HTTP GET to the /metrics endpoint of the problematic pod, curl hangs for 2-3 minutes and then I get a response containing all the metrics. After I get the response, I see that the pod prints lots of error logs like:
E0105 13:23:53.538909 869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.2.252:33862: write: broken pipe
E0105 13:23:53.538909 869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.2.252:33862: write: broken pipe
E0105 13:23:53.538909 869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.2.252:33862: write: broken pipe
E0105 13:23:53.538909 869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.2.252:33862: write: broken pipe
E0105 13:23:54.066131 869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.11.104:54516: write: broken pipe
E0105 13:23:54.066131 869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.11.104:54516: write: broken pipe
E0105 13:23:54.066131 869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.11.104:54516: write: broken pipe
E0105 13:23:54.066131 869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.11.104:54516: write: broken pipe
E0105 13:23:55.143771 869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.2.252:53082: write: broken pipe
E0105 13:23:55.143771 869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.2.252:53082: write: broken pipe
E0105 13:23:55.143771 869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.2.252:53082: write: broken pipe
E0105 13:23:55.143771 869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.2.252:53082: write: broken pipe
E0105 13:23:55.571720 869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.11.104:38062: write: broken pipe
E0105 13:23:55.571720 869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.11.104:38062: write: broken pipe
E0105 13:23:55.571720 869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.11.104:38062: write: broken pipe
E0105 13:23:55.571720 869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.11.104:38062: write: broken pipe
I assume these happen because of Prometheus: it opens a connection to :80 and closes it after $scrape_timeout seconds, while the agent is still trying to write the response.
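To illustrate the mechanism I'm assuming (a minimal hypothetical sketch, not the agent's code): a handler that spends longer producing its response than the client is willing to wait gets a write error on every chunk it sends once the client has closed the connection, which looks just like the repeated log lines above.

// brokenpipe.go: hypothetical illustration of the assumed mechanism, not the
// agent's code. The handler takes longer to produce its response than the
// client waits; once the client (standing in for Prometheus and its
// scrape_timeout) gives up and closes the connection, the late writes
// typically fail with "broken pipe" or "connection reset by peer".
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(5 * time.Second)    // simulate very slow metrics collection
		chunk := make([]byte, 16*1024) // large enough that each iteration hits the socket
		for i := 0; i < 20; i++ {
			if _, err := w.Write(chunk); err != nil {
				log.Printf("error sending metric family: %v", err) // repeats for every chunk, like the logs above
			}
			if f, ok := w.(http.Flusher); ok {
				f.Flush()
			}
		}
	})
	go func() {
		_ = http.ListenAndServe("127.0.0.1:8080", nil) // sketch: error handling omitted
	}()
	time.Sleep(200 * time.Millisecond) // give the server a moment to start

	// The "Prometheus" side: give up after 2 seconds, like a short scrape_timeout.
	client := &http.Client{Timeout: 2 * time.Second}
	if _, err := client.Get("http://127.0.0.1:8080/metrics"); err != nil {
		log.Println("scrape gave up:", err)
	}

	time.Sleep(5 * time.Second) // keep the process alive until the handler hits the write errors
}

The sketch only reproduces the error pattern; the open question is why the real /metrics handler takes minutes to respond in the first place.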
After this happened, I checked previous issues and saw that CPU and memory profiles were requested. I don't have them right now; when the issue happens again I will collect them and attach them to this issue.
You can see our manifest here:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: coroot-node-agent-daemonset
spec:
  minReadySeconds: 0
  revisionHistoryLimit: 2
  template:
    metadata:
      annotations:
        prometheus.io/scrape_0_enabled: "true"
        prometheus.io/scrape_0_interval: 60s
        prometheus.io/scrape_0_path: /metrics
        prometheus.io/scrape_0_port: "80"
        prometheus.io/scrape_0_scheme: HTTP
        prometheus.io/scrape_0_timeout: 50s
    spec:
      containers:
      - args:
        - --cgroupfs-root
        - /host/sys/fs/cgroup
        env:
        - name: TRACES_ENDPOINT
          value: http://tempo-main-service:4318/v1/traces
        - name: TRACES_SAMPLING
          value: "0.05"
        - name: DISABLE_LOG_PARSING
          value: "true"
        - name: CGROUPFS_ROOT
          value: /host/sys/fs/cgroup
        - name: MAX_SPOOL_SIZE
          value: 64MB
        - name: SCRAPE_INTERVAL
          value: 60s
        image: ghcr.io/coroot/coroot-node-agent:1.26.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          initialDelaySeconds: 0
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: 80
          timeoutSeconds: 1
        name: coroot-node-agent
        ports:
        - containerPort: 80
          name: metrics
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          initialDelaySeconds: 0
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: 80
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 500m
            memory: 1024Mi
          requests:
            cpu: 100m
            memory: 128Mi
        securityContext:
          allowPrivilegeEscalation: true
          privileged: true
          readOnlyRootFilesystem: true
          runAsNonRoot: false
        startupProbe:
          failureThreshold: 3
          initialDelaySeconds: 0
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: 80
          timeoutSeconds: 1
        volumeMounts:
        - mountPath: /host/sys/fs/cgroup
          name: cgroup-fs
          readOnly: true
        - mountPath: /sys/kernel/debug
          name: debug-fs
          readOnly: false
      dnsPolicy: ClusterFirst
      hostNetwork: false
      hostPID: true
      restartPolicy: Always
      securityContext:
        fsGroupChangePolicy: Always
        runAsNonRoot: false
      setHostnameAsFQDN: false
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoExecute
        key: ""
        operator: Exists
      - effect: NoSchedule
        key: ""
        operator: Exists
      volumes:
      - hostPath:
          path: /sys/fs/cgroup
          type: ""
        name: cgroup-fs
      - hostPath:
          path: /sys/kernel/debug
          type: ""
        name: debug-fs
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 25%
    type: RollingUpdate

You can see the Prometheus metrics from the time we had the problem in the attached screenshots. The issue starts around 16:12, and the problematic pod is coroot-node-agent-daemonset-6k8vn, which is obvious in the screenshots. You can ignore what happens after 16:25; at that point I had to intervene and delete all the pods in the DaemonSet:
