Description
Hello,
We have been running coroot-node-agent in our Kubernetes clusters as a DaemonSet for some time. We configured it to send traces to Tempo, which also runs locally in the same cluster, and we scrape metrics from :80/metrics with Prometheus.
In low-traffic environments everything is fine, but when we deploy the agent to a cluster with high traffic, we occasionally observe that some pods temporarily become unresponsive. What we have observed so far:
- The amount of traces sent to Tempo goes down
- Prometheus requests to the metrics endpoint hang, and Prometheus fires an exporter-down alert
- CPU usage goes down
- There are no error logs
- The listener on :80 is still alive, but it hangs when I make a test request (see the sketch after this list)
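For reference, here is roughly how I check that last point. This is a minimal hypothetical probe, not agent code; the address and timeouts are placeholders. A plain TCP connect still succeeds, which is presumably why the tcpSocket liveness/readiness probes in the manifest below keep passing, while an HTTP GET to /metrics with a short deadline just hangs:

// probe.go: hypothetical check, not agent code. A raw TCP connect succeeds
// (roughly what the tcpSocket probes verify), while an HTTP GET to /metrics
// with a short deadline times out on the problematic pods.
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

func main() {
	addr := "127.0.0.1:80" // placeholder; e.g. after kubectl port-forward to the problematic pod

	// TCP-level check: this is all the tcpSocket probes look at.
	if conn, err := net.DialTimeout("tcp", addr, 2*time.Second); err != nil {
		fmt.Println("tcp connect failed:", err)
	} else {
		fmt.Println("tcp connect ok")
		conn.Close()
	}

	// HTTP-level check: what Prometheus actually does; this is what hangs.
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get("http://" + addr + "/metrics")
	if err != nil {
		fmt.Println("http GET failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("http GET ok:", resp.Status)
}

With a much longer client timeout the GET does eventually return, which matches the curl behaviour described below.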
When I port-forward to my local machine and make an HTTP GET to the /metrics endpoint of the problematic pod, curl hangs for 2-3 minutes and then I get a response containing all the metrics. After I get the response, I see that the pod prints lots of error logs like:
E0105 13:23:53.538909 869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.2.252:33862: write: broken pipe
E0105 13:23:53.538909 869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.2.252:33862: write: broken pipe
E0105 13:23:53.538909 869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.2.252:33862: write: broken pipe
E0105 13:23:53.538909 869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.2.252:33862: write: broken pipe
E0105 13:23:54.066131 869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.11.104:54516: write: broken pipe
E0105 13:23:54.066131 869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.11.104:54516: write: broken pipe
E0105 13:23:54.066131 869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.11.104:54516: write: broken pipe
E0105 13:23:54.066131 869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.11.104:54516: write: broken pipe
E0105 13:23:55.143771 869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.2.252:53082: write: broken pipe
E0105 13:23:55.143771 869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.2.252:53082: write: broken pipe
E0105 13:23:55.143771 869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.2.252:53082: write: broken pipe
E0105 13:23:55.143771 869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.2.252:53082: write: broken pipe
E0105 13:23:55.571720 869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.11.104:38062: write: broken pipe
E0105 13:23:55.571720 869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.11.104:38062: write: broken pipe
E0105 13:23:55.571720 869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.11.104:38062: write: broken pipe
E0105 13:23:55.571720 869465 main.go:193] error encoding and sending metric family: write tcp 10.57.8.224:80->10.57.11.104:38062: write: broken pipe
I assume these happen because of Prometheus: it opens a connection to :80 and closes it after $scrape_timeout seconds, while the agent is still trying to write the response.
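To illustrate the mechanism I'm assuming (a minimal hypothetical sketch, not the agent's code): a handler that spends longer producing its response than the client is willing to wait gets a write error on every chunk it sends once the client has closed the connection, which looks just like the repeated log lines above.

// brokenpipe.go: hypothetical illustration of the assumed mechanism, not the
// agent's code. The handler takes longer to produce its response than the
// client waits; once the client (standing in for Prometheus and its
// scrape_timeout) gives up and closes the connection, the late writes
// typically fail with "broken pipe" or "connection reset by peer".
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(5 * time.Second)    // simulate very slow metrics collection
		chunk := make([]byte, 16*1024) // large enough that each iteration hits the socket
		for i := 0; i < 20; i++ {
			if _, err := w.Write(chunk); err != nil {
				log.Printf("error sending metric family: %v", err) // repeats for every chunk, like the logs above
			}
			if f, ok := w.(http.Flusher); ok {
				f.Flush()
			}
		}
	})
	go func() {
		_ = http.ListenAndServe("127.0.0.1:8080", nil) // sketch: error handling omitted
	}()
	time.Sleep(200 * time.Millisecond) // give the server a moment to start

	// The "Prometheus" side: give up after 2 seconds, like a short scrape_timeout.
	client := &http.Client{Timeout: 2 * time.Second}
	if _, err := client.Get("http://127.0.0.1:8080/metrics"); err != nil {
		log.Println("scrape gave up:", err)
	}

	time.Sleep(5 * time.Second) // keep the process alive until the handler hits the write errors
}

The sketch only reproduces the error pattern; the open question is why the real /metrics handler takes minutes to respond in the first place.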
After this happened, I checked previous issues and saw that CPU and memory profiles were requested. I don't have them right now; when the issue happens again I will collect them and attach them to this issue.
You can see our manifest here:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: coroot-node-agent-daemonset
spec:
  minReadySeconds: 0
  revisionHistoryLimit: 2
  template:
    metadata:
      annotations:
        prometheus.io/scrape_0_enabled: "true"
        prometheus.io/scrape_0_interval: 60s
        prometheus.io/scrape_0_path: /metrics
        prometheus.io/scrape_0_port: "80"
        prometheus.io/scrape_0_scheme: HTTP
        prometheus.io/scrape_0_timeout: 50s
    spec:
      containers:
      - args:
        - --cgroupfs-root
        - /host/sys/fs/cgroup
        env:
        - name: TRACES_ENDPOINT
          value: http://tempo-main-service:4318/v1/traces
        - name: TRACES_SAMPLING
          value: "0.05"
        - name: DISABLE_LOG_PARSING
          value: "true"
        - name: CGROUPFS_ROOT
          value: /host/sys/fs/cgroup
        - name: MAX_SPOOL_SIZE
          value: 64MB
        - name: SCRAPE_INTERVAL
          value: 60s
        image: ghcr.io/coroot/coroot-node-agent:1.26.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          initialDelaySeconds: 0
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: 80
          timeoutSeconds: 1
        name: coroot-node-agent
        ports:
        - containerPort: 80
          name: metrics
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          initialDelaySeconds: 0
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: 80
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 500m
            memory: 1024Mi
          requests:
            cpu: 100m
            memory: 128Mi
        securityContext:
          allowPrivilegeEscalation: true
          privileged: true
          readOnlyRootFilesystem: true
          runAsNonRoot: false
        startupProbe:
          failureThreshold: 3
          initialDelaySeconds: 0
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: 80
          timeoutSeconds: 1
        volumeMounts:
        - mountPath: /host/sys/fs/cgroup
          name: cgroup-fs
          readOnly: true
        - mountPath: /sys/kernel/debug
          name: debug-fs
          readOnly: false
      dnsPolicy: ClusterFirst
      hostNetwork: false
      hostPID: true
      restartPolicy: Always
      securityContext:
        fsGroupChangePolicy: Always
        runAsNonRoot: false
      setHostnameAsFQDN: false
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoExecute
        key: ""
        operator: Exists
      - effect: NoSchedule
        key: ""
        operator: Exists
      volumes:
      - hostPath:
          path: /sys/fs/cgroup
          type: ""
        name: cgroup-fs
      - hostPath:
          path: /sys/kernel/debug
          type: ""
        name: debug-fs
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 25%
    type: RollingUpdate

You can see the Prometheus metrics from the time we had the problem in the attached screenshots. The issue starts around 16:12, and the problematic pod is coroot-node-agent-daemonset-6k8vn, which is obvious in the screenshots. You can ignore what happens after 16:25; at that point I had to intervene and delete all the pods in the DaemonSet:
