Description
hey; we've recently deployed kubenetmon in production and we're dealing with many labeling failures with the above messages.
the typical reasons are as follows:
- pod/Running, pod/Failed (failed pods)
- pod/Running, pod/Succeeded (cronjobs)
one possible idea i can think of:
if only one pod is in the Running state, and (now - pod.status.startTime) > 120s, then assume that's the only remaining pod that can be labelled.
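a rough sketch of that heuristic (pickLikelyPod, minAge and the call shape are just illustrative, not kubenetmon's actual API):

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// pickLikelyPod is a sketch only: if exactly one candidate pod is Running and it
// has been up longer than minAge, assume it is the only pod that can still own
// the connection; Succeeded/Failed pods are assumed to have already closed theirs.
func pickLikelyPod(pods []corev1.Pod, minAge time.Duration) (*corev1.Pod, bool) {
	var running []*corev1.Pod
	for i := range pods {
		if pods[i].Status.Phase == corev1.PodRunning {
			running = append(running, &pods[i])
		}
	}
	if len(running) != 1 {
		return nil, false
	}
	start := running[0].Status.StartTime
	if start == nil || time.Since(start.Time) < minAge {
		return nil, false
	}
	return running[0], true
}

a call site would then do something like pod, ok := pickLikelyPod(dstPods, 120*time.Second) and fall back to the current error path when ok is false.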
--
i was looking at some of the tcp timeout values to help arrive at this number (120s roughly matches the fin_wait/time_wait values below);
my fundamental assumption for this threshold is that the succeeded/failed pods close their connections before exiting. i have not actually validated this though..
relevant entries from my sysctl are as follows (AKS);
net.netfilter.nf_conntrack_tcp_timeout_close = 10
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 3600
net.netfilter.nf_conntrack_tcp_timeout_established = 86400
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_last_ack = 30
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 60
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 120
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_unacknowledged = 300
--
sidenote: the current log lines format pods with %+v, which is very verbose (it dumps the entire pod object), unnecessarily so; i had to tone it down with something like this:
builder := strings.Builder{}
for _, pod := range dstPods {
	builder.WriteString(fmt.Sprintf(" %s/%s/%s", pod.Namespace, pod.Name, pod.Status.Phase))
}
return nil, nil, fmt.Errorf("more than one pod maps to replySrc IP %v:%s", dstEndpointInfo.ip.String(), builder.String())