[Bug]: Node memory should not include terminated containers #1091

@bboreham

Description

What happened?

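By default, Prometheus keeps returning the last value of a cAdvisor metric for 5 minutes after the series disappears, because cAdvisor exposes explicit timestamps and such series do not get staleness markers.
This leads to artefacts when a pod restarts: the terminated container and its replacement both appear for that period. To illustrate: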

Image

So this query, for example, sums both series and double-counts:

'sum(node_namespace_pod_container:container_memory_working_set_bytes{%(clusterLabel)s="$cluster", node=~"$node", container!=""}) by (pod)' % $._config,

This can be fixed by enabling the track_timestamps_staleness scrape option, added in Prometheus v2.48, but the queries could also be amended. The example above could change to:

sum(max by (cluster, namespace, pod, container)(node_namespace_pod_container:container_memory_working_set_bytes{%(clusterLabel)s="$cluster", node=~"$node", container!=""})) by (pod)
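To make the arithmetic concrete, here is a minimal Python sketch of the two aggregations at a single instant. The sample values, the label values, and the `id` label that distinguishes the old and new container series are all assumptions for illustration; `aggregate` is a hypothetical helper, not part of any Prometheus API.

```python
from collections import defaultdict

# Hypothetical samples at one instant, shortly after a restart: the
# terminated container's last value is still returned next to the new one.
samples = [
    ({"cluster": "c1", "namespace": "ns", "pod": "web-0",
      "container": "app", "id": "old"}, 400),
    ({"cluster": "c1", "namespace": "ns", "pod": "web-0",
      "container": "app", "id": "new"}, 150),
    ({"cluster": "c1", "namespace": "ns", "pod": "web-0",
      "container": "sidecar", "id": "s1"}, 50),
]


def aggregate(samples, by, op):
    """Group samples by the label names in `by` and apply `op` to each group."""
    groups = defaultdict(list)
    for labels, value in samples:
        key = tuple(labels[name] for name in by)
        groups[key].append(value)
    return [(dict(zip(by, key)), op(values)) for key, values in groups.items()]


# sum(...) by (pod): both "app" series are added together.
naive = aggregate(samples, ["pod"], sum)

# sum(max by (cluster, namespace, pod, container)(...)) by (pod):
# duplicates collapse to one value per container before summing.
per_container = aggregate(
    samples, ["cluster", "namespace", "pod", "container"], max
)
deduped = aggregate(per_container, ["pod"], sum)

print(naive)    # [({'pod': 'web-0'}, 600)]
print(deduped)  # [({'pod': 'web-0'}, 450)]
```

Note that `max` keeps the larger of the two values, which in this sketch is the terminated container's; the amended query removes the double-count but does not guarantee that the live container's value is the one shown.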

What parts of the codebase are affected?

Dashboards

Labels

bug, keepalive
