Help scaling prometheus deploy #9162
-
Hello, the team I'm on is having trouble setting up linkerd-viz. Whenever we scale up the Prometheus deployment, it ends up requesting all of the node's memory and gets OOMKilled. Giving it more pods has the same result. Our cluster has around 2000 pods with a lot of internal traffic. Is there a way to make the Prometheus that ships with linkerd-viz scale to handle this workload? I searched for existing discussions, but the most similar one had no answers: #6087.

Linkerd version: 2.11.4
Replies: 1 comment 1 reply
-
These are largely Prometheus questions more than Linkerd questions. From the Linkerd perspective, you could give Prometheus more memory; you could use an off-cluster Prometheus with more capacity; you could use a third-party metrics provider; or you could alter linkerd-viz's scrape config to store only a subset of the metrics. Either way, note that linkerd-viz deploys a single, in-memory instance of Prometheus that does not preserve data across restarts, so if you plan to rely on these metrics for anything important, e.g. incident management, you may want to look at other options.
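To give the bundled Prometheus more memory, here is a minimal Helm values sketch, assuming the linkerd-viz chart's `prometheus.resources` values follow the same `cpu`/`memory`, `request`/`limit` layout as Linkerd's other charts (verify against the values.yaml of the chart version you run; the numbers are placeholders):

```yaml
# values.yaml for the linkerd-viz chart (sketch).
prometheus:
  resources:
    memory:
      # Placeholders: size these from the observed working set of the
      # Prometheus pod shortly before it gets OOMKilled.
      request: 8Gi
      limit: 16Gi
```

Also note that adding replicas does not shard the load: each Prometheus replica scrapes the full target set independently, which is why scaling out reproduced the OOMKill instead of spreading it.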
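For the off-cluster option, Linkerd's "bring your own Prometheus" setup disables the bundled instance and points the viz extension at an existing one. A sketch, with the service URL as a placeholder for your own deployment:

```yaml
# Helm values sketch: run no Prometheus in linkerd-viz and point the
# extension at an external instance instead.
prometheus:
  enabled: false
prometheusUrl: http://prometheus.monitoring.svc.cluster.local:9090
```

The external instance still needs Linkerd's scrape configuration for the proxies, but it can then run with persistent storage, more memory, or a horizontally scalable backend such as Thanos or Cortex.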
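For storing only a subset of metrics, standard Prometheus `metric_relabel_configs` rules can drop series at scrape time; where this lives depends on how you override the bundled config (Helm values or the generated ConfigMap). A sketch that drops the latency histogram buckets, which tend to dominate memory on high-traffic meshes; beware that this particular rule would blank the latency panels in the viz dashboard, so aim the regex at metrics you genuinely do not use:

```yaml
# Prometheus scrape-config fragment (sketch); the job name and regex are
# illustrative and should be adapted to the jobs in your actual config.
scrape_configs:
  - job_name: linkerd-proxy
    # ... existing kubernetes_sd_configs / relabel_configs unchanged ...
    metric_relabel_configs:
      # Drop histogram bucket series at scrape time so they are never stored.
      - source_labels: [__name__]
        regex: response_latency_ms_bucket
        action: drop
```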