Help scaling prometheus deploy #9162
-
Hello, the team I'm on is having trouble setting up linkerd-viz. Whenever we scale up the Prometheus deployment, it ends up requesting all of the node's memory and gets OOMKilled. Giving it more pods has the same result. Our cluster has around 2000 pods with a lot of internal traffic. Is there a way to make the Prometheus that ships with linkerd-viz scale to handle this workload? I searched for existing discussions, but the most similar one had no answers: #6087.

Linkerd version: 2.11.4
Replies: 1 comment 1 reply
-
These are largely Prometheus questions more than Linkerd questions. From the Linkerd perspective, you could give Prometheus more memory; you could use an off-cluster Prometheus with more capacity; you could use a third-party metrics provider; or you could alter linkerd-viz's scrape config to store only a subset of the metrics. Either way, note that linkerd-viz deploys a single, in-memory instance of Prometheus that does not preserve data across restarts, so if you plan to rely on these metrics for anything important, e.g. incident management, you may want to look at other options.
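To give the bundled Prometheus more memory, here is a minimal Helm values sketch, assuming the linkerd-viz chart's `prometheus.resources` values follow the same `cpu`/`memory`, `request`/`limit` layout as Linkerd's other charts (verify against the values.yaml of the chart version you run; the numbers are placeholders):

```yaml
# values.yaml for the linkerd-viz chart (sketch).
prometheus:
  resources:
    memory:
      # Placeholders: size these from the observed working set of the
      # Prometheus pod shortly before it gets OOMKilled.
      request: 8Gi
      limit: 16Gi
```

Also note that adding replicas does not shard the load: each Prometheus replica scrapes the full target set independently, which is why scaling out reproduced the OOMKill instead of spreading it.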
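For the off-cluster option, Linkerd's "bring your own Prometheus" setup disables the bundled instance and points the viz extension at an existing one. A sketch, with the service URL as a placeholder for your own deployment:

```yaml
# Helm values sketch: run no Prometheus in linkerd-viz and point the
# extension at an external instance instead.
prometheus:
  enabled: false
prometheusUrl: http://prometheus.monitoring.svc.cluster.local:9090
```

The external instance still needs Linkerd's scrape configuration for the proxies, but it can then run with persistent storage, more memory, or a horizontally scalable backend such as Thanos or Cortex.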
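For storing only a subset of metrics, standard Prometheus `metric_relabel_configs` rules can drop series at scrape time; where this lives depends on how you override the bundled config (Helm values or the generated ConfigMap). A sketch that drops the latency histogram buckets, which tend to dominate memory on high-traffic meshes; beware that this particular rule would blank the latency panels in the viz dashboard, so aim the regex at metrics you genuinely do not use:

```yaml
# Prometheus scrape-config fragment (sketch); the job name and regex are
# illustrative and should be adapted to the jobs in your actual config.
scrape_configs:
  - job_name: linkerd-proxy
    # ... existing kubernetes_sd_configs / relabel_configs unchanged ...
    metric_relabel_configs:
      # Drop histogram bucket series at scrape time so they are never stored.
      - source_labels: [__name__]
        regex: response_latency_ms_bucket
        action: drop
```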