Prometheus configuration and optimization for NetApp Harvest 2.0 #2563

oscarjim89 · 2023-12-21T17:18:33Z

Hello,
I'm excited about NetApp Harvest 2.0, and my team and I are planning to put it into production to replace the Harvest 1.0 installation we still run, which is based on Graphite and deployed on a server.
We want to take the opportunity to decommission that server and run NetApp Harvest 2.0 on Kubernetes.

One of our concerns is the Prometheus deployment. We don't have much experience with the product, and I'm worried about how to size and configure it correctly so that we don't run into memory problems.

After deploying a poller for one of our clusters, I saw that it exports about 12 MB of metrics, roughly 70,000 series. Not bad at all! To extrapolate to all our clusters, we would have to multiply this by 4.
While calculating how much storage I need for Prometheus, I concluded that I need to aggregate data over time to be able to keep at least one year of history.
Something like this:

During the first week, I need performance metrics per minute.
For non-performance metrics (capacity, for example), hourly values are enough.
After the first week, hourly metrics are enough for everything.
After one month, one value per day is enough.
If I do the above, I might need about 200 GB of local disk (otherwise, storing a year would take several TB of disk).
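
To make the effect of such a tiered plan concrete, here is a minimal Python sketch that counts how many samples each series would keep over one year, compared with keeping one sample per minute for the whole year. It is only arithmetic under stated assumptions: a month is taken as 30 days, every series (performance or not) is sampled per minute in the first week, and the function names are hypothetical. It counts samples, not bytes, and says nothing about how the downsampling itself would be implemented.

```python
MINUTES_PER_DAY = 24 * 60


def samples_per_series_tiered(total_days=365):
    """Samples kept per series under the three-tier plan (assumes a 30-day month)."""
    first_week = 7 * MINUTES_PER_DAY   # one sample per minute, days 1-7
    rest_of_month = (30 - 7) * 24      # one sample per hour, days 8-30
    rest_of_year = total_days - 30     # one sample per day after day 30
    return first_week + rest_of_month + rest_of_year


def samples_per_series_flat(total_days=365):
    """Samples kept per series when every per-minute sample is retained."""
    return total_days * MINUTES_PER_DAY


if __name__ == "__main__":
    series = 70_000 * 4  # about 70,000 series per cluster, four clusters
    tiered = samples_per_series_tiered()
    flat = samples_per_series_flat()
    print(f"tiered: {tiered:,} samples per series, {tiered * series:,} in total")
    print(f"flat:   {flat:,} samples per series, {flat * series:,} in total")
    print(f"reduction factor: {flat / tiered:.1f}x")
```

Multiplying the sample counts by a measured bytes-per-sample figure would turn this into a disk estimate.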

However, what worries me most is memory. I have been told that Prometheus is not intended to store more than one or two weeks of data, and that storing more would require a lot of memory, not only to process the data but also to run queries.
How could I estimate the memory I would need to run NetApp Harvest? Do you have an example Prometheus configuration that optimizes storage (like the aggregation I described)? How is this usually done in other installations? Do you usually use a third-party tool (Thanos or similar) to hold data older than two weeks?

On the other hand, I see that NABox, with the memory and disk requirements it lists, can keep a history of 2 years! How do you do it? Is it better to install the NABox OVA instead of deploying directly in Kubernetes?

Am I worrying too much about a hypothetical Prometheus monster when it should not be a problem? What is your opinion?

Best regards,
Oscar.

rahulguptajss · 2024-01-01T07:38:29Z

Collaborator

For long-term storage, consider integrating with Thanos, Cortex, or VictoriaMetrics. These systems are designed for durable, scalable, long-term storage of Prometheus metrics.

Prometheus Sizing reference is available here

VictoriaMetrics provides long-term storage capabilities and is designed to handle large amounts of data efficiently. It also supports downsampling, which reduces the amount of data stored over time, and it is compatible with the Prometheus querying API, which means you can use Grafana or any other tool that works with Prometheus to visualize the data.

Thanos and Cortex, on the other hand, are not just long-term storage solutions. They provide a horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.

Thanos extends Prometheus for long-term storage while preserving Prometheus's query language. It's a good choice if you have multiple Prometheus instances and want to query them as if they were a single global instance. Thanos also supports downsampling and replication, which can improve query performance and reliability.

Cortex provides horizontally scalable, multi-tenant, long-term storage for Prometheus. It's designed to support very large metric volumes across multiple tenants. Cortex allows for scaling the metric ingestion rate and query load by adding more nodes to the cluster. It's a good choice if you need to support a large number of users or teams, each with their own Prometheus-style API.

NABox is a virtual appliance which facilitates the deployment of Harvest. The choice of deployment method ultimately depends on your personal preference. If you're more comfortable with Kubernetes, you can certainly opt for that. Some of our customers have found it necessary to expand their disk space in order to accommodate two years' worth of historical data in NABox.

You can refer to #619 and #863 for similar questions.

6 replies

rahulguptajss Jan 4, 2024
Collaborator

Happy New Year @oscarjim89

Determining the appropriate amount of memory for your Prometheus instance depends on several factors. These include the number of targets you're monitoring, the number of metrics each target exposes, the scrape interval, and the retention period. In recent versions of Prometheus, significant improvements in memory usage have been made, as mentioned in an article here.

Prometheus is not designed for long-term storage, but the amount of data you want to collect from clusters also plays a role. The memory requirement largely depends on the number of objects, such as volumes and qtrees. For initial guidance on sizing your Prometheus instance, you might find this article here useful.

Don't put too much emphasis on NABox's ability to handle 2 years of data. We've had customers who had to increase memory in NABox due to a different set of loads. You might find a comparison between VictoriaMetrics and Prometheus interesting, as discussed in this article on Medium.

oscarjim89 Jan 9, 2024
Author

Thanks for your response!
So, after reading a little more about Prometheus, and just to understand the NABox implementation: is downsampling configured? How often are metrics saved? What is the retention? Are the frequency and retention the same for performance metrics as for the rest?

rahulguptajss Jan 10, 2024
Collaborator

NABox uses Prometheus, which does not natively support downsampling. The scrape interval is set to 1 minute, meaning that Prometheus stores these samples every minute. In NABox, the retention period for all ONTAP performance or configuration metrics is set to 2 years.

For more information, please refer to the NABox FAQ.

For details on loose scale numbers, visit the NABox Documentation on Compute Resources.

oscarjim89 Jan 10, 2024
Author

Thank you! Just to confirm...

So, if I understand correctly, if my exporter produces 42 MB of data per scrape, the total will be:

42 MB * 60 minutes (1 hour) * 24 hours (1 day) * 730 days (2 years) = 44,150,400 MB ≈ 43,115 GB (at 1024 MB per GB)

That means that with NABox I will certainly have to increase my storage and memory, and I will have to reduce data retention.
And with Kubernetes containerization I could do downsampling, but I would have to use a third-party integration.

Are my deductions and calculations correct?

rahulguptajss Jan 10, 2024
Collaborator

To estimate the storage requirements for your Prometheus setup, use the formula:

needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample

This formula is based on the guidelines provided in the Harvest documentation, a StackExchange post, and here.

The ingested_samples_per_second can be calculated using the following PromQL query:

rate(prometheus_tsdb_head_samples_appended_total[2h])

This query gives the rate of samples ingested per second over the last 2 hours.

The bytes_per_sample can be calculated using the following PromQL query:

rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[1d]) / rate(prometheus_tsdb_compaction_chunk_samples_sum[1d])

This query gives the average bytes per sample over the last day.
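
As a rough illustration, the two queries above could be evaluated from a script through the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The sketch below uses only the Python standard library. The server address and the helper names are assumptions for illustration; only the two PromQL expressions come from this thread.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical address; replace with your own Prometheus server.
PROMETHEUS_URL = "http://localhost:9090"

SAMPLES_PER_SECOND = "rate(prometheus_tsdb_head_samples_appended_total[2h])"
BYTES_PER_SAMPLE = (
    "rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[1d])"
    " / rate(prometheus_tsdb_compaction_chunk_samples_sum[1d])"
)


def build_query_url(base_url, promql):
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def first_value(response_body):
    """Extract the first sample value from an instant-query JSON response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    results = payload["data"]["result"]
    if not results:
        raise ValueError("query returned no series")
    # Each vector entry looks like {"metric": {...}, "value": [timestamp, "number"]}.
    return float(results[0]["value"][1])


def run_query(promql, base_url=PROMETHEUS_URL):
    """Run one instant query against Prometheus and return its first value."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return first_value(resp.read())
```

Calling `run_query(SAMPLES_PER_SECOND)` and `run_query(BYTES_PER_SAMPLE)` against a reachable server would return the two inputs for the disk-space formula. If several Prometheus instances report these metrics, the query returns one series per instance and this sketch reads only the first.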

For example, if you have an ingestion rate of 4000 samples per second, each sample is 1 byte, and you want to retain data for 2 years (approximately 63,072,000 seconds), the needed disk space would be:

needed_disk_space = 63,072,000 seconds * 4000 samples/second * 1 byte/sample = 252,288,000,000 bytes

Converting this to GiB, you get:

needed_disk_space = 252,288,000,000 bytes / (1024 * 1024 * 1024) = ~235 GiB

So, you would need approximately 235 GiB of storage to hold 2 years of data under these conditions.

⚠️ Note: These are rough calculations and do not take into account any optimizations that Prometheus might perform, such as deduplication or compression. The actual storage requirements may be less than these estimates.
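
The same arithmetic can be written as a small Python sketch, so the example figures can be swapped for values measured on your own Prometheus. The function and constant names are hypothetical; the inputs are the example values used above (4000 samples per second, 1 byte per sample, 2 years of retention).

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000; ignores leap years


def needed_disk_space_bytes(retention_seconds, samples_per_second, bytes_per_sample):
    """Rough disk estimate: retention time * ingestion rate * sample size."""
    return retention_seconds * samples_per_second * bytes_per_sample


def to_gib(num_bytes):
    """Convert bytes to GiB (1 GiB = 1024**3 bytes)."""
    return num_bytes / 1024**3


if __name__ == "__main__":
    retention = 2 * SECONDS_PER_YEAR  # 63,072,000 seconds
    estimate = needed_disk_space_bytes(retention, 4000, 1)
    print(f"{estimate:,} bytes = {to_gib(estimate):.1f} GiB")
```

Because the result scales linearly with each input, doubling the bytes per sample or the ingestion rate doubles the estimate.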
