---
title: "Capacity Planning"
linkTitle: "Capacity Planning"
weight: 104
slug: capacity-planning
---

You will want to estimate how many nodes are required, how many of
each component to run, and how much storage space will be required.
In practice, these will vary greatly depending on the metrics being
sent to Cortex.

Some key parameters are:

 1. The number of active series. If you have Prometheus already you
    can query `prometheus_tsdb_head_series` to see this number.
 2. Sampling rate, e.g. a new sample for each series every minute
    (the default Prometheus [scrape_interval](https://prometheus.io/docs/prometheus/latest/configuration/configuration/)).
    Multiply this by the number of active series to get the
    total rate at which samples will arrive at Cortex. A sketch for
    measuring both of these follows this list.
 3. The rate at which series are added and removed. This can be very
    high if you monitor objects that come and go - for example if you run
    thousands of batch jobs lasting a minute or so and capture metrics
    with a unique ID for each one. [Read how to analyse this on
    Prometheus](https://www.robustperception.io/using-tsdb-analyze-to-investigate-churn-and-cardinality).
 4. How compressible the time-series data are. If a metric stays at
    the same value constantly, then Cortex can compress it very well, so
    12 hours of data sampled every 15 seconds would be around 2KB. On
    the other hand, if the value jumps around a lot, it might take 10KB.
    There are not currently any tools available to analyse this.
 5. How long you want to retain data for, e.g. 1 month or 2 years.

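If you already run Prometheus, the first three parameters can be read from
its own metrics. The sketch below is illustrative only: it uses Prometheus's
standard `/api/v1/query` HTTP API, the server address is a placeholder to
replace with your own, and the `prometheus_tsdb_head_samples_appended_total`
and `prometheus_tsdb_head_series_created_total` rates are used on the
assumption that they approximate your incoming sample rate and series churn.

```python
import requests

# Placeholder address - point this at your own Prometheus server.
PROMETHEUS_URL = "http://prometheus:9090"

def instant_query(expr):
    """Run a PromQL instant query and return the first value as a float."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": expr})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

# 1. Active series currently held in the TSDB head block.
active_series = instant_query("sum(prometheus_tsdb_head_series)")

# 2. Samples appended per second (a proxy for the rate at which samples
#    would arrive at Cortex over remote write).
samples_per_sec = instant_query(
    "sum(rate(prometheus_tsdb_head_samples_appended_total[5m]))"
)

# 3. New series created per second (a rough measure of churn).
series_churn_per_sec = instant_query(
    "sum(rate(prometheus_tsdb_head_series_created_total[5m]))"
)

print(f"active series:      {active_series:,.0f}")
print(f"samples per second: {samples_per_sec:,.0f}")
print(f"new series per sec: {series_churn_per_sec:,.2f}")
```
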
Other parameters which can become important if you have particularly
high values:

 6. Number of different series under one metric name. A sketch for
    measuring this follows this list.
 7. Number of labels per series.
 8. Rate and complexity of queries.

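For item 6, one way to see which metric names carry the most series is to
count series per `__name__`. Here is a small sketch against the same
placeholder Prometheus address as above; note that this query touches every
series in the head block, so it can be slow or expensive on large servers.

```python
import requests

PROMETHEUS_URL = "http://prometheus:9090"  # placeholder - use your own server

# Count series per metric name and keep the ten largest. This scans every
# series in the head block, so it may be slow on big servers.
QUERY = 'topk(10, count by (__name__) ({__name__=~".+"}))'

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY})
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    name = sample["metric"].get("__name__", "(none)")
    print(f"{name}: {int(float(sample['value'][1]))} series")
```
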
Now, some rules of thumb (a worked example follows this list):

 1. Each million series in an ingester takes 15GB of RAM. Total number
    of series in ingesters is number of active series times the
    replication factor. This is with the default of 12-hour chunks - RAM
    required will reduce if you set `-ingester.max-chunk-age` lower
    (trading off more back-end database IO).
 2. Each million series (including churn) consumes 15GB of chunk
    storage and 4GB of index, per day (so multiply by the retention
    period).
 3. Each 100,000 samples/sec arriving takes 1 CPU in distributors.
    Distributors don't need much RAM.

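Putting those rules of thumb together, here is a rough estimator. The
constants are only the heuristics listed above, and the inputs (active
series, sampling rate, replication factor, retention) are example
assumptions to replace with your own numbers; treat the output as an
order-of-magnitude starting point rather than a sizing guarantee.

```python
# Rough Cortex capacity estimate from the rules of thumb above.
# All constants are heuristics, not guarantees.

# --- Inputs: replace these example values with your own measurements. ---
active_series = 2_000_000          # e.g. from prometheus_tsdb_head_series
samples_per_sec = 2_000_000 / 60   # e.g. one sample per series per minute
replication_factor = 3             # Cortex default
retention_days = 30

# Rule 1: 15GB of ingester RAM per million series, times replication.
series_in_ingesters = active_series * replication_factor
ingester_ram_gb = series_in_ingesters / 1e6 * 15

# Rule 2: 15GB chunk storage + 4GB index per million series per day.
# If churn is high, use the total (including churned) series instead.
chunk_storage_gb = active_series / 1e6 * 15 * retention_days
index_storage_gb = active_series / 1e6 * 4 * retention_days

# Rule 3: one distributor CPU per 100,000 samples/sec.
distributor_cpus = samples_per_sec / 100_000

print(f"ingester RAM:     {ingester_ram_gb:,.0f} GB (across all ingesters)")
print(f"chunk storage:    {chunk_storage_gb:,.0f} GB over {retention_days} days")
print(f"index storage:    {index_storage_gb:,.0f} GB over {retention_days} days")
print(f"distributor CPUs: {distributor_cpus:,.1f}")
```

With the example inputs above (2 million active series, one sample per
series per minute, replication factor 3, 30-day retention) this works out
to roughly 90GB of ingester RAM, 900GB of chunk storage, 240GB of index,
and well under one distributor CPU.
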
If you turn on compression between distributors and ingesters (for
example to save on inter-zone bandwidth charges at AWS/GCP), they will use
significantly more CPU (approx 100% more for the distributor and 50% more
for the ingester).