I am trying to compare different ways to structure metrics for monitoring large-scale infrastructure, and I need some suggestions / help. Please see the details below, which use servers as an example:
Scope:
over 9,000 servers
40 metrics / timeseries per server
I also need to collect the reference data / metadata below:
7 locations
2 environment categories - prod, non-prod
owned by approximately 500 owner_ids
each server can have up to 2 user-friendly aliases
Objective:
send metrics for all of these servers, across the different locations, to a single Cortex plant.
The owner of each server should be able to easily query or alert on the metrics for their servers using conditions on attributes other than the server name (e.g. owner_id (an ID representing the organization that owns one or more servers), location, aliases, etc.)
Approaches I am considering fall in two broad categories:
a) enrich the metrics with additional labels for reference data or metadata at the time of collection, or
b) collect raw metrics without any labels and collect reference data separately through an info series and join the two in Cortex
Approach 1
Metric - (40 possible values)
Labels - server name (9,000), aliases (max 2 per server), owner_id (max 1 per server, 500 overall), location (max 1 per server, 7 in total)
total timeseries - 40
Estimated cardinality per metric / timeseries - 18,000 (9,000 servers × 2 aliases)
At query or alerting time, a user can use an exact condition on the metric name and regexes for aliases, owner_id, and server name.
This approach looks the most intuitive to me, but I am not sure whether a cardinality of 18K per metric is good or bad.
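For illustration, a query under Approach 1 might look like the sketch below. The metric and label names (server_cpu_usage, alias, the example owner_id and alias values) are hypothetical, not taken from our actual setup:

```promql
# All CPU series for one owner's servers in one location, filtered
# purely by labels baked into the series at collection time.
server_cpu_usage{owner_id="org-123", location="us-east"}

# Alerting on a subset of the owner's servers by alias, via a regex matcher:
server_cpu_usage{owner_id="org-123", alias=~"payments-.*"} > 0.9
```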
Approach 2
Metric - server_ (500 possible values, one per owner_id)
Labels - server name (max 100 per owner_id, 9,000 in total), metric name (40), aliases (max 2 per server), location (max 1 per server, 7 in total)
total timeseries - 500
Cardinality per metric / timeseries - 100 servers × 40 metric names × 2 aliases = 8,000
At query or alerting time, we would need a regex on the metric name, since it carries an owner_id and most of our users have multiple owner_ids. We may still need regexes for aliases and server names, along with an exact condition on the metric-name label.
This approach looks less intuitive to me for end users because the actual metric name is captured as a label, but it does cut the per-metric cardinality by more than half.
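A sketch of the equivalent query under Approach 2, assuming (hypothetically) that the truncated metric name follows a pattern like server_<owner_id> and that the per-metric label is called metric_name:

```promql
# The owner id is embedded in the metric name, so a user with several
# owner_ids needs a regex on __name__, combined with an exact match on
# the metric-name label.
{__name__=~"server_(org123|org456)", metric_name="cpu_usage"}
```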
Approach 3
Metric - server name (9,000 possible values)
Labels - owner_id (max 1 per server), metric name (40), aliases (max 2 per server), location (max 1 per server, 7 in total)
total timeseries - 9,000
cardinality per metric / timeseries - 80 (40 metric names × 2 aliases)
This approach looks the least intuitive to me for end users, but it also has the lowest cardinality.
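Under Approach 3, the metric name varies per server, so owner scoping moves entirely into the label matchers (again, metric_name and the example values are hypothetical):

```promql
# One metric per server: leave __name__ unconstrained and select on
# labels only, which is valid PromQL as long as at least one matcher
# is non-empty.
{owner_id="org-123", metric_name="cpu_usage"}
```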
Approach 4
Metric name 1 - (40 possible values)
Labels - server name
Info series - owner_id
Labels - server name, location, aliases
Each user will need to join the two in order to view or alert on their metrics. Also, where there is a one-to-many relationship between a server name and an owner_id, we have seen the joins sometimes fail.
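The join under Approach 4 is typically written with group_left. In the sketch below, the names server_cpu_usage and server_info, and the assumption that the info series has value 1 and carries the metadata labels, are all hypothetical:

```promql
# Enrich the raw metric with owner_id and location from the info
# series, then filter to one owner's servers.
server_cpu_usage
  * on (server) group_left (owner_id, location)
  server_info{owner_id="org-123"}
```

One design caveat: group_left requires exactly one right-hand series per match-group value (here, per server). If the info series is duplicated for a server — for example one series per alias, or one per owner_id when a server maps to several — the query fails with a duplicate-match error, which may be the join failure described above.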
My questions:
Am I calculating cardinality correctly?
Which approach is better?
How do I determine the cardinality level at which a Cortex plant starts to have problems?
Is it generally better to have more timeseries with lower cardinality than fewer timeseries with higher cardinality?
I understand that with Approach 4 we will likely have lower storage and ingestion overhead, but wouldn't the frequent joins put extra load on the queriers?
This discussion was converted from issue #7034 on September 26, 2025 17:06.
Thanks for your attention and help in advance!