I am trying to compare different ways to structure metrics for monitoring large-scale infrastructure, and I need some suggestions / help. Please see the details below, which use servers as an example:
Scope:
over 9,000 servers
40 metrics / timeseries per server
I also need to collect the reference data / metadata below:
7 locations
2 environment categories - prod, non-prod
owned by approximately 500 owner_ids
each server can have up to 2 user-friendly aliases
Objective:
send metrics for all of these servers, across the different locations, to a single Cortex plant.
The owner of each server should be able to easily query or alert on the metrics for their servers using conditions on attributes other than the server name (e.g. owner_id (an ID representing the organization that owns one or more servers), location, aliases, etc.)
Approaches I am considering fall in two broad categories:
a) enrich the metrics with additional labels for reference data or metadata at the time of collection, or
b) collect raw metrics without any labels and collect reference data separately through an info series and join the two in Cortex
Approach 1
Metric - (40 possible values)
Labels - server name (9,000), aliases (max 2 per server), owner_id (max 1 per server, 500 overall), location (max 1 per server, 7 in total)
total timeseries - 40
Estimated cardinality per metric / timeseries - 18,000 (9,000 servers × 2 aliases)
At query or alerting time, a user can use an exact condition on the metric name and regexes for aliases, owner_id, and server name.
This approach looks the most intuitive to me, but I am not sure whether a cardinality of 18K per metric is good or bad.
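For illustration, a query under Approach 1 might look like the sketch below. The metric and label names (server_cpu_usage, alias, the example owner_id and alias values) are hypothetical, not taken from our actual setup:

```promql
# All CPU series for one owner's servers in one location, filtered
# purely by labels baked into the series at collection time.
server_cpu_usage{owner_id="org-123", location="us-east"}

# Alerting on a subset of the owner's servers by alias, via a regex matcher:
server_cpu_usage{owner_id="org-123", alias=~"payments-.*"} > 0.9
```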
Approach 2
Metric - server_ (500 possible values, one per owner_id)
Labels - server name (max 100 per owner_id, 9,000 in total), metric name (40), aliases (max 2 per server), location (max 1 per server, 7 in total)
total timeseries - 500
Cardinality per metric / timeseries - 100 servers × 40 metric names × 2 aliases = 8,000
At query or alerting time, we would need a regex on the metric name, since it carries an owner_id and most of our users have multiple owner_ids. We may still need regexes for aliases and server names, along with an exact condition on the metric-name label.
This approach looks less intuitive to me for end users because the actual metric name is captured as a label, but it does cut the per-metric cardinality by more than half.
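A sketch of the equivalent query under Approach 2, assuming (hypothetically) that the truncated metric name follows a pattern like server_<owner_id> and that the per-metric label is called metric_name:

```promql
# The owner id is embedded in the metric name, so a user with several
# owner_ids needs a regex on __name__, combined with an exact match on
# the metric-name label.
{__name__=~"server_(org123|org456)", metric_name="cpu_usage"}
```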
Approach 3
Metric - server name (9,000 possible values)
Labels - owner_id (max 1 per server), metric name (40), aliases (max 2 per server), location (max 1 per server, 7 in total)
total timeseries - 9,000
cardinality per metric / timeseries - 80 (40 metric names × 2 aliases)
This approach looks the least intuitive to me for end users, but it also has the lowest cardinality.
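Under Approach 3, the metric name varies per server, so owner scoping moves entirely into the label matchers (again, metric_name and the example values are hypothetical):

```promql
# One metric per server: leave __name__ unconstrained and select on
# labels only, which is valid PromQL as long as at least one matcher
# is non-empty.
{owner_id="org-123", metric_name="cpu_usage"}
```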
Approach 4
Metric name 1 - (40 possible values)
Labels - server name
Info series - owner_id
Labels - server name, location, aliases
Each user will need to join the two in order to view or alert on their metrics. Also, where there is a one-to-many relationship between a server name and an owner_id, we have seen the joins sometimes fail.
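The join under Approach 4 is typically written with group_left. In the sketch below, the names server_cpu_usage and server_info, and the assumption that the info series has value 1 and carries the metadata labels, are all hypothetical:

```promql
# Enrich the raw metric with owner_id and location from the info
# series, then filter to one owner's servers.
server_cpu_usage
  * on (server) group_left (owner_id, location)
  server_info{owner_id="org-123"}
```

One design caveat: group_left requires exactly one right-hand series per match-group value (here, per server). If the info series is duplicated for a server — for example one series per alias, or one per owner_id when a server maps to several — the query fails with a duplicate-match error, which may be the join failure described above.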
My questions:
Am I calculating cardinality correctly?
Which approach is better?
How do I determine the cardinality level at which a Cortex plant starts to have problems?
Is it generally better to have more timeseries with lower cardinality than fewer timeseries with higher cardinality?
I understand that with Approach 4 we will likely have lower storage and ingestion overhead, but wouldn't the frequent joins put extra load on the queriers?
This discussion was converted from issue #7034 on September 26, 2025 17:06.
Thanks for your attention and help in advance!