elastic: add prirep and index tags to the elasticsearch.shards metric #22426
iliakur left a comment
minor stuff, left more comments in direct communication
    # Map p/r to primary/replica for better readability
    prirep = 'primary' if prirep_raw == 'p' else 'replica'

    key = (node, index, prirep)
Is it fine if index and prirep are None?
Good catch. I added a defensive skip in case `index` or `prirep` is None, and I now handle the case where a shard is in the `RELOCATING` state. Will clarify in the description.
Co-authored-by: Ilia Kurenkov <ilia.kurenkov@datadoghq.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d9bca67034
elastic/changelog.d/22426.changed
@@ -0,0 +1 @@
elastic: add `prirep` and `index` tags to the `elasticsearch.shards` metric
Use non-breaking changelog type for this additive feature
This entry is filed as .changed, which (per /workspace/integrations-core/AGENTS.md) is reserved for breaking/significant changes and triggers a major version bump; this PR adds an opt-in config/metric-tagging feature and does not introduce a breaking change, so keeping this file type will incorrectly force a major release for elastic. Please regenerate the changelog entry with an appropriate non-breaking type (for example added).
What does this PR do?
Adds the `detailed_shard_metrics` option to get more details on the `elasticsearch.shards` metric.
When enabled (and `cat_allocation_stats` is enabled too), adds `index` and `prirep` (primary/replica) tags to the `elasticsearch.shards` metric. This provides detailed shard placement visibility.
Changes:
`/_cat/shards`
Example metric series:
`elasticsearch.shards{node_name:node-1, index:250127-logs, prirep:primary}`
Motivation
We need to monitor shard placement distribution across Elasticsearch clusters to better detect:
The existing aggregated metric only showed total shards per node, making it impossible to identify which indices were causing imbalance or whether primaries and replicas were distributed properly. The new tags would help us confirm suspicions of hot spot nodes.
Having this level of detail in metrics helps with immediate troubleshooting (visualizing the current state, just like hitting the APIs directly) and also with RCA, by looking at historical data (which the APIs don't provide).
Usage
With these tags, operators can query:
Technical details
The `_cat/shards` API returns all shards in the cluster, along with their state. Here, we are counting the ones that are in two states: `STARTED` and `RELOCATING`.
- `STARTED` is straightforward, the shard is assigned to a node and we count it as such.
- `RELOCATING` returns a value for `node` such as: `"source -> ip uid target"`. In this case, we try parsing this string to count the shard towards the source node. When the relocation is done, it will be `STARTED`, with `node: target`, and the timeseries will be updated as they should.
The other states are:
- `INITIALIZING`: shard is recovering and will then move to `STARTED`
- `UNASSIGNED`: shard is not assigned to any node
We are skipping these two states in this proposal.
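To make the counting rules above concrete, here is a minimal sketch, not the code from this PR. It assumes the rows come from `_cat/shards` in JSON format as dicts with `index`, `prirep`, `state` and `node` keys; the function name `count_shards` and the constant `COUNTED_STATES` are hypothetical and only used for illustration.

```python
from collections import Counter

# States counted towards a node, per the description above (hypothetical constant).
COUNTED_STATES = {"STARTED", "RELOCATING"}


def count_shards(rows):
    """Count shards per (node, index, prirep) from _cat/shards-style rows.

    Assumes each row is a dict with 'index', 'prirep', 'state' and 'node' keys.
    """
    counts = Counter()
    for row in rows:
        state = row.get("state")
        if state not in COUNTED_STATES:
            # INITIALIZING and UNASSIGNED shards are skipped.
            continue

        node = row.get("node")
        index = row.get("index")
        prirep_raw = row.get("prirep")
        if not node or not index or not prirep_raw:
            # Defensive skip: never build a key with missing parts.
            continue

        if state == "RELOCATING":
            # node looks like "source -> ip uid target"; count towards the source.
            node = node.split(" -> ", 1)[0].strip()
            if not node:
                continue

        # Map p/r to primary/replica for better readability.
        prirep = "primary" if prirep_raw == "p" else "replica"
        counts[(node, index, prirep)] += 1
    return counts
```

Each `(node, index, prirep)` key in the returned counter would correspond to one `elasticsearch.shards` series. A relocating shard keeps counting towards its source node until it reports `STARTED` on the target.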