elastic: add prirep and index tags to the elasticsearch.shards metric #22426

Open
VJean wants to merge 14 commits into master from jean.vintache/detailed-shards-metric

@VJean VJean commented Jan 27, 2026

What does this PR do?

Adds the detailed_shard_metrics option to get more details on the elasticsearch.shards metric.

When this option is enabled together with cat_allocation_stats, the check adds index and prirep (primary/replica) tags to the elasticsearch.shards metric, which gives per-index and per-role visibility into shard placement.

Changes:

  • Collect additional data from /_cat/shards
  • Update tests to verify that the new tags are present

Example metric series: elasticsearch.shards{node_name:node-1, index:250127-logs, prirep:primary}

Motivation

We need to monitor shard placement distribution across Elasticsearch clusters to better detect:

  • Shard imbalance between nodes
  • Uneven primary/replica distribution
  • Per-index shard skew

The existing aggregated metric only shows the total number of shards per node, which makes it impossible to identify which indices cause an imbalance or whether primaries and replicas are distributed evenly. The new tags would help us confirm suspicions of hot-spot nodes.

Having this level of detail in metrics helps with immediate troubleshooting (it shows the current state, just as querying the APIs directly would) and with RCA, where historical data is needed and the APIs provide none.

Usage

With these tags, operators can query:

  • sum:elasticsearch.shards{prirep:primary} by {node_name} - primary shard distribution
  • sum:elasticsearch.shards{index:my-index} by {node_name} - specific index placement
  • Total skew analysis across the cluster
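
As a rough illustration of the skew analysis mentioned above, the per-node totals returned by such a query could be reduced to a single number. The sketch below is only an assumption about how skew might be defined (busiest node divided by the mean); this PR does not define a skew formula, and `shard_skew` is a hypothetical helper name.

```python
from statistics import mean
from typing import Mapping


def shard_skew(shards_per_node: Mapping[str, int]) -> float:
    """Return max/mean of per-node shard counts (assumed skew definition).

    A value of 1.0 means every node holds the same number of shards;
    larger values mean the busiest node holds more than its fair share.
    """
    if not shards_per_node:
        return 0.0
    average = mean(shards_per_node.values())
    if average == 0:
        return 0.0
    return max(shards_per_node.values()) / average


# Hypothetical per-node totals, e.g. from
# sum:elasticsearch.shards{prirep:primary} by {node_name}
print(shard_skew({"node-1": 6, "node-2": 3, "node-3": 3}))
```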

Technical details

The _cat/shards API returns all shards in the cluster, along with their state. This check counts only the shards that are in one of two states: STARTED and RELOCATING.

STARTED is straightforward: the shard is assigned to a node and is counted towards that node.

For a RELOCATING shard, the node field has the form "source -> ip uid target". In this case, we parse the string and count the shard towards the source node. Once the relocation completes, the shard is reported as STARTED with node set to the target, and the timeseries update accordingly.
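
A minimal sketch of that parsing, assuming the source node name always appears before the arrow. `source_node` is a hypothetical helper name used for illustration and is not necessarily what the check implements.

```python
from typing import Optional


def source_node(node_field: Optional[str]) -> Optional[str]:
    """Return the node a shard should be counted towards.

    For a STARTED shard the field is just the node name. For a
    RELOCATING shard it looks like "source -> ip uid target", so
    everything before the arrow is kept.
    """
    if not node_field:
        return None
    source, _, _ = node_field.partition("->")
    source = source.strip()
    return source or None


# Hypothetical values for the node column
print(source_node("node-1"))
print(source_node("node-1 -> 10.0.0.2 abc123 node-2"))
```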

The other states are:

  • INITIALIZING: shard is recovering and will then move to STARTED
  • UNASSIGNED: shard is not assigned to any node

We are skipping these two states in this proposal.
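
Putting these rules together, the counting step might look roughly like the sketch below. It assumes rows shaped like the JSON output of _cat/shards (keys index, prirep, state, node). `count_shards` and `COUNTED_STATES` are illustrative names, and the implementation in this PR may differ.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple

# Only these states contribute to the metric; INITIALIZING and
# UNASSIGNED shards are skipped.
COUNTED_STATES = {"STARTED", "RELOCATING"}

ShardKey = Tuple[str, str, str]  # (node, index, prirep)


def count_shards(rows: Iterable[Dict[str, Optional[str]]]) -> Counter:
    """Count shards per (node, index, prirep) from _cat/shards-like rows."""
    counts: Counter = Counter()
    for row in rows:
        if row.get("state") not in COUNTED_STATES:
            continue
        # A RELOCATING shard reports "source -> ip uid target";
        # count it towards the source node.
        node = (row.get("node") or "").partition("->")[0].strip()
        index = row.get("index")
        prirep_raw = row.get("prirep")
        # Defensive skip: without all three values there is no usable tag set.
        if not node or index is None or prirep_raw is None:
            continue
        prirep = "primary" if prirep_raw == "p" else "replica"
        counts[(node, index, prirep)] += 1
    return counts
```

Each key of the returned counter corresponds to one series of elasticsearch.shards, with the count as its value.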

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

@VJean VJean force-pushed the jean.vintache/detailed-shards-metric branch from bf969f7 to fc8aa6b Compare February 4, 2026 10:45

github-actions bot commented Feb 4, 2026

⚠️ Major version bump
The changelog type changed or removed was used in this Pull Request, so the next release will bump major version. Please make sure this is a breaking change, or use the fixed or added type instead.


codecov bot commented Feb 4, 2026

Codecov Report

❌ Patch coverage is 75.00000% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.04%. Comparing base (bbfe5e9) to head (93de93f).
⚠️ Report is 117 commits behind head on master.

@iliakur iliakur left a comment

minor stuff, left more comments in direct communication

# Map p/r to primary/replica for better readability
prirep = 'primary' if prirep_raw == 'p' else 'replica'

key = (node, index, prirep)

Is it fine if index and prirep are None?

Good catch. I added a defensive skip for the case where index or prirep is None, and I now handle the case where a shard is in the RELOCATING state. I will clarify this in the description.

VJean and others added 2 commits February 11, 2026 15:34
Co-authored-by: Ilia Kurenkov <ilia.kurenkov@datadoghq.com>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d9bca67034

@@ -0,0 +1 @@
elastic: add `prirep` and `index` tags to the `elasticsearch.shards` metric

P1: Use a non-breaking changelog type for this additive feature

This entry is filed as .changed, which (per /workspace/integrations-core/AGENTS.md) is reserved for breaking/significant changes and triggers a major version bump; this PR adds an opt-in config/metric-tagging feature and does not introduce a breaking change, so keeping this file type will incorrectly force a major release for elastic. Please regenerate the changelog entry with an appropriate non-breaking type (for example added).
