Conversation

Member

@PeteGillinElastic PeteGillinElastic commented Mar 12, 2025

This uses the recently-added ExponentiallyWeightedMovingRate class to calculate a write load which favours more recent load, and includes this alongside the existing unweighted all-time write load in IndexingStats.Stats.

As of this change, the new load metric is not used anywhere, although it can be retrieved with the index stats or node stats APIs.
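For readers unfamiliar with the class, the core idea of an exponentially weighted moving rate can be sketched as below. This is an illustrative simplification with assumed names (EwmRateSketch, addIncrement, getRate); it is not the actual ExponentiallyWeightedMovingRate implementation.

```java
// Illustrative sketch of an exponentially weighted moving rate, with
// assumed names; not the actual Elasticsearch implementation.
public class EwmRateSketch {
    private final double lambda;    // decay constant: ln(2) / half-life
    private final long startTime;   // nanos; zero is a legal value
    private double weightedSum;     // exponentially decayed sum of increments
    private long lastTime;
    private boolean waitingForFirstIncrement = true;

    public EwmRateSketch(long halfLifeNanos, long startTime) {
        this.lambda = Math.log(2) / halfLifeNanos;
        this.startTime = startTime;
        this.lastTime = 0; // never read before the first increment
    }

    public synchronized void addIncrement(double increment, long timeNanos) {
        if (waitingForFirstIncrement) {
            waitingForFirstIncrement = false;
        } else {
            // decay the accumulated sum forward to the new timestamp
            weightedSum *= Math.exp(-lambda * (timeNanos - lastTime));
        }
        weightedSum += increment;
        lastTime = timeNanos;
    }

    /** Rate per nanosecond, favouring increments within the last half-life or so. */
    public synchronized double getRate(long currentTime) {
        if (waitingForFirstIncrement || currentTime <= startTime) {
            return 0.0;
        }
        double decayedSum = weightedSum * Math.exp(-lambda * (currentTime - lastTime));
        // total weight of the window [startTime, currentTime]
        double windowWeight = (1.0 - Math.exp(-lambda * (currentTime - startTime))) / lambda;
        return decayedSum / windowWeight;
    }
}
```

A steady stream of increments converges to the plain increments-per-nanosecond rate, while a change in load shows up in the result within roughly one half-life.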

@PeteGillinElastic PeteGillinElastic force-pushed the ES-10037-recent-load-metric branch 2 times, most recently from e76c01c to 43b8c6c Compare March 17, 2025 17:55
@PeteGillinElastic PeteGillinElastic changed the title Add recent load metric Calculate recent write load in indexing stats Mar 17, 2025
@PeteGillinElastic PeteGillinElastic force-pushed the ES-10037-recent-load-metric branch 4 times, most recently from d8896c6 to 1c1c43d Compare March 17, 2025 19:29
@PeteGillinElastic PeteGillinElastic added >non-issue :Data Management/Stats Statistics tracking and retrieval APIs labels Mar 18, 2025
@PeteGillinElastic PeteGillinElastic force-pushed the ES-10037-recent-load-metric branch from 1c1c43d to d47d3d8 Compare March 18, 2025 12:00
@PeteGillinElastic PeteGillinElastic marked this pull request as ready for review March 18, 2025 12:05
@PeteGillinElastic PeteGillinElastic requested a review from a team as a code owner March 18, 2025 12:05
  this.startTime = startTime;
- this.lastTime = 0; // after an increment, this must be positive, so a zero value indicates we're waiting for the first
+ this.lastTime = 0; // should never be used
+ this.waitingForFirstIncrement = true;
Member Author

We are tweaking this class so that it allows times to be zero. This is technically the correct thing to do, since it is allowed for System.nanoTime() to return zero (though it is vanishingly unlikely!). More importantly, it avoids having to tweak the times used in some tests.

As a result, we can no longer use lastTime == 0 as a marker that we're waiting for the first increment, so we add a boolean to do that job explicitly.

public double calculateRateSince(long currentTime, double currentRate, long oldTime, double oldRate) {
    if (oldTime < startTime) {
        oldTime = startTime;
    }
Member Author

We need this guard for a few edge cases.

*
* <p>If this {@link Stats} instance represents multiple shards, this is the average of that ratio for each shard, weighted by
* the elapsed time for each shard.
*/
Member Author

N.B. This PR is not changing this behaviour, it's just documenting and adding tests for the existing behaviour (since I found it surprising).
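The time-weighted averaging documented in that Javadoc can be illustrated with a standalone sketch. The names here are hypothetical, and the per-shard load is assumed to be indexing time divided by elapsed time:

```java
// Sketch: combining per-shard write loads into one value, weighted by
// each shard's elapsed time. Hypothetical names, not the Stats code.
public class WeightedLoad {
    /** loads[i] is indexingTime[i] / elapsedTime[i] for shard i. */
    public static double combine(double[] loads, long[] elapsedNanos) {
        double weightedSum = 0.0;
        long totalElapsed = 0L;
        for (int i = 0; i < loads.length; i++) {
            weightedSum += loads[i] * elapsedNanos[i];
            totalElapsed += elapsedNanos[i];
        }
        return totalElapsed == 0 ? 0.0 : weightedSum / totalElapsed;
    }
}
```

So a shard that has been active three times as long contributes three times the weight to the combined figure.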

fakeClock.advance();
return IndexingOperationListener.super.preIndex(shardId, operation);
}
}
Member Author

Previously, the FakeClock implementation in this test advanced the time on every other call to its getter. This worked only as long as the assumptions about when the getter is called held true, which made it pretty fragile. This PR removes this auto-advance behaviour, and instead explicitly advances the time at the correct point in the indexing operation using a listener, which is much more robust.
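The pattern described above, a fake clock that only moves when explicitly told to (for example from a listener's preIndex hook), can be boiled down to something like this minimal stand-in, which is not the actual test code:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of an explicitly-advanced fake clock: the getter never
// auto-advances, so tests control exactly when time moves.
public class FakeClockSketch {
    private final AtomicLong nanos = new AtomicLong();

    public long nanoTime() {
        return nanos.get(); // reading the clock has no side effects
    }

    public void advance(long byNanos) {
        nanos.addAndGet(byNanos);
    }
}
```

Because reading the clock is side-effect free, the test no longer depends on how many times production code happens to call the getter.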

@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Mar 18, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@masseyke masseyke self-requested a review March 18, 2025 14:35
if (index.origin().isRecovery() == false) {
    long took = result.getTook();
    totalStats.indexMetric.inc(took);
    totalStats.recentIndexMetric.addIncrement(took, relativeTimeInNanosSupplier.getAsLong());
Member

This makes me slightly nervous since it has a synchronized block and I think this code is called fairly often. The operations in the block are fast, but it's still a point of synchronization. I think we'll notice in nightly performance tests if it causes any problems though. And I don't see a good way around having a synchronized block in that method.

Contributor

Perhaps the state in ExponentiallyWeightedMovingRate could be moved into an object and swapped with an atomic ref? I suppose that would create lots of objects, but would remove the need for sync'ing
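The suggested alternative could look roughly like the following sketch: the state lives in an immutable record that is swapped with compareAndSet. The two-field State here is an assumption and may not match the real class's fields; it exists only to show the trade-off (no lock, one small allocation per update).

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the lock-free alternative: immutable state swapped via CAS.
// Hypothetical shape; not the real ExponentiallyWeightedMovingRate.
public class LockFreeRate {
    private record State(double weightedSum, long lastTime) {}

    private final double lambda; // decay constant
    private final AtomicReference<State> state = new AtomicReference<>(new State(0.0, 0L));

    public LockFreeRate(double lambda) {
        this.lambda = lambda;
    }

    public void addIncrement(double increment, long timeNanos) {
        while (true) {
            State s = state.get();
            double decayed = s.weightedSum() * Math.exp(-lambda * (timeNanos - s.lastTime()));
            State next = new State(decayed + increment, timeNanos);
            if (state.compareAndSet(s, next)) {
                return; // a lost race retries with the freshly read state
            }
        }
    }

    public double weightedSum(long timeNanos) {
        State s = state.get();
        return s.weightedSum() * Math.exp(-lambda * (timeNanos - s.lastTime()));
    }
}
```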

Member Author

In my benchmarking, the slowest I ever saw addIncrement take was about 11ns on my machine. So we'd need to be writing millions of documents per second on a node before this started getting significant. Does that seem likely?

I did consider something like what you suggest, Parker. I was concerned about the impact on GC, which my gut feeling says would be worse. Though I could be wrong.

I also considered some kind of batching, accumulating the values in a LongAdder before flushing them into the EWMR (it would be fine to increment for e.g. a whole second's worth of docs using the same timestamp) but that adds complexity which I'm hoping we don't need.
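The batching idea mentioned here could be sketched as below. The names and the flush cadence are assumptions; the point is that the hot path touches only a contention-friendly LongAdder, and the synchronized rate object sees one increment per flush.

```java
import java.util.concurrent.atomic.LongAdder;

// Sketch of batching: accumulate took-times in a LongAdder and flush
// them into the rate object periodically with a single timestamp.
public class BatchedIncrements {
    private final LongAdder pendingNanos = new LongAdder();

    /** Hot path: per-document recording, no lock. */
    public void record(long tookNanos) {
        pendingNanos.add(tookNanos);
    }

    /** Called e.g. once per second: drain everything recorded so far. */
    public long flush() {
        return pendingNanos.sumThenReset();
    }
}
```

The drained sum would then be passed to a single addIncrement call with the current time; as noted above, sharing one timestamp across a second's worth of documents is acceptable for this metric.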

logger.debug(
"Generating stats for an index shard with indexing time {} sec and active time {} sec giving unweighted write load {} "
+ "while the recency-weighted write load is {} using a half-life of {} sec",
totalIndexingTimeSinceShardStartedInNanos * 1.0e-9,
Member

You can use TimeValue.timeValueNanos(totalIndexingTimeSinceShardStartedInNanos) to avoid having numbers like 1.0e-9 in here. It also automatically uses seconds, hours, days, etc., depending on the size of the number, which is convenient: for example, it shows the current half-life as 10000d rather than 864000000 sec.

Member Author

That's a good idea. Done.

It occurred to me that I should probably also use the overload of logger.debug which takes a Supplier, to avoid doing the extra work unless debug logging is enabled. So I've done that, too.
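The Supplier overload pattern defers building the message until debug logging is known to be enabled. A minimal stand-in logger (not the real Logger API) makes the behaviour concrete:

```java
import java.util.function.Supplier;

// Sketch of lazy debug logging: the Supplier is only invoked when
// debug is enabled, so message formatting costs nothing otherwise.
// Minimal stand-in; real code uses the logging framework's overload.
public class LazyDebug {
    private final boolean debugEnabled;

    public LazyDebug(boolean debugEnabled) {
        this.debugEnabled = debugEnabled;
    }

    public void debug(Supplier<String> message) {
        if (debugEnabled) {
            System.out.println(message.get());
        }
    }
}
```

With debug disabled, the lambda capturing the expensive formatting is never executed, which is exactly the saving described above.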

*/
public class IndexingStatsSettings {

// TODO: Change this default to something sensible:
Member

Did you want to do this before merging this PR? Or is this just to be determined at some point before we start using this?

Member Author

I was planning this for the next PR.

  && totalIndexingTimeSinceShardStartedInNanos == that.totalIndexingTimeSinceShardStartedInNanos
- && totalActiveTimeInNanos == that.totalActiveTimeInNanos;
+ && totalActiveTimeInNanos == that.totalActiveTimeInNanos
+ && recentIndexingLoad == that.recentIndexingLoad;
Member
@masseyke Mar 18, 2025
This probably ought to be in hashCode() if it's going to be in equals() (although I seriously doubt either one is used anywhere other than maybe tests).

Member Author
@PeteGillinElastic Mar 18, 2025

It's allowed for hashCode() to only use a subset of the fields used in equals(). It does mean that there'd be a collision making a hash map very slightly less performant if it contained two entries keyed by instances of this which were identical in every field except this one... but I'm not worried about that! As you say, these really shouldn't be used anywhere outside of (maybe) tests.

I opted not to put it in hashCode() because I'm really queasy about calculating the hash code of the result of a floating point computation. I mean, I'm a bit queasy about doing == on it, but figured it might be the right thing to do for some tests. But I drew the line at hashCode().

I'm happy to change this if you feel strongly, e.g. because it might make readers stop and think more than just including it would. Since we don't think it's used, I don't feel strongly.

Member

I realized after I wrote the comment that it was harmless. Yeah, I'm also uneasy with it being in equals. But since I really don't think either one is ever used, I don't feel strongly about either of them.
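The contract point in this thread, that hashCode() may legally use only a subset of the fields equals() uses, can be demonstrated with a toy class (hypothetical fields, not the Stats class):

```java
import java.util.Objects;

// Toy class: equals() compares both fields, hashCode() deliberately
// uses only the long field. Equal objects still get equal hash codes,
// so the Object contract holds; unequal objects may merely collide.
public class SubsetHash {
    final long total;
    final double recentLoad; // floating-point result, kept out of hashCode()

    SubsetHash(long total, double recentLoad) {
        this.total = total;
        this.recentLoad = recentLoad;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof SubsetHash s && total == s.total && recentLoad == s.recentLoad;
    }

    @Override
    public int hashCode() {
        return Objects.hash(total); // subset of the equals() fields: allowed
    }
}
```

The cost of the collision is only a slightly longer bucket chain in a hash map keyed by such objects, which is the "harmless" outcome the reviewers settled on.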

Member
@masseyke left a comment

Looks good to me. I think some of the synchronization could be removed at a later point, but we can see if it matters for performance at all first.

@PeteGillinElastic PeteGillinElastic enabled auto-merge (squash) March 18, 2025 18:09
@PeteGillinElastic PeteGillinElastic merged commit 50e6894 into elastic:main Mar 18, 2025
17 checks passed
@PeteGillinElastic PeteGillinElastic deleted the ES-10037-recent-load-metric branch March 18, 2025 19:23
smalyshev pushed a commit to smalyshev/elasticsearch that referenced this pull request Mar 21, 2025
omricohenn pushed a commit to omricohenn/elasticsearch that referenced this pull request Mar 28, 2025
PeteGillinElastic added a commit to elastic/elasticsearch-specification that referenced this pull request Apr 3, 2025
PeteGillinElastic added a commit to elastic/elasticsearch-specification that referenced this pull request Apr 3, 2025
Labels

:Data Management/Stats Statistics tracking and retrieval APIs >non-issue Team:Data Management Meta label for data/management team v9.1.0


4 participants