Conversation

Member

@PeteGillinElastic PeteGillinElastic commented Mar 12, 2025

This uses the recently-added ExponentiallyWeightedMovingRate class to calculate a write load which favours more recent load, and includes this alongside the existing unweighted all-time write load in IndexingStats.Stats.

As of this change, the new load metric is not used anywhere, although it can be retrieved with the index stats or node stats APIs.
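For readers unfamiliar with the class, the core idea of an exponentially weighted moving rate can be sketched as below. This is an illustrative simplification with assumed names (EwmRateSketch, addIncrement, getRate); it is not the actual ExponentiallyWeightedMovingRate implementation.

```java
// Illustrative sketch of an exponentially weighted moving rate, with
// assumed names; not the actual Elasticsearch implementation.
public class EwmRateSketch {
    private final double lambda;    // decay constant: ln(2) / half-life
    private final long startTime;   // nanos; zero is a legal value
    private double weightedSum;     // exponentially decayed sum of increments
    private long lastTime;
    private boolean waitingForFirstIncrement = true;

    public EwmRateSketch(long halfLifeNanos, long startTime) {
        this.lambda = Math.log(2) / halfLifeNanos;
        this.startTime = startTime;
        this.lastTime = 0; // never read before the first increment
    }

    public synchronized void addIncrement(double increment, long timeNanos) {
        if (waitingForFirstIncrement) {
            waitingForFirstIncrement = false;
        } else {
            // decay the accumulated sum forward to the new timestamp
            weightedSum *= Math.exp(-lambda * (timeNanos - lastTime));
        }
        weightedSum += increment;
        lastTime = timeNanos;
    }

    /** Rate per nanosecond, favouring increments within the last half-life or so. */
    public synchronized double getRate(long currentTime) {
        if (waitingForFirstIncrement || currentTime <= startTime) {
            return 0.0;
        }
        double decayedSum = weightedSum * Math.exp(-lambda * (currentTime - lastTime));
        // total weight of the window [startTime, currentTime]
        double windowWeight = (1.0 - Math.exp(-lambda * (currentTime - startTime))) / lambda;
        return decayedSum / windowWeight;
    }
}
```

A steady stream of increments converges to the plain increments-per-nanosecond rate, while a change in load shows up in the result within roughly one half-life.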

@PeteGillinElastic PeteGillinElastic force-pushed the ES-10037-recent-load-metric branch 2 times, most recently from e76c01c to 43b8c6c Compare March 17, 2025 17:55
@PeteGillinElastic PeteGillinElastic changed the title Add recent load metric Calculate recent write load in indexing stats Mar 17, 2025
@PeteGillinElastic PeteGillinElastic force-pushed the ES-10037-recent-load-metric branch 4 times, most recently from d8896c6 to 1c1c43d Compare March 17, 2025 19:29
@PeteGillinElastic PeteGillinElastic added >non-issue :Data Management/Stats Statistics tracking and retrieval APIs labels Mar 18, 2025
@PeteGillinElastic PeteGillinElastic force-pushed the ES-10037-recent-load-metric branch from 1c1c43d to d47d3d8 Compare March 18, 2025 12:00
@PeteGillinElastic PeteGillinElastic marked this pull request as ready for review March 18, 2025 12:05
@PeteGillinElastic PeteGillinElastic requested a review from a team as a code owner March 18, 2025 12:05
  this.startTime = startTime;
- this.lastTime = 0; // after an increment, this must be positive, so a zero value indicates we're waiting for the first
+ this.lastTime = 0; // should never be used
+ this.waitingForFirstIncrement = true;
Member Author

We are tweaking this class so that it allows times to be zero. This is technically the correct thing to do, since it is allowed for System.nanoTime() to return zero (though it is vanishingly unlikely!). More importantly, it avoids having to tweak the times used in some tests.

As a result, we can no longer use lastTime == 0 as a marker that we're waiting for the first increment, so we add a boolean to do that job explicitly.

public double calculateRateSince(long currentTime, double currentRate, long oldTime, double oldRate) {
    if (oldTime < startTime) {
        oldTime = startTime;
    }
Member Author

We need this guard for a few edge cases.

*
* <p>If this {@link Stats} instance represents multiple shards, this is the average of that ratio for each shard, weighted by
* the elapsed time for each shard.
*/
Member Author

N.B. This PR is not changing this behaviour, it's just documenting and adding tests for the existing behaviour (since I found it surprising).
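The time-weighted averaging documented in that Javadoc can be illustrated with a standalone sketch. The names here are hypothetical, and the per-shard load is assumed to be indexing time divided by elapsed time:

```java
// Sketch: combining per-shard write loads into one value, weighted by
// each shard's elapsed time. Hypothetical names, not the Stats code.
public class WeightedLoad {
    /** loads[i] is indexingTime[i] / elapsedTime[i] for shard i. */
    public static double combine(double[] loads, long[] elapsedNanos) {
        double weightedSum = 0.0;
        long totalElapsed = 0L;
        for (int i = 0; i < loads.length; i++) {
            weightedSum += loads[i] * elapsedNanos[i];
            totalElapsed += elapsedNanos[i];
        }
        return totalElapsed == 0 ? 0.0 : weightedSum / totalElapsed;
    }
}
```

So a shard that has been active three times as long contributes three times the weight to the combined figure.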

fakeClock.advance();
return IndexingOperationListener.super.preIndex(shardId, operation);
}
}
Member Author

Previously, the FakeClock implementation in this test advanced the time on every other call to its getter. This worked only as long as the assumptions about when the getter is called held true, which made it pretty fragile. This PR removes this auto-advance behaviour, and instead explicitly advances the time at the correct point in the indexing operation using a listener, which is much more robust.
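The pattern described above, a fake clock that only moves when explicitly told to (for example from a listener's preIndex hook), can be boiled down to something like this minimal stand-in, which is not the actual test code:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of an explicitly-advanced fake clock: the getter never
// auto-advances, so tests control exactly when time moves.
public class FakeClockSketch {
    private final AtomicLong nanos = new AtomicLong();

    public long nanoTime() {
        return nanos.get(); // reading the clock has no side effects
    }

    public void advance(long byNanos) {
        nanos.addAndGet(byNanos);
    }
}
```

Because reading the clock is side-effect free, the test no longer depends on how many times production code happens to call the getter.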

@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Mar 18, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@masseyke masseyke self-requested a review March 18, 2025 14:35
if (index.origin().isRecovery() == false) {
    long took = result.getTook();
    totalStats.indexMetric.inc(took);
    totalStats.recentIndexMetric.addIncrement(took, relativeTimeInNanosSupplier.getAsLong());
Member

This makes me slightly nervous since it has a synchronized block and I think this code is called fairly often. The operations in the block are fast, but it's still a point of synchronization. I think we'll notice in nightly performance tests if it causes any problems though. And I don't see a good way around having a synchronized block in that method.

Contributor

Perhaps the state in ExponentiallyWeightedMovingRate could be moved into an object and swapped with an atomic ref? I suppose that would create lots of objects, but would remove the need for sync'ing
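The suggested alternative could look roughly like the following sketch: the state lives in an immutable record that is swapped with compareAndSet. The two-field State here is an assumption and may not match the real class's fields; it exists only to show the trade-off (no lock, one small allocation per update).

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the lock-free alternative: immutable state swapped via CAS.
// Hypothetical shape; not the real ExponentiallyWeightedMovingRate.
public class LockFreeRate {
    private record State(double weightedSum, long lastTime) {}

    private final double lambda; // decay constant
    private final AtomicReference<State> state = new AtomicReference<>(new State(0.0, 0L));

    public LockFreeRate(double lambda) {
        this.lambda = lambda;
    }

    public void addIncrement(double increment, long timeNanos) {
        while (true) {
            State s = state.get();
            double decayed = s.weightedSum() * Math.exp(-lambda * (timeNanos - s.lastTime()));
            State next = new State(decayed + increment, timeNanos);
            if (state.compareAndSet(s, next)) {
                return; // a lost race retries with the freshly read state
            }
        }
    }

    public double weightedSum(long timeNanos) {
        State s = state.get();
        return s.weightedSum() * Math.exp(-lambda * (timeNanos - s.lastTime()));
    }
}
```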

Member Author

In my benchmarking, the slowest I ever saw addIncrement take was about 11ns on my machine. So we'd need to be writing millions of documents per second on a node before this started getting significant. Does that seem likely?

I did consider something like what you suggest, Parker. I was concerned about the impact on GC, which my gut feeling says would be worse. Though I could be wrong.

I also considered some kind of batching, accumulating the values in a LongAdder before flushing them into the EWMR (it would be fine to increment for e.g. a whole second's worth of docs using the same timestamp) but that adds complexity which I'm hoping we don't need.
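The batching idea mentioned here could be sketched as below. The names and the flush cadence are assumptions; the point is that the hot path touches only a contention-friendly LongAdder, and the synchronized rate object sees one increment per flush.

```java
import java.util.concurrent.atomic.LongAdder;

// Sketch of batching: accumulate took-times in a LongAdder and flush
// them into the rate object periodically with a single timestamp.
public class BatchedIncrements {
    private final LongAdder pendingNanos = new LongAdder();

    /** Hot path: per-document recording, no lock. */
    public void record(long tookNanos) {
        pendingNanos.add(tookNanos);
    }

    /** Called e.g. once per second: drain everything recorded so far. */
    public long flush() {
        return pendingNanos.sumThenReset();
    }
}
```

The drained sum would then be passed to a single addIncrement call with the current time; as noted above, sharing one timestamp across a second's worth of documents is acceptable for this metric.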

logger.debug(
"Generating stats for an index shard with indexing time {} sec and active time {} sec giving unweighted write load {} "
+ "while the recency-weighted write load is {} using a half-life of {} sec",
totalIndexingTimeSinceShardStartedInNanos * 1.0e-9,
Member

You can use TimeValue.timeValueNanos(totalIndexingTimeSinceShardStartedInNanos) to avoid having numbers like 1.0e-9 in here. It also automatically uses seconds, hours, days, etc., depending on the size of the number, which is convenient: for example, it shows the current half-life as 10000d rather than 864000000 sec.

Member Author

That's a good idea. Done.

It occurred to me that I should probably also use the overload of logger.debug which takes a Supplier, to avoid doing the extra work unless debug logging is enabled. So I've done that, too.
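The Supplier overload pattern defers building the message until debug logging is known to be enabled. A minimal stand-in logger (not the real Logger API) makes the behaviour concrete:

```java
import java.util.function.Supplier;

// Sketch of lazy debug logging: the Supplier is only invoked when
// debug is enabled, so message formatting costs nothing otherwise.
// Minimal stand-in; real code uses the logging framework's overload.
public class LazyDebug {
    private final boolean debugEnabled;

    public LazyDebug(boolean debugEnabled) {
        this.debugEnabled = debugEnabled;
    }

    public void debug(Supplier<String> message) {
        if (debugEnabled) {
            System.out.println(message.get());
        }
    }
}
```

With debug disabled, the lambda capturing the expensive formatting is never executed, which is exactly the saving described above.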

*/
public class IndexingStatsSettings {

// TODO: Change this default to something sensible:
Member

Did you want to do this before merging this PR? Or is this just to be determined at some point before we start using this?

Member Author

I was planning this for the next PR.

  && totalIndexingTimeSinceShardStartedInNanos == that.totalIndexingTimeSinceShardStartedInNanos
- && totalActiveTimeInNanos == that.totalActiveTimeInNanos;
+ && totalActiveTimeInNanos == that.totalActiveTimeInNanos
+ && recentIndexingLoad == that.recentIndexingLoad;
Member
@masseyke Mar 18, 2025
This probably ought to be in hashCode() if it's going to be in equals() (although I seriously doubt either one is used anywhere other than maybe tests).

Member Author
@PeteGillinElastic Mar 18, 2025

It's allowed for hashCode() to only use a subset of the fields used in equals(). It does mean that there'd be a collision making a hash map very slightly less performant if it contained two entries keyed by instances of this which were identical in every field except this one... but I'm not worried about that! As you say, these really shouldn't be used anywhere outside of (maybe) tests.

I opted not to put it in hashCode() because I'm really queasy about calculating the hash code of the result of a floating point computation. I mean, I'm a bit queasy about doing == on it, but figured it might be the right thing to do for some tests. But I drew the line at hashCode().

I'm happy to change this if you feel strongly, e.g. because it might make readers stop and think more than just including it would. Since we don't think it's used, I don't feel strongly.

Member

I realized after I wrote the comment that it was harmless. Yeah, I'm also uneasy with it being in equals. But since I really don't think either one is ever used, I don't feel strongly about either of them.
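The contract point in this thread, that hashCode() may legally use only a subset of the fields equals() uses, can be demonstrated with a toy class (hypothetical fields, not the Stats class):

```java
import java.util.Objects;

// Toy class: equals() compares both fields, hashCode() deliberately
// uses only the long field. Equal objects still get equal hash codes,
// so the Object contract holds; unequal objects may merely collide.
public class SubsetHash {
    final long total;
    final double recentLoad; // floating-point result, kept out of hashCode()

    SubsetHash(long total, double recentLoad) {
        this.total = total;
        this.recentLoad = recentLoad;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof SubsetHash s && total == s.total && recentLoad == s.recentLoad;
    }

    @Override
    public int hashCode() {
        return Objects.hash(total); // subset of the equals() fields: allowed
    }
}
```

The cost of the collision is only a slightly longer bucket chain in a hash map keyed by such objects, which is the "harmless" outcome the reviewers settled on.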

Member
@masseyke left a comment

Looks good to me. I think some of the synchronization could be removed at a later point, but we can see if it matters for performance at all first.

@PeteGillinElastic PeteGillinElastic enabled auto-merge (squash) March 18, 2025 18:09
@PeteGillinElastic PeteGillinElastic merged commit 50e6894 into elastic:main Mar 18, 2025
17 checks passed
@PeteGillinElastic PeteGillinElastic deleted the ES-10037-recent-load-metric branch March 18, 2025 19:23
smalyshev pushed a commit to smalyshev/elasticsearch that referenced this pull request Mar 21, 2025
omricohenn pushed a commit to omricohenn/elasticsearch that referenced this pull request Mar 28, 2025
PeteGillinElastic added a commit to elastic/elasticsearch-specification that referenced this pull request Apr 3, 2025
PeteGillinElastic added a commit to elastic/elasticsearch-specification that referenced this pull request Apr 3, 2025
Labels

:Data Management/Stats Statistics tracking and retrieval APIs >non-issue Team:Data Management Meta label for data/management team v9.1.0


4 participants