Add tryWithEngineOrNull #132000
And deprecate old style getEngine/OrNull methods. Apply new functionality to several methods that do not need to wait for the engine being reset and can do with a null engine. These pertain typically to periodic operations that can skip a shard being reset and revisit it next time. Also return empty stats for a few stats in case the engine is being reset. These are stats that are already returned empty from a hollow engine. Relates ES-11457
Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)
…5-with-engine-null-reset
I'm a bit concerned about the assumptions we're making in this patch, namely that an engine reset would always go from InternalEngine -> Hollow, where returning empty stats or discarding a flush request is acceptable because they're essentially no-ops. That might no longer hold if we end up adopting resetEngine for other engine implementations.
@fcofdez are you suggesting we should forgo these changes for now? It might mean that some management threads or the Disk/Memory controller threads are temporarily stuck during massive resets, but I guess that's not too bad either (until we ultimately improve the situation). cc @tlrx for your opinion as well.
I have mixed feelings about this change. I have the impression that we're tackling different problems with the new withEngineOrNullIfBeingReset.
As far as I understand, we have 3 situations:
- stats or metrics that need to be reported while the engine is reset (ex: getWritingBytes(), flushStats(), indexingStats()), which are mostly already best-effort in terms of accuracy
- actions that can be skipped during reset (flushOnIdle), which can be discussed for each case
- actions that must be performed on a non-null engine (flush with waitIfOngoing=true, trimTranslog), which are the trickier ones to handle, especially since they are called on transport threads
I think we can craft something for the stats/metrics case (possible solutions could be to keep a copy of the stats for the time of the reset, or keep a reference on the engine-to-be-reset). For the skippable actions, I think something like you did in withEngineOrNullIfBeingReset can work (though I would call this tryWithEngineOrNull). For the non-skippable actions I think we need to find a solution for each case.
Happy to discuss this more
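To make the "skippable actions" case concrete, here is a minimal sketch of what a tryWithEngineOrNull-style helper could look like. It is not the code from this PR: the ShardSketch class, the Engine interface, and the use of a ReentrantReadWriteLock to guard resets are all assumptions made for illustration. The idea is only that callers who can tolerate a null engine try to take the lock without blocking, and fall back to null when a reset is in progress.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Function;

/** Hypothetical stand-in for a shard; not the actual IndexShard implementation. */
class ShardSketch {

    /** Hypothetical minimal engine interface, for illustration only. */
    interface Engine {
        long totalFlushTimeMillis();
    }

    // Assumption: a reset holds the write lock for its whole duration.
    private final ReentrantReadWriteLock engineResetLock = new ReentrantReadWriteLock();
    private Engine engine;

    ShardSketch(Engine engine) {
        this.engine = engine;
    }

    /**
     * Runs the operation with the current engine if the read lock can be taken
     * without blocking; otherwise runs it with null so the caller can skip or
     * return an empty result instead of waiting for the reset to finish.
     */
    <R> R tryWithEngineOrNull(Function<Engine, R> operation) {
        if (engineResetLock.readLock().tryLock()) {
            try {
                return operation.apply(engine);
            } finally {
                engineResetLock.readLock().unlock();
            }
        }
        return operation.apply(null);
    }

    /** Swaps the engine while holding the write lock, so concurrent readers see "being reset". */
    void resetEngine(Function<Engine, Engine> swap) {
        engineResetLock.writeLock().lock();
        try {
            engine = swap.apply(engine);
        } finally {
            engineResetLock.writeLock().unlock();
        }
    }

    /** Best-effort stat: reports 0 while the engine is being reset. */
    long totalFlushTimeMillis() {
        return tryWithEngineOrNull(e -> e != null ? e.totalFlushTimeMillis() : 0L);
    }
}
```

Under these assumptions a periodic caller never blocks on a long reset; it simply observes an empty value and revisits the shard next time. The non-skippable actions (third bullet) would not fit this helper and would still need a blocking variant or a per-case solution.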
Looks good, I left some comments.
server/src/main/java/org/elasticsearch/index/shard/IndexShard.java
    flushMetric.count(),
    periodicFlushMetric.count(),
    TimeUnit.NANOSECONDS.toMillis(flushMetric.sum()),
    engine != null ? engine.getTotalFlushTimeExcludingWaitingOnLockInMillis() : 0L
Mostly a note for myself, we could extract some stats at the shard level and pass them down to the engine instances so that they "survive" resets
True, I was also thinking of that, although there may be some concurrency challenges to handle. Also, since we will soon shorten the reset period drastically, I am wary of putting any more effort into this front, including this PR (which could arguably be skipped if we didn't have long resets). However, such efforts might be useful in the future if resets become long again or if other engines end up being reset.
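As a rough illustration of the "stats that survive resets" idea discussed above, the sketch below keeps the counter at the shard level and hands it to each engine instance. All names here (FlushTimeTracker, SketchEngine, recordFlush) are hypothetical and do not come from the Elasticsearch code base; a LongAdder is used as one simple way to sidestep part of the concurrency concern for a monotonic counter.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical shard-level holder that outlives any single engine instance. */
class FlushTimeTracker {
    private final LongAdder totalFlushMillis = new LongAdder();

    void add(long millis) {
        totalFlushMillis.add(millis);
    }

    long total() {
        return totalFlushMillis.sum();
    }
}

/** Hypothetical engine that reports into the shard-level tracker instead of its own field. */
class SketchEngine {
    private final FlushTimeTracker tracker;

    SketchEngine(FlushTimeTracker tracker) {
        this.tracker = tracker;
    }

    void recordFlush(long millis) {
        tracker.add(millis);
    }
}
```

With this shape, the shard can read the tracker directly while an engine is being swapped, so the value neither drops to zero nor requires access to the engine. Stats that are not simple monotonic counters would likely need more care than this sketch shows.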
…5-with-engine-null-reset
LGTM
LGTM
And deprecate old style getEngine/OrNull methods.
Apply new functionality to several methods related to non-accurate metrics that do not need to wait for the engine being reset and can do with a null engine. These pertain typically to periodic operations that can skip a shard being reset and revisit it next time.
Relates ES-11457