Skip to content

Conversation

@kingherc
Copy link
Contributor

@kingherc kingherc commented Jul 28, 2025

And deprecate old style getEngine/OrNull methods.

Apply new functionality to several methods related to non-accurate metrics that do not need to wait for the engine being reset and can do with a null engine. These pertain typically to periodic operations that can skip a shard being reset and revisit it next time.

Relates ES-11457

@kingherc kingherc self-assigned this Jul 28, 2025
@kingherc kingherc added >non-issue :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. Team:Distributed Indexing Meta label for Distributed Indexing team v9.2.0 labels Jul 28, 2025
@elasticsearchmachine elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Jul 28, 2025
@kingherc kingherc force-pushed the enhancement/ES-12215-with-engine-null-reset branch 2 times, most recently from 4593d64 to b5f0212 Compare July 28, 2025 11:03
And deprecate old style getEngine/OrNull methods.

Apply new functionality to several methods that do not need to
wait for the engine being reset and can do with a null engine.
These pertain typically to periodic operations that can skip
a shard being reset and revisit it next time.

Also return empty stats for a few stats in case the engine is
being reset. These are stats that are already returned empty
from a hollow engine.

Relates ES-11457
@kingherc kingherc force-pushed the enhancement/ES-12215-with-engine-null-reset branch from b5f0212 to ff5b13a Compare July 28, 2025 11:07
@kingherc kingherc marked this pull request as ready for review July 28, 2025 13:05
@kingherc kingherc requested review from fcofdez and tlrx July 28, 2025 13:05
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

Copy link
Contributor

@fcofdez fcofdez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit concerned about the assumptions that we're making on this patch, namely that the reset engine would be always from InternalEngine -> Hollow where returning empty stats or discarding a flush request is acceptable because they're essentially no-ops. But this might not be true anymore if we end up adopting resetEngine for other engine implementations.

@kingherc
Copy link
Contributor Author

@fcofdez do you hint we should maybe forego these changes at the moment? It might mean that some management threads or the Disk/Memory controller threads are temporarily stuck during massive resets, but it's not too bad either I guess (until we improve the situation ultimately). cc @tlrx for your opinion as well.

Copy link
Member

@tlrx tlrx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit mitigated about this change, I have the impression that we're tackling different problems with the new withEngineOrNullIfBeingReset.

As far as I understand, we have 3 situations:

  • stats or metrics that need to be reported while the engine is reset (ex: getWritingBytes(), flushStats(), indexingStats()`), which are mostly already best-effort in term of accuracy
  • actions that can be skipped during reset (flushOnIdle), which can be discussed for each case
  • actions that must be performed on a non-null engine (flush with waitIfOngoing=true, trimTranslog), which are the trickier ones to handle specially they are called on transport thread

I think we can craft something for the stats/metrics case (possible solutions could be to keep a copy of the stats for the time of the reset, or keep a reference on the engine-to-be-reset). For the skippable actions, I think something like you did in withEngineOrNullIfBeingReset can work (though I would call this tryWithEngineOrNull). For the non-skippable actions I think we need to find a solution for each case.

Happy to discuss this more

@kingherc kingherc changed the title Add withEngineOrNullIfBeingReset Add tryWithEngineOrNull Jul 30, 2025
@kingherc kingherc requested review from fcofdez and tlrx July 30, 2025 16:32
@kingherc
Copy link
Contributor Author

@tlrx and @fcofdez , updated PR to only keep the parts related to the non-accurate metrics. Feel free to review again this PR and the serverless one.

Copy link
Member

@tlrx tlrx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I left some comments.

flushMetric.count(),
periodicFlushMetric.count(),
TimeUnit.NANOSECONDS.toMillis(flushMetric.sum()),
engine != null ? engine.getTotalFlushTimeExcludingWaitingOnLockInMillis() : 0L
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly a note for myself, we could extract some stats at the shard level and pass them down to the engine instances so that they "survive" resets

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, was thinking also of that, although there may be some concurrency challenges to handle. Also, since we will soon shorten the reset period drastically, I am worried about doing any more effort on this front, including this PR (which could arguable be skipped if we didn't have long resets). However, such efforts might be useful in the future if resets become long again or there may be other engines being reset.

Copy link
Contributor Author

@kingherc kingherc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments @tlrx ! Feel free to review again. Gentle reminder for @fcofdez to review as well.

flushMetric.count(),
periodicFlushMetric.count(),
TimeUnit.NANOSECONDS.toMillis(flushMetric.sum()),
engine != null ? engine.getTotalFlushTimeExcludingWaitingOnLockInMillis() : 0L
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, was thinking also of that, although there may be some concurrency challenges to handle. Also, since we will soon shorten the reset period drastically, I am worried about doing any more effort on this front, including this PR (which could arguable be skipped if we didn't have long resets). However, such efforts might be useful in the future if resets become long again or there may be other engines being reset.

@kingherc kingherc requested a review from tlrx August 1, 2025 12:09
Copy link
Member

@tlrx tlrx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@fcofdez fcofdez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kingherc kingherc merged commit feaa580 into elastic:main Aug 1, 2025
33 checks passed
@kingherc kingherc deleted the enhancement/ES-12215-with-engine-null-reset branch August 1, 2025 14:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. >non-issue serverless-linked Added by automation, don't add manually Team:Distributed Indexing Meta label for Distributed Indexing team v9.2.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants